CuteMarkets Docs

Backtesting Framework

Framework guides for engineers building realistic options backtests with causal data, quote-aware fills, and robust validation.

Tip: open /docs/backtesting-robustness.md directly for raw markdown (easy copy/paste into an LLM).

Backtesting Robustness

Robustness testing asks whether a strategy still looks credible when the framework stops optimizing on the same data used for evaluation. In options research this matters because small sample sizes, changing liquidity, and many profile variants can produce attractive but fragile results.

Walk-forward validation

A basic walk-forward process splits time into train and out-of-sample windows:

  1. Generate candidate profiles on the training window.
  2. Select profiles using predeclared metrics and gates.
  3. Run the selected profile on the next out-of-sample window.
  4. Move the window forward and repeat.
  5. Aggregate only the out-of-sample trades for final evaluation.
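The five steps above can be sketched as a small rolling loop. This is a minimal sketch, not the framework's actual API: `generate_profiles`, `select_best`, and `run_backtest` are hypothetical placeholders for your own research code.

```python
def walk_forward(days, train_len, test_len,
                 generate_profiles, select_best, run_backtest):
    """Roll a train window forward and collect only out-of-sample trades."""
    oos_trades = []
    start = 0
    while start + train_len + test_len <= len(days):
        train = days[start:start + train_len]
        test = days[start + train_len:start + train_len + test_len]
        candidates = generate_profiles(train)          # 1. candidates from train only
        chosen = select_best(candidates, train)        # 2. predeclared metrics and gates
        oos_trades.extend(run_backtest(chosen, test))  # 3. run on the next OOS window
        start += test_len                              # 4. advance and repeat
    return oos_trades                                  # 5. OOS-only aggregate
```

Note that the test window of one iteration becomes part of the train window of the next, so no day is ever used for selection before it has been scored out of sample.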

The simulator rules must stay fixed between train and test. Changing fill rules, DTE windows, or contract availability after seeing a test result invalidates the comparison.

Holdout and nested selection

Use a holdout window when a family has already been explored heavily. Use nested selection when many profiles compete inside each outer fold. The inner loop chooses a profile; the outer loop measures what that choice would have done out of sample.

This discipline is slower than one big backtest, but it answers a better question: "Would the process have selected something useful before seeing the future?"
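The inner/outer structure can be sketched as follows. This is an illustrative skeleton under stated assumptions: `score` and the fold layout are placeholders, and profile ranking is reduced to a single metric for brevity.

```python
def nested_select(outer_folds, profiles, score):
    """For each outer fold, choose a profile on inner data only,
    then evaluate that choice on the unseen outer test window."""
    results = []
    for inner_days, outer_test_days in outer_folds:
        # Inner loop: rank competing profiles using only the inner window.
        chosen = max(profiles, key=lambda p: score(p, inner_days))
        # Outer loop: record what that choice would have done out of sample.
        results.append((chosen, score(chosen, outer_test_days)))
    return results
```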

Portfolio metrics

When multiple symbols can trade on the same day, aggregate PnL by calendar day before computing portfolio risk metrics. Do not flatten per-symbol daily returns into separate fake days.

The trade log can have many rows per day. The risk series should represent what the portfolio experienced:

```python
from collections import defaultdict

# Collapse the trade log into one realized PnL number per calendar day.
daily_pnl = defaultdict(float)
for trade in trade_log:
    daily_pnl[trade["entry_day"]] += trade["pnl"]

# Risk metrics should run on this series, in calendar order.
daily_returns = [
    pnl / initial_equity
    for day, pnl in sorted(daily_pnl.items())
]
```

This prevents multi-symbol research from looking smoother than the actual calendar path.

Diagnostics

Useful robustness diagnostics include:

| Diagnostic | Purpose |
| --- | --- |
| Out-of-sample return | Measures the result after selection. |
| Sharpe and Sortino | Measure daily return quality, preferably from realized portfolio days. |
| Max drawdown | Measures path risk. |
| Trade count and trades per week | Prevents sparse, lucky profiles from dominating. |
| Coverage ratio | Shows how often the framework had enough data to test. |
| PBO | Estimates the probability of backtest overfitting across profile combinations. |
| Deflated Sharpe | Adjusts a Sharpe-like result for multiple testing and non-normality. |
| Overlap | Checks whether a new profile is just a duplicate of an existing one. |
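Two of these diagnostics are easy to compute directly from the realized portfolio-day return series. A minimal sketch, assuming a zero risk-free rate and 252 trading days per year:

```python
import math

def sharpe(daily_returns, periods_per_year=252):
    """Annualized Sharpe ratio from realized portfolio-day returns."""
    n = len(daily_returns)
    mean = sum(daily_returns) / n
    var = sum((r - mean) ** 2 for r in daily_returns) / (n - 1)
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)

def max_drawdown(daily_returns):
    """Worst peak-to-trough decline of the compounded equity curve."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst
```

Feeding these the per-symbol series instead of the calendar-day series is exactly the mistake the aggregation step above guards against.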

Promotion gates

Treat gates as research controls, not marketing hurdles. A profile can be profitable and still fail if it has too few trades, poor data coverage, high drawdown, unstable folds, or excessive overlap with a stronger profile.
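A gate check can be expressed as a table of named predicates so that a rejection reports which controls failed, not just that the profile lost. The thresholds and metric field names below are illustrative assumptions, not the framework's actual schema.

```python
# Hypothetical gate table: each entry maps a gate name to a pass predicate.
GATES = {
    "min_trades": lambda m: m["trade_count"] >= 100,
    "min_coverage": lambda m: m["coverage_ratio"] >= 0.8,
    "max_drawdown": lambda m: m["max_drawdown"] <= 0.25,
    "max_overlap": lambda m: m["overlap"] <= 0.5,
}

def failed_gates(metrics):
    """Return the names of every gate the profile fails (empty = promotable)."""
    return [name for name, check in GATES.items() if not check(metrics)]
```

Keeping the gates declarative makes them easy to predeclare before the walk-forward run, which is what makes them research controls rather than post-hoc filters.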

A useful summary should include both winners and blockers:

  • selected profile
  • rejected finalists
  • failed checks
  • fold-by-fold metrics
  • option availability diagnostics
  • execution rejection counts
  • final trade-level export
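One possible shape for such a summary is a single dictionary covering winners and blockers alike. Every field name and value here is a made-up example, not the framework's export format.

```python
# Illustrative run summary; all names and values are hypothetical.
summary = {
    "selected_profile": "put_spread_45dte_v3",
    "rejected_finalists": ["put_spread_30dte_v1"],
    "failed_checks": {"put_spread_30dte_v1": ["min_trades"]},
    "fold_metrics": [{"fold": 1, "oos_return": 0.04, "trades": 18}],
    "option_availability": {"days_with_chain": 0.93},
    "execution_rejections": {"no_quote": 12, "stale_quote": 4},
    "trade_export": "runs/example/trades.csv",
}
```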

Read next: Backtesting Test Plan, Backtesting Execution Realism, and Strategy Robustness Testing.
