CuteMarkets Docs

Backtesting Framework

Framework guides for engineers building realistic options backtests with causal data, quote-aware fills, and robust validation.

Tip: open /docs/backtesting-robustness.md directly for raw markdown (easy copy/paste into an LLM).

Backtesting Robustness

Robustness testing asks whether a strategy still looks credible when the framework stops optimizing on the same data used for evaluation. In options research this matters because small sample sizes, changing liquidity, and many profile variants can produce attractive but fragile results.

Walk-forward validation

A basic walk-forward process splits time into train and out-of-sample windows:

  1. Generate candidate profiles on the training window.
  2. Select profiles using predeclared metrics and gates.
  3. Run the selected profile on the next out-of-sample window.
  4. Move the window forward and repeat.
  5. Aggregate only the out-of-sample trades for final evaluation.
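The five steps above can be sketched as a small rolling loop. This is a minimal sketch, not the framework's actual API: `generate_profiles`, `select_best`, and `run_backtest` are hypothetical placeholders for your own research code.

```python
def walk_forward(days, train_len, test_len,
                 generate_profiles, select_best, run_backtest):
    """Roll a train window forward and collect only out-of-sample trades."""
    oos_trades = []
    start = 0
    while start + train_len + test_len <= len(days):
        train = days[start:start + train_len]
        test = days[start + train_len:start + train_len + test_len]
        candidates = generate_profiles(train)          # 1. candidates from train only
        chosen = select_best(candidates, train)        # 2. predeclared metrics and gates
        oos_trades.extend(run_backtest(chosen, test))  # 3. run on the next OOS window
        start += test_len                              # 4. advance and repeat
    return oos_trades                                  # 5. OOS-only aggregate
```

Note that the test window of one iteration becomes part of the train window of the next, so no day is ever used for selection before it has been scored out of sample.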

The simulator rules must stay fixed between train and test. Changing fill rules, DTE windows, or contract availability after seeing a test result invalidates the comparison.

Holdout and nested selection

Use a holdout window when a family has already been explored heavily. Use nested selection when many profiles compete inside each outer fold. The inner loop chooses a profile; the outer loop measures what that choice would have done out of sample.

This discipline is slower than one big backtest, but it answers a better question: "Would the process have selected something useful before seeing the future?"
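The inner/outer structure can be sketched as follows. This is an illustrative skeleton under stated assumptions: `score` and the fold layout are placeholders, and profile ranking is reduced to a single metric for brevity.

```python
def nested_select(outer_folds, profiles, score):
    """For each outer fold, choose a profile on inner data only,
    then evaluate that choice on the unseen outer test window."""
    results = []
    for inner_days, outer_test_days in outer_folds:
        # Inner loop: rank competing profiles using only the inner window.
        chosen = max(profiles, key=lambda p: score(p, inner_days))
        # Outer loop: record what that choice would have done out of sample.
        results.append((chosen, score(chosen, outer_test_days)))
    return results
```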

Portfolio metrics

When multiple symbols can trade on the same day, aggregate PnL by calendar day before computing portfolio risk metrics. Do not flatten per-symbol daily returns into separate fake days.

The trade log can have many rows per day. The risk series should represent what the portfolio experienced:

```python
from collections import defaultdict

# Collapse the trade log into one realized PnL number per calendar day.
daily_pnl = defaultdict(float)
for trade in trade_log:
    daily_pnl[trade["entry_day"]] += trade["pnl"]

# Risk metrics should run on this series, in calendar order.
daily_returns = [
    pnl / initial_equity
    for day, pnl in sorted(daily_pnl.items())
]
```

This prevents multi-symbol research from looking smoother than the actual calendar path.

Diagnostics

Useful robustness diagnostics include:

| Diagnostic | Purpose |
| --- | --- |
| Out-of-sample return | Measures the result after selection. |
| Sharpe and Sortino | Measure daily return quality, preferably from realized portfolio days. |
| Max drawdown | Measures path risk. |
| Trade count and trades per week | Prevents sparse, lucky profiles from dominating. |
| Coverage ratio | Shows how often the framework had enough data to test. |
| PBO | Estimates the probability of backtest overfitting across profile combinations. |
| Deflated Sharpe | Adjusts a Sharpe-like result for multiple testing and non-normality. |
| Overlap | Checks whether a new profile is just a duplicate of an existing one. |
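Two of these diagnostics are easy to compute directly from the realized portfolio-day return series. A minimal sketch, assuming a zero risk-free rate and 252 trading days per year:

```python
import math

def sharpe(daily_returns, periods_per_year=252):
    """Annualized Sharpe ratio from realized portfolio-day returns."""
    n = len(daily_returns)
    mean = sum(daily_returns) / n
    var = sum((r - mean) ** 2 for r in daily_returns) / (n - 1)
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)

def max_drawdown(daily_returns):
    """Worst peak-to-trough decline of the compounded equity curve."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst
```

Feeding these the per-symbol series instead of the calendar-day series is exactly the mistake the aggregation step above guards against.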

Promotion gates

Treat gates as research controls, not marketing hurdles. A profile can be profitable and still fail if it has too few trades, poor data coverage, high drawdown, unstable folds, or excessive overlap with a stronger profile.
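A gate check can be expressed as a table of named predicates so that a rejection reports which controls failed, not just that the profile lost. The thresholds and metric field names below are illustrative assumptions, not the framework's actual schema.

```python
# Hypothetical gate table: each entry maps a gate name to a pass predicate.
GATES = {
    "min_trades": lambda m: m["trade_count"] >= 100,
    "min_coverage": lambda m: m["coverage_ratio"] >= 0.8,
    "max_drawdown": lambda m: m["max_drawdown"] <= 0.25,
    "max_overlap": lambda m: m["overlap"] <= 0.5,
}

def failed_gates(metrics):
    """Return the names of every gate the profile fails (empty = promotable)."""
    return [name for name, check in GATES.items() if not check(metrics)]
```

Keeping the gates declarative makes them easy to predeclare before the walk-forward run, which is what makes them research controls rather than post-hoc filters.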

A useful summary should include both winners and blockers:

  • selected profile
  • rejected finalists
  • failed checks
  • fold-by-fold metrics
  • option availability diagnostics
  • execution rejection counts
  • final trade-level export
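One possible shape for such a summary is a single dictionary covering winners and blockers alike. Every field name and value here is a made-up example, not the framework's export format.

```python
# Illustrative run summary; all names and values are hypothetical.
summary = {
    "selected_profile": "put_spread_45dte_v3",
    "rejected_finalists": ["put_spread_30dte_v1"],
    "failed_checks": {"put_spread_30dte_v1": ["min_trades"]},
    "fold_metrics": [{"fold": 1, "oos_return": 0.04, "trades": 18}],
    "option_availability": {"days_with_chain": 0.93},
    "execution_rejections": {"no_quote": 12, "stale_quote": 4},
    "trade_export": "runs/example/trades.csv",
}
```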

Read next: Backtesting Test Plan, Backtesting Execution Realism, and Strategy Robustness Testing.
