Developer Guide · May 9, 2026 · 8 min read

The Developer's First Backtesting Loop: Start With Evidence, Not Optimism


CuteMarkets Team

Research

Quick answer


A first serious backtesting loop should define the signal timestamp, reconstruct the tradable contract universe, price fills from observable quotes, and log every rejection before optimization begins.


Abstract

The first mistake a developer makes in trading research is usually not a bad indicator. It is building a loop that answers the wrong question. A quick script can say whether a rule would have made money, but a useful backtest has to say whether the rule could have made those decisions with the information available at the time.

CuteMarkets research has become stricter for that reason. The most useful backtesting loop is not a chart-first loop. It is a data-contract loop: define the signal timestamp, reconstruct the tradable instrument, price the entry and exit from observable market data, then record every rejection.

The Small Loop

Start with one strategy family, one symbol group, and one entry rule. Resist the urge to build a dashboard before the replay is honest. A good first loop should answer five questions.

  1. Which completed bar or event created the signal?
  2. Which contracts were actually listed then?
  3. Which quote or trade evidence existed near the entry?
  4. Which costs, spreads, and rejects were applied?
  5. Which artifact proves the run can be repeated?

That is enough for a first serious system. It gives you a result, but it also gives you a reason to distrust the result when something is missing.
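One way to keep the loop honest is to make each of the five questions a field in the trade record itself, so a missing answer is visible instead of implicit. A minimal sketch (all field names and values here are illustrative, not part of any CuteMarkets schema):

```python
from dataclasses import dataclass


@dataclass
class TradeRecord:
    """One row per attempted trade, answering the five audit questions."""
    signal_bar_ts: str      # 1. which completed bar/event created the signal
    listed_contracts: list  # 2. which contracts were actually listed then
    entry_quote: dict       # 3. which quote evidence existed near the entry
    costs: dict             # 4. which costs, spreads, and rejects were applied
    run_id: str             # 5. which artifact proves the run can be repeated


row = TradeRecord(
    signal_bar_ts="2026-05-08T15:30:00Z",
    listed_contracts=["XYZ 2026-05-15 C100"],
    entry_quote={"bid": 1.20, "ask": 1.30, "ts": "2026-05-08T15:30:02Z"},
    costs={"spread": 0.10, "fee": 0.65},
    run_id="run-0001",
)
```

If any field cannot be filled for a trade, that gap is itself a finding about the data pipeline, not something to paper over with a default.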

Define The Data Contract Before The Strategy

The useful design habit is to define the research contract before writing the clever part of the strategy. A contract is a small statement of what the simulator is allowed to know and what evidence it must preserve. For example: the signal may use completed underlying bars up to t, contract discovery must be as-of the session date, fills must use quotes observed after the signal timestamp, and the run must write selected trades plus rejected candidates.
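The contract can live in code rather than in a comment. A minimal sketch, where the causality check on fill quotes is the part worth enforcing mechanically (the class and method names are hypothetical):

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ResearchContract:
    """What the simulator is allowed to know, and what it must preserve."""
    signal_uses_completed_bars: bool = True
    discovery_as_of_session_date: bool = True
    log_rejected_candidates: bool = True

    def fill_is_causal(self, signal_ts: datetime, quote_ts: datetime) -> bool:
        # A fill priced from a quote observed before the signal is look-ahead.
        return quote_ts >= signal_ts


contract = ResearchContract()
sig = datetime(2026, 5, 8, 15, 30, tzinfo=timezone.utc)
later = datetime(2026, 5, 8, 15, 30, 2, tzinfo=timezone.utc)
earlier = datetime(2026, 5, 8, 15, 29, 58, tzinfo=timezone.utc)
```

Routing every fill through a check like `fill_is_causal` turns the contract from documentation into an assertion that fails loudly when a data join leaks future quotes.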

This contract does not need to be complicated. It needs to be explicit. Developers coming from web or backend work already understand this pattern: a service boundary is easier to maintain when input and output shapes are clear. Backtesting is the same. The signal layer produces an intent. The selection layer turns that intent into an instrument. The execution layer decides whether the instrument was tradable. The reporting layer explains what happened.

Without those boundaries, debugging becomes ambiguous. If the PnL improves, you cannot tell whether the signal became stronger, the selector started choosing better contracts, or the fill model quietly became easier. With boundaries, every improvement has a location.

Start With A Baseline That Is Intentionally Boring

A first backtest should include at least one boring baseline. Compare the strategy against a naive timing rule, a random-entry control with the same holding period, or a version that keeps the same contract selection but removes the signal trigger. The goal is not to prove that the strategy is good on day one. The goal is to learn whether the code can distinguish signal contribution from market drift and selection artifacts.

This is especially important in options. A backtest can look useful because it repeatedly selects high-convexity contracts during active sessions. That may be an expression effect rather than a signal effect. A baseline that preserves contract constraints while randomizing the entry condition is a practical way to expose that problem. If the baseline performs similarly, the signal probably deserves less credit than the chart suggests.
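A random-entry control is easy to build once the holding period is fixed. A minimal sketch, assuming sessions are indexed and the seed is recorded so the control itself is repeatable:

```python
import random


def random_entry_control(n_sessions, holding_period, n_trades, seed=0):
    """Pick entry sessions uniformly at random, keeping the same
    holding period as the real strategy. Returns (entry, exit) index
    pairs that always fit inside the session range."""
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    last_entry = n_sessions - holding_period - 1
    entries = sorted(rng.sample(range(last_entry + 1), n_trades))
    return [(i, i + holding_period) for i in entries]


pairs = random_entry_control(n_sessions=100, holding_period=5, n_trades=10)
```

Running the same contract-selection and fill logic over these pairs, instead of the signal-driven entries, gives the baseline the text describes: same expression, randomized trigger.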

The same principle applies to time windows. Run a small out-of-sample slice early, even before the full framework is polished. If the system only works on the period used to build it, that is not an argument to tune more. It is a reason to simplify the hypothesis.

Why Developers Should Avoid Last-Price Comfort

Options last prices are attractive because they are easy to fetch and easy to plot. They are also a weak execution proxy. Many contracts trade sparsely, and a last sale can describe an old market state rather than the market your strategy would have crossed.

For a developer, the better default is quote-aware replay. Use bid and ask state, quote timestamps, spread checks, and reject reasons. The result will usually be less flattering. That is not a failure. It means the simulator is starting to resemble the market surface the code would actually meet.
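The quote checks above can be collapsed into one function that returns a reason code instead of a boolean. A minimal sketch; the thresholds are illustrative placeholders, not recommendations, and timestamps are epoch seconds for brevity:

```python
def check_fill(quote, signal_ts, max_spread_frac=0.08):
    """Return (ok, reason) for a candidate fill.
    quote = {'bid': float, 'ask': float, 'ts': float}."""
    bid, ask, ts = quote["bid"], quote["ask"], quote["ts"]
    if bid <= 0 or ask <= 0 or ask < bid:
        return False, "no_usable_quote"
    if ts < signal_ts:
        return False, "stale_quote"  # quote predates the signal
    mid = (bid + ask) / 2
    if (ask - bid) / mid > max_spread_frac:
        return False, "wide_spread"
    return True, "ok"


wide = check_fill({"bid": 1.00, "ask": 1.40, "ts": 105.0}, signal_ts=100.0)
tight = check_fill({"bid": 1.20, "ask": 1.25, "ts": 101.0}, signal_ts=100.0)
```

The reason codes returned here are what make the rejects aggregable later, which is the point of the next section.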

Treat Rejects As Measurements

Rejects are not noise around the research process. They are measurements of the tradable surface. A rejected trade can tell you that the signal fired outside the liquid part of the chain, that the target DTE was unavailable, that the spread was too wide, or that the selected contract had no usable quote. Those are different conclusions, and they lead to different next experiments.

For a developer new to trading systems, this is a useful mental shift. In many product systems, an error is something to reduce. In research replay, a rejected event can be the point of the experiment. If a strategy loses half of its opportunities after realistic quote checks, that is not merely an implementation inconvenience. It is evidence about whether the idea can be expressed in the market.

Good reject logs should be structured enough to aggregate. A text blob is hard to compare across runs. A reason code such as stale_quote, wide_spread, no_listed_expiry, or contract_pool_empty lets you see whether a new branch improved the signal or only moved failures into a different bucket.
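With structured reason codes, comparing two runs is a one-liner. A minimal sketch with fabricated example rejects, just to show the aggregation shape:

```python
from collections import Counter

rejects = [
    {"reason": "wide_spread"}, {"reason": "stale_quote"},
    {"reason": "wide_spread"}, {"reason": "no_listed_expiry"},
    {"reason": "contract_pool_empty"}, {"reason": "wide_spread"},
]

by_reason = Counter(r["reason"] for r in rejects)
```

Diffing `by_reason` between the old and new run shows immediately whether a change reduced rejects overall or only shifted them from one bucket to another.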

What To Log

Log the raw inputs, the selected contract, the quote used for pricing, the rejected alternatives, and the final trade row. Then log the summary separately: return, drawdown, Sharpe, trade count, coverage, and any robustness diagnostics.

This separation matters because debugging a backtest is often about finding the first wrong assumption, not the final bad number. If a strategy improves after a code change, you need to know whether the signal improved or the simulator became more permissive.
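The separation can be as simple as writing two artifacts: the per-trade rows, and a summary derived from them. A minimal sketch with made-up trade numbers (only trade count and mean return are computed here; a real summary would add drawdown, Sharpe, and coverage):

```python
import json

trades = [
    {"entry": 1.25, "exit": 1.40, "qty": 1},
    {"entry": 0.90, "exit": 0.80, "qty": 1},
]


def summarize(trades):
    """Summary stats derived from trade rows, stored separately
    so a bad number can always be traced back to its rows."""
    returns = [(t["exit"] - t["entry"]) / t["entry"] for t in trades]
    return {
        "trade_count": len(trades),
        "mean_return": sum(returns) / len(returns),
    }


summary = summarize(trades)
rows_artifact = json.dumps(trades)      # the evidence
summary_artifact = json.dumps(summary)  # the headline
```

Because the summary is recomputed from the rows, any disagreement between the two artifacts points at a bug rather than hiding one.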

Decide What Promotion Means

The first loop does not need to choose the final strategy. It should still define what it would take to promote a result into deeper research. Promotion criteria can be simple: enough trades, enough active days, tolerable drawdown, no single-day dependency, quote rejects within an expected range, and stable behavior under nearby parameters.
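Those criteria can be encoded as a gate that reports which checks failed, not just a pass or fail. A minimal sketch; every threshold is an illustrative placeholder, not a recommendation:

```python
def meets_promotion_bar(stats, *, min_trades=50, min_active_days=30,
                        max_drawdown=0.15, max_single_day_share=0.30,
                        max_reject_rate=0.50):
    """Return (promoted, checks) where checks names each criterion,
    so a failed gate explains itself."""
    checks = {
        "enough_trades": stats["trades"] >= min_trades,
        "enough_days": stats["active_days"] >= min_active_days,
        "drawdown_ok": stats["max_drawdown"] <= max_drawdown,
        "no_single_day_dependency":
            stats["best_day_share"] <= max_single_day_share,
        "reject_rate_ok": stats["reject_rate"] <= max_reject_rate,
    }
    return all(checks.values()), checks


promoted, checks = meets_promotion_bar({
    "trades": 120, "active_days": 60, "max_drawdown": 0.10,
    "best_day_share": 0.22, "reject_rate": 0.35,
})
```

Returning the full `checks` dict makes closing a weak branch a documented decision instead of a judgment call remembered differently by everyone involved.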

Those criteria protect the developer from the most common failure mode: treating the best row in a parameter grid as an answer. In a scientific workflow, the best row is a candidate observation. It becomes more meaningful only if neighboring rows tell a similar story, if the execution assumptions remain constant, and if the result survives a held-out period.

The point is not to make the first project bureaucratic. The point is to prevent early enthusiasm from rewriting the evidence. A clear promotion rule makes it easier to close weak branches without feeling that useful work was wasted.

Takeaway

The developer's first backtesting loop should be small, causal, and auditable. Start with the evidence chain before expanding the idea set. Once that loop is honest, optimization becomes useful. Before that, optimization mostly makes weak assumptions look precise.

FAQ

Related questions

What should a developer build first in a backtesting project?

Build a small causal replay loop with timestamped signals, point-in-time contract selection, quote-aware fills, and auditable trade logs.