Research Series · March 8, 2026 · 3 min read

Episode 3: The Simulator Audit

CuteMarkets Team

Scope

This episode is anchored in backtesting_framework_issue_summary_20260308.md.

Unlike the first two episodes, the evidence here is explicit and direct. This was the week when the repository named the places where the framework was overstating confidence and then patched them.

Result Snapshot

Five patched issues changed the scientific meaning of the repo's results:

| Issue | Why it mattered |
| --- | --- |
| contract selection cache ignored the relevant underlying price bucket | wrong strike could be silently reused |
| stop_touch used same-bar information | same-bar lookahead in momentum/event paths |
| overnight MR used entry-bar full state | another same-bar leakage path |
| combined Sharpe and Sortino flattened per-symbol returns | portfolio risk was overstated |
| top-level PBO and DSR used the wrong fold granularity | robustness selection was misaligned |

This was not cosmetic work. These are exactly the kinds of mistakes that can make a strategy look stable when it is merely benefiting from information leakage or bad aggregation.
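To make the first issue in the table concrete, here is a minimal sketch of a contract cache whose key includes the underlying price bucket. Everything in it is an assumption for illustration: `ContractCache`, `price_bucket`, `nearest_strike_selector`, the 1-dollar bucket width, and the 5-point strike grid are hypothetical and are not taken from the repository.

```python
from typing import Callable, Dict, Tuple

# Hypothetical bucket width in dollars; the framework's real bucketing is not shown in the audit.
BUCKET_WIDTH = 1.0

Selector = Callable[[str, str, float], float]


def price_bucket(underlying_price: float, width: float = BUCKET_WIDTH) -> int:
    """Map an underlying price to an integer bucket index."""
    return int(underlying_price // width)


class ContractCache:
    """Cache selected strikes, keyed on symbol, expiry, AND price bucket."""

    def __init__(self, selector: Selector) -> None:
        self._selector = selector
        self._cache: Dict[Tuple[str, str, int], float] = {}

    def select(self, symbol: str, expiry: str, underlying_price: float) -> float:
        # Leaving the bucket out of the key is the failure mode described above:
        # a strike chosen at one price level would be reused at another.
        key = (symbol, expiry, price_bucket(underlying_price))
        if key not in self._cache:
            self._cache[key] = self._selector(symbol, expiry, underlying_price)
        return self._cache[key]


def nearest_strike_selector(symbol: str, expiry: str, underlying_price: float) -> float:
    """Toy selector: the strike nearest to the underlying on a 5-point grid."""
    return 5.0 * round(underlying_price / 5.0)
```

With this key, a lookup at 500.20 and a later lookup at 512.70 for the same symbol and expiry land in different buckets, so the second call reselects a strike instead of silently reusing the first one.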

Each issue also distorts a different layer of inference. Wrong contract-cache reuse changes the instrument being tested. Same-bar lookahead changes the information set that the signal is allowed to use. Flattened per-symbol daily returns distort the portfolio estimator itself. Misaligned use of PBO (probability of backtest overfitting) and DSR (deflated Sharpe ratio) contaminates the selection procedure that determines which profile is allowed to look "robust." In other words, these were not all the same category of bug. They attacked the validity of the conclusions from multiple angles at once.
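The estimator problem is easy to see on toy numbers. The sketch below compares a Sharpe ratio computed on flattened symbol-day observations with one computed on daily PnL summed across symbols. The symbols `AAA` and `BBB`, the PnL values, the zero risk-free rate, and the 252-day annualization are all assumptions made for illustration, not data or conventions from the repo.

```python
import math
from statistics import mean, stdev
from typing import Dict, List


def sharpe(pnl: List[float], periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio, assuming a zero risk-free rate."""
    return mean(pnl) / stdev(pnl) * math.sqrt(periods_per_year)


# Hypothetical daily PnL in dollars for two symbols on the same four dates.
pnl_by_symbol: Dict[str, List[float]] = {
    "AAA": [100.0, -50.0, 80.0, -20.0],
    "BBB": [-90.0, 60.0, -70.0, 40.0],
}

# Flattened: every symbol-day is treated as a separate observation,
# so cross-symbol offsets within a day are invisible to the estimator.
flattened = [x for series in pnl_by_symbol.values() for x in series]

# Aggregated: one portfolio observation per calendar day.
aggregated = [sum(day) for day in zip(*pnl_by_symbol.values())]

flattened_sharpe = sharpe(flattened)
aggregated_sharpe = sharpe(aggregated)
```

In this made-up case the two symbols largely offset each other, so the aggregated series is far less volatile than the flattened one and the two Sharpe values differ sharply. The direction and size of the gap depend on how the symbols co-move, which is exactly why the combined metric has to come from the aggregated series.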

The Hard Truth

The repo did something many research codebases avoid: it made the simulator less flattering on purpose.

Behavior changes recorded in the audit included:

  • stop_touch now means signal on bar t, enter on bar t+1
  • overnight MR only uses prior completed bars
  • combined Sharpe and Sortino come from real aggregated daily PnL
  • PBO and DSR diagnostics are split correctly between dashboard and selection scenarios
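The first change in the list, signal on bar t and entry on bar t+1, can be sketched as follows. This is a simplified illustration, not the repo's stop_touch logic: the `Bar` type, the `next_bar_entries` helper, the "high reaches trigger" touch rule, and the fill at the next bar's open are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Bar:
    open: float
    high: float
    low: float
    close: float


def next_bar_entries(bars: List[Bar], trigger: float) -> List[Optional[float]]:
    """Return the entry price for each bar index, or None when no entry occurs.

    A touch is detected on bar t from that bar's completed high,
    and the fill is booked at the open of bar t+1 (an assumed fill rule).
    """
    entries: List[Optional[float]] = [None] * len(bars)
    for t in range(len(bars) - 1):  # the last bar has no t+1 to enter on
        if bars[t].high >= trigger:
            entries[t + 1] = bars[t + 1].open
    return entries


# Hypothetical bars: the trigger at 102.0 is first touched on bar index 1.
bars = [
    Bar(open=100.0, high=101.0, low=99.0, close=100.5),
    Bar(open=100.5, high=103.0, low=100.0, close=102.5),
    Bar(open=102.8, high=104.0, low=102.0, close=103.0),
]
entries = next_bar_entries(bars, trigger=102.0)
```

The touch on bar 1 produces an entry on bar 2 at that bar's open, never on bar 1 itself. A same-bar version would fill inside bar 1 using a high that is only known once the bar has closed.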

That means some old excitement had to be discounted. The repo implicitly accepted that cost.

What Worked

What worked was not a specific model. What worked was the willingness to treat metric integrity as a production issue.

The test coverage added in the audit matters for that reason. The repo did not just patch the behavior. It also wrote regressions around:

  • cached contract universes
  • next-bar stop-touch entry semantics
  • prior-bar overnight MR semantics
  • combined-day risk aggregation
  • combined-fold PBO and DSR usage

If you want to build in public credibly, this is how you do it. You show not just the performance chart, but the list of assumptions you found unsafe and the tests you added so they do not quietly come back.
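A regression of that kind might look like the sketch below, written for the prior-bar overnight MR case. It is a hypothetical example rather than a test from the repo: `prior_bar_signal`, its three-bar lookback, and its scoring rule are invented for illustration, and MR is assumed to mean mean reversion. The pattern is the part that carries over: tamper with the entry bar and assert that the signal does not move.

```python
from statistics import mean
from typing import List


def prior_bar_signal(closes: List[float], t: int, lookback: int = 3) -> float:
    """Hypothetical mean-reversion score for an entry on bar t.

    Reads only closes[t - lookback : t], i.e. bars completed before bar t.
    """
    if t < lookback:
        raise ValueError("not enough completed bars before t")
    window = closes[t - lookback : t]
    return window[-1] - mean(window)


def test_signal_ignores_entry_bar() -> None:
    closes = [100.0, 101.0, 99.0, 102.0, 98.0]
    t = 4
    baseline = prior_bar_signal(closes, t)
    # Change the entry bar's close; a leak-free signal cannot react to it.
    tampered = closes[:t] + [closes[t] + 50.0]
    assert prior_bar_signal(tampered, t) == baseline


test_signal_ignores_entry_bar()
```

If a later refactor widens the slice to include bar t, the tampered close changes the score and the test fails, which is the point of keeping it in the suite.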

What Did Not Work

The negative result is unavoidable: some previously reported strength, especially in intraday options paths, must be treated as lower-confidence once these fixes are in place.

That is not a failure of the audit. That is the success condition of the audit.

The repo also left one item intentionally unresolved: the default fill-model mismatch between orb_confluence and orb_conviction. That restraint is scientifically useful. It distinguishes between:

  • bugs that should be fixed immediately
  • defaults that need an explicit product-level decision

That distinction is part of the style this project should keep publicly. A scientific writeup does not need to present the codebase as fully settled. It needs to separate known implementation defects from open design choices. The first category invalidates evidence if left unresolved. The second category changes the interpretation of evidence and therefore has to be documented, not silently normalized.

Why This Week Matters

This is the week the project stopped being only a strategy playground and became a measurement system with standards.

If we keep the One Piece analogy mild, this is the episode where the crew checks whether the compass itself is broken. You do not hunt treasure with a lying compass.

Public Build Takeaway

This episode should be published with no defensiveness. It is one of the strongest credibility signals in the whole repo.

The public lessons are:

  • the fastest path to fake alpha is sloppy measurement
  • bug-fix posts are not side content; they are core research content
  • if the audit makes your earlier results weaker, that is progress

Any audience worth building will respect this episode more than a polished chart with hidden leakage.