Building a Portfolio of Trading Models: Why One Good Backtest Is Not Enough

Daniel Ratke
Research & Engineering

Term map
Backtesting vocabulary for this article
Treat signal timestamp, point-in-time universe, quote-aware fill, reject reason, replay artifact, walk-forward test, and cache key as first-class terms. They separate reproducible research from a backtest that only preserves the final performance table.
Definitions for point-in-time contracts, quote-aware fills, reject reasons, replay artifact, cache key, signal timestamp, look-ahead leakage, walk-forward test, slippage model, same-bar fill, promotion gate, and options data API appear in the Terminology section at the end of this article.
Repository reference: cutebacktests
Abstract
One good backtest is not a portfolio. It is at most a candidate input to a portfolio. That distinction became one of the central conclusions of this repository over the last two months, and it changed the research objective itself.
The transition is explicit in Episode 5 and PAPER_BOTS.md. The project moved away from optimizing a broad standalone ORB search and toward building a small diversified paper-bot portfolio. That is a much harder problem because a strategy now has to be credible both on its own and in the company of the other sleeves.
Question
The practical question is not whether one model can make money. The useful question is whether a set of models can coexist with low enough overlap and high enough credibility to justify a live portfolio.
That is why portfolio thinking changes the gate. A strategy that looks convincing in isolation may still be a weak portfolio addition if it overlaps too much with the current anchor, fails offset tests, or is too sparse to pull its weight.
Method: Why a Portfolio of Trading Models Needs More Than Raw PnL
This repository's roadmap now frames the task directly. In paper_bot_portfolio_r1/roadmap.json, the goal is to build a small diversified paper-bot portfolio rather than to keep searching broadly for a single ORB winner.
That shift changes the evaluation object. Instead of asking "which branch has the nicest backtest," the repo now asks things like:
- does the branch survive under stress scenarios?
- does it trade often enough to matter?
- does it overlap too much with current leaders?
- does it offset drawdowns in the right regimes?
- is the option path clean enough to operate?
This is why c4's gate included orb_overlap_days, c66_overlap_days, and offset_ratio_on_orb_down_days. Those are portfolio questions, not isolated backtest questions.
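To make those portfolio questions concrete, here is a minimal sketch of how overlap days and an offset ratio could be computed from per-day PnL. The repository's actual definitions are not shown in this article, so the function names, the definitions, and the sample values below are assumptions for illustration only: overlap is counted as days on which both sleeves traded, and the offset ratio is the candidate's PnL on the anchor's losing days divided by the size of the anchor's losses on those days.

```python
from datetime import date
from typing import Dict, Optional

DailyPnl = Dict[date, float]  # one entry per day on which the sleeve traded


def overlap_days(pnl_a: DailyPnl, pnl_b: DailyPnl) -> int:
    """Count days on which both sleeves traded (assumed definition)."""
    return len(pnl_a.keys() & pnl_b.keys())


def offset_ratio_on_down_days(candidate: DailyPnl, anchor: DailyPnl) -> Optional[float]:
    """Candidate PnL on the anchor's losing days, as a fraction of the anchor's loss.

    Returns None when the anchor has no losing days, because the ratio is undefined.
    Days on which the candidate did not trade contribute zero.
    """
    down_days = [day for day, pnl in anchor.items() if pnl < 0]
    anchor_loss = -sum(anchor[day] for day in down_days)
    if anchor_loss == 0:
        return None
    candidate_pnl = sum(candidate.get(day, 0.0) for day in down_days)
    return candidate_pnl / anchor_loss


# Illustrative values only, not results from the repository.
anchor_pnl = {date(2024, 1, 2): -100.0, date(2024, 1, 3): 50.0, date(2024, 1, 4): -100.0}
candidate_pnl = {date(2024, 1, 2): 50.0, date(2024, 1, 4): -10.0, date(2024, 1, 5): 20.0}

print(overlap_days(candidate_pnl, anchor_pnl))
print(offset_ratio_on_down_days(candidate_pnl, anchor_pnl))
```

A high overlap count with a low or negative offset ratio would suggest the candidate mostly duplicates the anchor's exposure rather than diversifying it.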
Evidence / Results
The current practical order from PAPER_BOTS.md is:
1. c66_strict_parity_paper_bot_r1
2. c4_open_paper_candidate_r1
3. c36_open_paper_candidate_r1
That order already implies portfolio thinking. c66 is first because it has the strongest current deployable evidence: a base out-of-sample return of 19.18%, a stress-medium return of 16.70%, a stress-harsh return of 15.56%, and 76 out-of-sample trades across all scenarios. c36 stays below it because the quality branch is too sparse. c4 remained interesting but was still parked because the overlap and feasibility bar remained too demanding.
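A stress check of this kind can be sketched as a small gate over per-scenario results. The c66 return figures below come from the paragraph above; the class, the function names, and the thresholds are hypothetical, and reading "76 out-of-sample trades across all scenarios" as 76 trades in each scenario is an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ScenarioResult:
    name: str
    oos_return_pct: float
    oos_trades: int


def survives_all_scenarios(results: List[ScenarioResult],
                           min_return_pct: float,
                           min_trades: int) -> bool:
    """True only if every scenario clears both the return and the trade-count bar."""
    return all(
        r.oos_return_pct >= min_return_pct and r.oos_trades >= min_trades
        for r in results
    )


def stress_decay_pct(results: List[ScenarioResult]) -> float:
    """Gap between the best and worst scenario return, in percentage points."""
    returns = [r.oos_return_pct for r in results]
    return max(returns) - min(returns)


# Figures quoted in the article for c66; trade count per scenario is an assumed reading.
c66 = [
    ScenarioResult("base", 19.18, 76),
    ScenarioResult("stress_medium", 16.70, 76),
    ScenarioResult("stress_harsh", 15.56, 76),
]

# Hypothetical thresholds, not the repository's actual gate values.
print(survives_all_scenarios(c66, min_return_pct=0.0, min_trades=30))
print(round(stress_decay_pct(c66), 2))
```

Requiring every scenario to pass, rather than averaging across them, keeps a strong base case from masking a failure under harsher fill assumptions.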
The QQQ dispersion sleeve then sits behind the formal ladder as research_only, even after positive-looking results such as qqq_single_base with 9 trades and +44537.92. That is a useful reminder that strong numbers on thin samples are not enough to claim portfolio membership.
What Worked
What worked was the change in objective function. Once the repo stopped treating the problem as "find the best ORB" and started treating it as "build a low-overlap set of believable sleeves," the research became more coherent. Strategies could now be classified by role: lead paper bot, backup candidate, parked near-miss, research-only sleeve.
This is one reason the current state of the repo is more interesting than the earlier broader search. The list of survivors is smaller, but the roles are clearer.
What Failed
What failed was the earlier hope that one family would dominate and scale. The ORB audit, the later roadmap, and the c4 gate all point away from that conclusion. A strong isolated backtest did not solve the real problem. The real problem was assembling a group of sleeves that could coexist under realism, parity, overlap, and deployability constraints.
That is a valuable negative result because many public research threads end at the first green chart. Portfolio construction begins exactly where that kind of content usually stops.
Takeaway
A portfolio of trading models needs more than one attractive backtest. It needs a set of branches that survive individually and make sense together. This repo's recent work is valuable because it now evaluates models under that higher standard.
If you want the state-of-the-journey summary, "The One Piece of Sharpe: What Months of Intraday Options Backtesting Actually Taught Us" is the capstone. If you want the methodology behind public reporting, "Algorithmic Trading Research Log: How to Build in Public Without Hiding Failed Results" explains the publishing philosophy.
How the terminology applies
When building a portfolio of trading models, the backtesting workflow should treat point-in-time contracts, quote-aware fills, reject reasons, the replay artifact, the cache key, and the signal timestamp as operational state rather than glossary decoration. That framing keeps the research claim causal: the strategy can only select instruments, prices, and labels that existed at the decision time.
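As an illustration of that causal constraint, the sketch below filters a contract list down to what was visible at a signal timestamp. The `Contract` fields, the symbols, and the dates are hypothetical; a real provider's contract records will have their own field names.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import List


@dataclass(frozen=True)
class Contract:
    symbol: str        # placeholder identifier, not a real OCC symbol
    listed_on: date    # first date the contract existed
    expires_on: date   # last date the contract can trade


def visible_universe(contracts: List[Contract], signal_ts: datetime) -> List[Contract]:
    """Keep only contracts that were listed and unexpired on the signal date."""
    as_of = signal_ts.date()
    return [c for c in contracts if c.listed_on <= as_of <= c.expires_on]


# Illustrative contracts only.
contracts = [
    Contract("OPT_A", date(2024, 1, 2), date(2024, 3, 15)),   # visible
    Contract("OPT_B", date(2024, 2, 20), date(2024, 4, 19)),  # listed after the signal
    Contract("OPT_C", date(2023, 11, 1), date(2024, 1, 19)),  # expired before the signal
]

signal_ts = datetime(2024, 2, 1, 9, 45)
print([c.symbol for c in visible_universe(contracts, signal_ts)])
```

Selecting from the filtered list, rather than from the full history, is what prevents a backtest from picking a contract that had not been listed yet.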
A developer implementing this idea should persist the look-ahead leakage checks, walk-forward test windows, slippage model, same-bar fill policy, promotion gate, and options data API source beside the result, instead of leaving those words in a term card. Doing so turns attractive performance into an auditable record where fills, skips, thresholds, and replay inputs can be challenged independently.
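One way to make fills and skips independently challengeable is to have the fill logic return either a price or a named reject reason. This is a minimal sketch for a buy order only; the reason labels follow the reject reasons listed in this article's terminology (stale quote, wide spread, no bid, missing data), while the class names, function name, and threshold defaults are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Quote:
    bid: float
    ask: float
    age_seconds: float  # time between the quote and the signal timestamp


@dataclass(frozen=True)
class FillDecision:
    price: Optional[float]          # None when the fill is rejected
    reject_reason: Optional[str]    # None when the fill is accepted


def decide_buy_fill(quote: Optional[Quote],
                    max_age_seconds: float = 5.0,
                    max_spread_pct: float = 0.10) -> FillDecision:
    """Fill a buy at the ask, or return the first reason the quote is unusable."""
    if quote is None:
        return FillDecision(None, "missing_data")
    if quote.bid <= 0:
        return FillDecision(None, "no_bid")
    if quote.age_seconds > max_age_seconds:
        return FillDecision(None, "stale_quote")
    midpoint = (quote.bid + quote.ask) / 2
    if (quote.ask - quote.bid) / midpoint > max_spread_pct:
        return FillDecision(None, "wide_spread")
    # Buying at the ask is the conservative side-specific assumption.
    return FillDecision(quote.ask, None)


print(decide_buy_fill(Quote(bid=1.00, ask=1.04, age_seconds=1.0)))
print(decide_buy_fill(Quote(bid=1.00, ask=1.50, age_seconds=1.0)))
```

Logging the `FillDecision` for every candidate, accepted or not, lets a reviewer count rejects by reason instead of inferring them from missing trades.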
The review artifact becomes more useful when the OPRA-originating data source, OCC option symbol, bid/ask spread, midpoint, quote/trade condition, and quote vs trade semantics appear in the same body of evidence as the selected rows. When a result is promoted, these fields should appear in the run manifest rather than only in a prose summary or final equity curve.
In production notes for this backtesting workflow, the REST snapshot, WebSocket stream, entitlement gate, quote freshness, timestamp semantics, and pagination cursor define the checks that decide whether the workflow is reproducible. The result is a backtest that can be rerun, compared across threshold families, and rejected when the evidence is not strong enough.
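Of those checks, pagination completeness is the easiest to sketch. The example below drains a cursor-paginated source and fails loudly instead of returning a partial result. The `fetch_page` callable and its `(rows, next_cursor)` return shape are assumptions for illustration, not the interface of any specific API.

```python
from typing import Callable, Dict, List, Optional, Tuple

Page = Tuple[List[Dict], Optional[str]]  # (rows, next_cursor); None means last page


def fetch_all(fetch_page: Callable[[Optional[str]], Page],
              max_pages: int = 1000) -> List[Dict]:
    """Follow cursors until the source reports no next page.

    Raises RuntimeError on a repeated cursor or when max_pages is exhausted,
    so an incomplete window never reaches the backtest silently.
    """
    rows: List[Dict] = []
    seen_cursors = set()
    cursor: Optional[str] = None
    for _ in range(max_pages):
        page_rows, next_cursor = fetch_page(cursor)
        rows.extend(page_rows)
        if next_cursor is None:
            return rows
        if next_cursor in seen_cursors:
            raise RuntimeError("cursor repeated; pagination is looping")
        seen_cursors.add(next_cursor)
        cursor = next_cursor
    raise RuntimeError("max_pages exhausted; result may be incomplete")


# A stand-in data source with three pages, used here in place of a real request.
fake_pages = {
    None: ([{"id": 1}], "a"),
    "a": ([{"id": 2}], "b"),
    "b": ([{"id": 3}], None),
}

all_rows = fetch_all(lambda cursor: fake_pages[cursor])
print(len(all_rows))
```

Passing the page fetcher in as a callable keeps the completeness check testable without network access, and the same function can wrap a real request later.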
The practical acceptance test is simple: another developer should be able to read this article, identify the exact inputs, reproduce the request sequence, and explain the accepted and rejected rows without relying on the terminology list at the end. If a term appears in the vocabulary, it should correspond to a stored field, a validation check, a replay step, or an implementation decision in the backtesting workflow.
This is also why success should not be measured only by the final chart, table, or headline metric. The better standard is whether the data path, timing model, entitlement state, and evidence trail survive review. When those pieces are written directly into the body, the terminology becomes part of the workflow readers can implement.
Model portfolios need common evidence
A portfolio of models is only as comparable as its artifacts. Each model should store signal timestamp, dataset, schema version, contract selection rule, quote-aware fill policy, slippage model, and reject taxonomy. For options models, include OCC option symbol, DTE bucket, bid/ask spread, NBBO freshness, quote condition, trade condition, open interest, implied volatility, and no-bid exit handling.
Without that common evidence, a portfolio can combine incompatible assumptions. One model may be measured on aggregate bars while another is measured on top-of-book quotes. One may count stale quote rejects while another omits them. A better portfolio review puts operating quality beside return quality before deciding which model deserves paper capital.
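A comparability check like this can be sketched as a shared evidence record plus a field-by-field comparison. The record below covers the model-level fields named above; the class name, field names, and sample values are hypothetical, and per-trade data such as the signal timestamp would live in the trade log rather than in this record.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ModelEvidence:
    model_id: str
    dataset: str
    schema_version: str
    contract_selection_rule: str
    fill_policy: str
    slippage_model: str
    reject_taxonomy: Tuple[str, ...]


# Fields that must match before two models' results are compared side by side.
# The contract selection rule is left out because models may legitimately differ there.
SHARED_FIELDS = ("dataset", "schema_version", "fill_policy",
                 "slippage_model", "reject_taxonomy")


def incompatible_fields(a: ModelEvidence, b: ModelEvidence) -> List[str]:
    """Names of shared-assumption fields on which the two records disagree."""
    return [name for name in SHARED_FIELDS if getattr(a, name) != getattr(b, name)]


# Illustrative records only.
model_a = ModelEvidence(
    "model_a", "top_of_book_quotes", "v2", "nearest_expiry_atm",
    "buy_at_ask", "half_spread", ("stale_quote", "wide_spread", "no_bid"),
)
model_b = ModelEvidence(
    "model_b", "aggregate_bars", "v2", "fixed_delta",
    "buy_at_ask", "half_spread", ("wide_spread", "no_bid"),
)

print(incompatible_fields(model_a, model_b))
```

An empty result does not prove the models belong together, but a non-empty one is a cheap early signal that their return numbers were produced under different assumptions.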
Terminology
Market-data terms used in this article
These terms keep the article connected to the CuteMarkets knowledge base and to the exact API workflow behind the research.
Point-in-time contracts
Contract discovery anchored to the research date so a backtest does not use future listings.
Quote-aware fills
Entry and exit assumptions based on bid/ask quotes, quote age, spread width, and side-specific fill rules.
Reject reasons
Logged explanations for skipped contracts or fills, including stale quote, wide spread, no bid, or missing data.
Replay artifact
The saved request, selection, fill, reject, and metric record that lets another developer audit the backtest.
Cache key
The structured identifier that keeps provider, endpoint, ticker, timestamp, plan, and schema state from being mixed.
Signal timestamp
The exact time a strategy made a decision, used to reconstruct the visible universe and quote window causally.
Look-ahead leakage
A research error where a fill, contract, indicator, or label uses information unavailable at decision time.
Walk-forward test
A validation method that repeatedly trains and evaluates across separated time windows instead of trusting one optimized sample.
Slippage model
A fill-cost assumption based on bid/ask side, midpoint, spread percent, quote age, and liquidity policy.
Same-bar fill
An intraday backtest assumption that can become invalid when signal, entry, stop, and target ordering is ambiguous.
Promotion gate
The written threshold that decides whether a research candidate can move into paper trading or production monitoring.
Options data API
The product surface for chains, contracts, quotes, trades, aggregates, Greeks, IV, open interest, and expirations.
OPRA-originating data
The U.S. listed-options source context behind quotes, trades, exchange participation, and consolidated option-market records.
OCC option symbol
The exact option contract identifier that preserves root, expiration, call or put side, and strike.
Bid/ask spread
The execution interval between bid and ask that determines whether a contract is realistically tradable.
Midpoint
The computed center between bid and ask, useful as a reference price but not proof that an order would fill.
Quote/trade condition
The condition-code, exchange, correction, sequence, and timestamp context that explains how a quote or trade row can be used.
Quote vs trade semantics
The distinction between executable bid/ask markets, printed transactions, and bar-level summaries.
REST snapshot
A reproducible request for current or historical market state, used for initialization, backfills, and audit logs.
WebSocket stream
A persistent live connection that needs subscription topics, reconnect tracking, freshness labels, and REST repair paths.
Entitlement gate
The product, plan, quote, live, delayed, historical, or commercial-use boundary checked before data is shown.
Quote freshness
The age, timestamp, and live or delayed state of a bid/ask record before it is used in a scanner, backtest, or UI.
Timestamp semantics
The exchange, provider, ingestion, session, and application time context attached to a market-data record.
Pagination cursor
The continuation token or next URL that keeps large chains, trades, quotes, and historical windows complete.

Written by
Daniel Ratke
Research & Engineering
Daniel covers the deeper research notes: options backtesting, execution realism, robustness testing, data engineering, and strategy validation.