Validation · April 15, 2026 · 5 min read

Strategy Robustness Testing: PBO, Deflated Sharpe, and Overlap Filters Explained

CuteMarkets Team

Research

Repository reference: cutebacktests

Abstract

Strategy robustness testing is the stage where a promising model is forced to answer three uncomfortable questions. Does the edge survive repeated selection pressure? Does the performance still look meaningful after accounting for multiple testing and sampling luck? Does the strategy add something useful to the rest of the book, or is it just another way to take the same risk?

This repository provides concrete answers to those questions because its recent promotion logic is explicit. The March 8 audit repaired combined-fold PBO and DSR usage, as documented in Backtesting Framework Issue Summary. Later, the c4 follow-up work imposed a hard portfolio gate that required more than positive return. In Episode 8, the branch needed feasible selection, trades_per_week >= 1.5, orb_overlap_days >= 30, c66_overlap_days >= 30, offset_ratio_on_orb_down_days >= 0.5, zero extra option attempts, and zero quote rejects. That is robustness testing in the form it actually takes when a strategy is close to admission.

Question

The practical question is not whether PBO and Deflated Sharpe are useful in theory. It is how those diagnostics interact with sample size, feasibility, overlap, and execution cleanliness when a strategy approaches deployment.

In this repo, the answer is that robustness is multi-layered. A branch can have a good-looking PnL path and still fail because its fold-level diagnostics are weak. It can have decent diagnostics and still fail because the trade frequency is too low. It can improve on both fronts and still fail because it does not offset the rest of the portfolio well enough or because its live option path remains too messy.

Method: How Strategy Robustness Testing Works in Practice

I think of robustness testing here as a stack rather than as one metric.

At the first layer are repeated out-of-sample results and fold-aware diagnostics. PBO, the probability of backtest overfitting, asks, in effect, how often the apparent winner from the training side fails to hold up out of sample. The Deflated Sharpe Ratio (DSR), or a closely related DSR-style adjustment, asks whether the reported Sharpe still looks meaningful after discounting for selection pressure and sampling luck. These metrics are useful because they fight a common research bias: the tendency to believe the best variant is meaningful simply because it is the best variant in the backtest.
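To make those two diagnostics concrete, here is a minimal sketch using only the Python standard library. It follows the textbook forms: PBO estimated by combinatorially symmetric cross-validation (Bailey, Borwein, López de Prado, and Zhu) and the Deflated Sharpe Ratio of Bailey and López de Prado. The repository's own implementation is not shown in this post, so the function names, the block count, and the use of per-period, non-annualized Sharpe are all assumptions made for illustration.

```python
from itertools import combinations
from math import e, log, sqrt
from statistics import NormalDist, mean, stdev

EULER_GAMMA = 0.5772156649015329


def sharpe(returns):
    """Per-period Sharpe ratio (not annualized); 0.0 when volatility is zero."""
    sd = stdev(returns)
    return mean(returns) / sd if sd > 0 else 0.0


def pbo_cscv(returns_by_variant, n_blocks=4):
    """Estimate PBO by combinatorially symmetric cross-validation.

    returns_by_variant maps a variant name to its per-period returns;
    all series must have the same length. n_blocks must be even.
    """
    names = list(returns_by_variant)
    block_len = len(returns_by_variant[names[0]]) // n_blocks
    blocks = [range(i * block_len, (i + 1) * block_len) for i in range(n_blocks)]

    logits = []
    for train_ids in combinations(range(n_blocks), n_blocks // 2):
        train_idx = [t for b in train_ids for t in blocks[b]]
        test_idx = [t for b in range(n_blocks) if b not in train_ids for t in blocks[b]]
        train_sr = {n: sharpe([returns_by_variant[n][t] for t in train_idx]) for n in names}
        test_sr = {n: sharpe([returns_by_variant[n][t] for t in test_idx]) for n in names}

        best = max(names, key=train_sr.get)  # in-sample winner for this split
        # Relative out-of-sample rank of that winner, strictly inside (0, 1).
        omega = sum(test_sr[n] <= test_sr[best] for n in names) / (len(names) + 1)
        logits.append(log(omega / (1 - omega)))

    # Share of splits where the in-sample winner ranks at or below the OOS median.
    return sum(x <= 0 for x in logits) / len(logits)


def expected_max_sharpe(n_trials, sr_variance):
    """Approximate expected best Sharpe among n_trials (>= 2) skill-less variants."""
    nd = NormalDist()
    return sqrt(sr_variance) * (
        (1 - EULER_GAMMA) * nd.inv_cdf(1 - 1 / n_trials)
        + EULER_GAMMA * nd.inv_cdf(1 - 1 / (n_trials * e))
    )


def deflated_sharpe(sr, n_obs, n_trials, sr_variance, skew=0.0, kurt=3.0):
    """Probability that the observed Sharpe exceeds the selection-adjusted benchmark.

    sr is the per-period Sharpe of the selected variant, sr_variance is the
    variance of Sharpe estimates across the tried variants, and kurt is the
    non-excess kurtosis of returns (3.0 for a normal distribution).
    """
    sr0 = expected_max_sharpe(n_trials, sr_variance)
    denom = sqrt(1 - skew * sr + (kurt - 1) / 4 * sr ** 2)
    return NormalDist().cdf((sr - sr0) * sqrt(n_obs - 1) / denom)
```

Two behaviours are worth checking on any implementation of this kind: `pbo_cscv` should approach 1 when the in-sample winner keeps flipping out of sample, and `deflated_sharpe` should fall as `n_trials` grows and rise with `n_obs` when the observed Sharpe exceeds the benchmark. That dependence on the number of trials and observations is how selection pressure and sample size enter the first layer.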

At the second layer are deployability constraints. A strategy with a good DSR can still be useless if it trades too rarely, if its option structures are not feasible often enough, or if its quote path is too messy. That is why the c4 gate included trade frequency, selection feasibility, zero extra option attempts, and zero quote rejects.
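The deployability layer can be written as a handful of explicit checks. The sketch below is illustrative rather than the repository's code: only `trades_per_week` and its 1.5 threshold come from the Episode 8 gate as quoted above, while the class name and the field names `selection_feasible`, `extra_option_attempts`, and `quote_rejects` are hypothetical labels for the other three conditions.

```python
from dataclasses import dataclass


@dataclass
class DeployabilityStats:
    selection_feasible: bool    # hypothetical name for "feasible selection"
    trades_per_week: float      # metric name taken from the Episode 8 gate
    extra_option_attempts: int  # hypothetical name; the gate requires zero
    quote_rejects: int          # hypothetical name; the gate requires zero


def deployability_failures(stats, min_trades_per_week=1.5):
    """Return the names of the deployability checks that fail, in a fixed order."""
    failures = []
    if not stats.selection_feasible:
        failures.append("selection_feasible")
    if stats.trades_per_week < min_trades_per_week:
        failures.append("trades_per_week")
    if stats.extra_option_attempts != 0:
        failures.append("extra_option_attempts")
    if stats.quote_rejects != 0:
        failures.append("quote_rejects")
    return failures
```

Returning the list of failed checks, rather than a single boolean, keeps the reason for a rejection visible in the research log.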

At the third layer are portfolio filters. Overlap days and offset behavior matter because a live portfolio does not need another branch that behaves exactly like the existing winner. The c4 branch was evaluated directly against that standard through orb_overlap_days, c66_overlap_days, and offset_ratio_on_orb_down_days.
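The post does not define how the overlap and offset metrics are computed, so the sketch below uses assumed definitions: an overlap day is a date on which both strategies booked a nonzero daily PnL, and the offset ratio is the candidate's PnL on the reference strategy's losing days divided by the size of those losses. The helper names and the sample numbers are made up for illustration. Only the metric names and the thresholds (30, 30, and 0.5) come from the gate described earlier.

```python
def overlap_days(candidate_pnl, reference_pnl):
    """Count dates on which both strategies booked a nonzero daily PnL."""
    return sum(
        1 for day, pnl in candidate_pnl.items()
        if pnl != 0 and reference_pnl.get(day, 0) != 0
    )


def offset_ratio_on_down_days(candidate_pnl, reference_pnl):
    """Candidate PnL on the reference's losing days, per unit of reference loss."""
    down_days = [day for day, pnl in reference_pnl.items() if pnl < 0]
    reference_loss = -sum(reference_pnl[day] for day in down_days)
    if reference_loss == 0:
        return 0.0  # no losing days to offset
    candidate_gain = sum(candidate_pnl.get(day, 0.0) for day in down_days)
    return candidate_gain / reference_loss


def portfolio_filter_failures(metrics, min_overlap_days=30, min_offset_ratio=0.5):
    """Return the names of the portfolio filters that the candidate fails."""
    failures = []
    for key in ("orb_overlap_days", "c66_overlap_days"):
        if metrics[key] < min_overlap_days:
            failures.append(key)
    if metrics["offset_ratio_on_orb_down_days"] < min_offset_ratio:
        failures.append("offset_ratio_on_orb_down_days")
    return failures


# Illustrative daily PnL keyed by date label (not real results).
orb = {"d1": -100, "d2": 50, "d3": -100, "d4": 0}
c4 = {"d1": 80, "d2": 10, "d3": 40, "d5": 5}
shared = overlap_days(c4, orb)               # d1, d2, d3
offset = offset_ratio_on_down_days(c4, orb)  # 120 gained against 200 lost
```

Under these assumed definitions, a branch that simply mirrors the existing winner would earn a negative offset ratio, because it loses on the same days the reference does.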

Evidence / Results

The March 8 audit repaired the metrics side of this process in a very direct way. According to Backtesting Framework Issue Summary, top-level PBO and DSR had been using per-symbol fold rows and the wrong scenario assumptions. The patch added combined-fold aggregation across symbols and separated dashboard diagnostics from selection diagnostics. That is a major change because portfolio robustness should be evaluated on the combined strategy stream, not on a flattering symbol-by-symbol decomposition.
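A small sketch shows why the aggregation level matters. It assumes a hypothetical row schema with `symbol`, `fold`, `date`, and `pnl` keys, and the symbols and numbers are placeholders. The repository's actual schema and patch are not reproduced here.

```python
from collections import defaultdict


def combine_folds(rows):
    """Sum per-symbol PnL into one portfolio-level daily series per fold.

    rows: iterable of dicts with hypothetical keys "symbol", "fold",
    "date" (ISO string, so lexical order is chronological), and "pnl".
    """
    by_fold = defaultdict(lambda: defaultdict(float))
    for row in rows:
        by_fold[row["fold"]][row["date"]] += row["pnl"]
    return {
        fold: [dates[d] for d in sorted(dates)]
        for fold, dates in sorted(by_fold.items())
    }


# Placeholder rows: symbol AAA looks fine on its own in fold 0,
# but the combined stream is negative on the first day.
rows = [
    {"symbol": "AAA", "fold": 0, "date": "2026-01-05", "pnl": 120.0},
    {"symbol": "BBB", "fold": 0, "date": "2026-01-05", "pnl": -150.0},
    {"symbol": "AAA", "fold": 0, "date": "2026-01-06", "pnl": 40.0},
    {"symbol": "BBB", "fold": 1, "date": "2026-02-02", "pnl": 75.0},
]
combined = combine_folds(rows)
```

Fold-level diagnostics computed on `combined` see one series per fold, which is the stream the portfolio would actually trade, instead of one row per symbol per fold.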

The c4 branch then provides the clearest gate-level example. As summarized in Toward The One Piece Of Sharpe and Episode 8, repaired stock-stage variants restored 79 and 85 trade rows and made the branch materially more interesting after a debugging issue was fixed. Even so, the final answer on April 18 was park_c4.

That result is informative because it shows what robustness testing is supposed to do. It is supposed to reject a strategy that became more interesting but still did not clear the combined bar. A weaker research culture would have reported the repair and quietly ignored the gate failure. This repo reported both.

What Worked

What worked was the refusal to treat one robustness metric as a magic key. The repo did not say that passing PBO was enough. It did not say that positive return was enough. It did not say that restoring trade count after a bug fix was enough. It kept asking whether the branch was feasible, active enough, clean enough, and additive enough.

This also helps explain why c66 became the lead_paper_bot. Its value was not one isolated return number. It was the combination of out-of-sample stability and admission readiness. That is a robustness story with operational consequences, not a simple profitability headline.

What Failed

What failed was the idea that strategy quality can be summarized by one attractive chart or one metric. The c4 case shows this clearly. The branch improved after debugging and density repair, but still failed because several smaller deficits accumulated: feasibility, overlap, offset behavior on ORB-down days, and option-parity cleanliness. That is how many real strategies die: not in one dramatic collapse, but by failing to become clearly additive enough.

The compression family gives another cautionary example. c66 looked strong, yet a nearby compression variant still failed pbo_ok and dsr_ok. That tells you the robustness bar is doing useful work. It is preventing researchers from making lazy family-level claims on the back of one winner.

Takeaway

Strategy robustness testing is not a ceremonial final step. It is the process that turns a profitable backtest into either a candidate or a rejection. In this repository, that process is explicit enough to study: fix the fold-level metrics, test the branch on the right combined object, then apply deployability and overlap gates that reflect the actual portfolio problem.

If you want the temporal side of this discipline, Walk-Forward Backtesting: How to Test a Trading Strategy Without Fooling Yourself is the right companion. If you want a case study in a branch that still failed after improving, Why c4 Was Parked: A Dispersion Strategy That Improved But Still Failed the Portfolio Gate carries the story further.