CPCV: the cross-validation method that catches overfitting walk-forward misses

If you've done any serious backtesting you know about walk-forward optimisation — split history into rolling in-sample and out-of-sample windows, tune on the in-sample, validate on the out-of-sample, stitch the out-of-sample chunks together. It works. It catches most curve-fits. It is still wrong in a subtle way that matters when you actually deploy capital.

Walk-forward gives you one out-of-sample equity curve. One. That curve is the result of one specific way of slicing your history into chunks, in one specific order. Slice it differently — start the first in-sample window six months later, or use 9-month chunks instead of 12-month — and you get a different out-of-sample curve. Some of those curves look great; some look terrible. Walk-forward shows you one, and the strategy gets deployed (or killed) based on whether that one happened to look good.

Marcos López de Prado, formerly head of machine learning at AQR Capital Management and author of Advances in Financial Machine Learning (2018), one of the most-cited books in modern quant finance, calls this the "backtest selection bias" problem. His proposed fix, combinatorial purged cross-validation (CPCV), is the cleanest available answer.

The core idea

CPCV generalises walk-forward in three ways. First, it doesn't just slide one window forward: it evaluates every combination of k test groups drawn from N blocks of history. Second, it purges any training observation whose label window overlaps the test set (a common source of leakage in time series with forward-looking labels). Third, it embargoes a small buffer of training observations immediately after each test block to handle serial correlation.

The output is not a single equity curve. It's a distribution of equity curves, one per combination. From that distribution you can compute the mean expected performance, the standard deviation, and an empirical estimate of the probability that out-of-sample performance falls below any chosen threshold.

The mechanics, step by step

1. Divide the full history into N equal-length contiguous groups (typical: N=10).
2. Choose k groups as the test set, the remaining N−k as train (typical: k=2, so 8 train / 2 test).
3. Apply purging: remove training observations whose label horizon overlaps the test set. (Critical for ML or time-aware labels; less so for plain price-only strategies.)
4. Apply embargo: ignore training observations within a small buffer (e.g. 1% of total bars) after each test block to handle autocorrelation.
5. Train the strategy parameters on the remaining train set, evaluate on the test set, record the out-of-sample performance.
6. Repeat for every distinct combination of k test groups out of N. With N=10, k=2, that's C(10,2) = 45 combinations.
7. Collect the out-of-sample results. Each split yields its own test-segment equity curve (45 in total, each covering 20% of the bars). Because every group is tested in 9 different splits, the segments can also be recombined into 9 full-history backtest paths.
8. Compute the mean and standard deviation of any performance metric (Sharpe, CAGR, max DD) across the splits.
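The split-generation part of these steps can be sketched in Python. This is a minimal illustration under stated assumptions, not the platform's implementation: the function name cpcv_splits and its parameters are hypothetical, groups are contiguous index blocks, labels for bar t are assumed to use bars t+1 to t+label_horizon, and the buffer dropped after each test block is assumed to be the label horizon plus the embargo.

```python
from itertools import combinations

import numpy as np


def cpcv_splits(n_bars, n_groups=10, k_test=2, embargo_pct=0.01, label_horizon=0):
    """Yield (train_idx, test_idx) for every choice of k_test groups.

    Hypothetical helper; assumes n_bars >= n_groups.
    """
    groups = np.array_split(np.arange(n_bars), n_groups)  # contiguous blocks
    embargo = int(n_bars * embargo_pct)
    for test_ids in combinations(range(n_groups), k_test):
        is_test = np.zeros(n_bars, dtype=bool)
        drop = np.zeros(n_bars, dtype=bool)
        for g in test_ids:
            start, end = groups[g][0], groups[g][-1]
            is_test[start:end + 1] = True
            # Purge: training bars whose label window reaches into the test block.
            drop[max(0, start - label_horizon):start] = True
            # After the block: labels of the last test bars extend label_horizon
            # bars further, then the embargo buffer is added on top.
            drop[end + 1:end + 1 + label_horizon + embargo] = True
        train_idx = np.flatnonzero(~is_test & ~drop)
        test_idx = np.flatnonzero(is_test)
        yield train_idx, test_idx


splits = list(cpcv_splits(1000))  # N=10, k=2, 1% embargo, no forward labels
```

Each yielded pair would then be handed to whatever fit-and-evaluate routine the strategy uses. With 1,000 bars this produces 45 pairs, each with 200 test bars.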

CPCV vs walk-forward, at a glance

Walk-forward (N=10, k_test=2, anchored):
  Iterations:        1 (sliding forward)
  OOS curves:        1
  OOS observations:  ~20% of history

CPCV (N=10, k_test=2):
  Iterations:        C(10,2) = 45 combinations
  OOS curves:        45 per-split curves (recombinable into 9 full-history paths)
  OOS observations:  every bar appears in OOS in exactly 9 of the 45 splits

# Each bar gets evaluated as out-of-sample many times,
# combined with many different training windows. The variance
# of the resulting performance distribution is the real signal.
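The CPCV counts above follow from simple combinatorics. A quick check in Python, with illustrative variable names; the last line uses the path-count formula from López de Prado's construction, (k/N) × C(N, k), which gives the number of full-history backtest paths the test segments can be recombined into:

```python
from math import comb

n_groups, k_test = 10, 2

n_splits = comb(n_groups, k_test)                  # ways to pick the test groups
splits_per_group = comb(n_groups - 1, k_test - 1)  # splits in which a given group is tested
n_paths = n_splits * k_test // n_groups            # full-history paths

print(n_splits, splits_per_group, n_paths)
```

For N=10 and k=2 this gives 45 splits, with every group (and so every bar) tested in 9 of them.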

What CPCV shows you that walk-forward doesn't

Walk-forward says "this strategy made X% out of sample over the test period." CPCV says "this strategy's out-of-sample Sharpe is 0.7 ± 0.3 across 45 different train/test splits, with a 15% chance of being below 0.4." That second statement is operationally useful in a way the first isn't.
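A statement of that kind comes straight from the per-split metrics. A minimal sketch, assuming you already have one out-of-sample Sharpe per split; summarise is a hypothetical helper, and the four input values are made up purely to show the call:

```python
import numpy as np


def summarise(split_sharpes, threshold):
    """Mean, sample standard deviation, and share of splits below a threshold."""
    s = np.asarray(split_sharpes, dtype=float)
    return {
        "mean": float(s.mean()),
        "std": float(s.std(ddof=1)),
        "p_below": float((s < threshold).mean()),
    }


# Made-up values; a real run would supply one Sharpe per CPCV split.
summary = summarise([0.9, 0.5, 0.3, 1.1], threshold=0.4)
```

The p_below figure is an empirical frequency over the splits you ran, so it describes the history in the sample rather than guaranteeing anything about live trading.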

Specifically, running CPCV over a parameter sweep lets you estimate a probability of backtest overfitting (PBO): the probability that the configuration that looks best in-sample ranks below the median out of sample, meaning its apparent edge is selection bias rather than real. As rules of thumb, PBO above 50% means the strategy's optimisation found something that doesn't generalise. PBO above 80% means "do not deploy." PBO under 30% is meaningful evidence the edge is real.
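One common way to estimate PBO is the rank-logit approach from Bailey, Borwein, López de Prado and Zhu. The article does not say how the platform computes it, so treat the following as a sketch under that assumption: pbo is a hypothetical function that takes in-sample and out-of-sample performance matrices with one row per split and one column per parameter configuration.

```python
import numpy as np


def pbo(is_perf, oos_perf):
    """Share of splits where the in-sample winner lands at or below the OOS median."""
    is_perf = np.asarray(is_perf, dtype=float)
    oos_perf = np.asarray(oos_perf, dtype=float)
    n_splits, n_configs = is_perf.shape
    best = is_perf.argmax(axis=1)  # in-sample winner per split
    # OOS rank of every configuration: 1 = worst, n_configs = best.
    ranks = oos_perf.argsort(axis=1).argsort(axis=1) + 1
    winner_rank = ranks[np.arange(n_splits), best]
    omega = winner_rank / (n_configs + 1)  # relative rank in (0, 1)
    logits = np.log(omega / (1 - omega))
    return float((logits <= 0).mean())


# Toy inputs: 2 splits x 3 configurations, values made up for illustration.
in_sample = [[1, 2, 3], [3, 2, 1]]
pbo_consistent = pbo(in_sample, [[1, 2, 3], [3, 2, 1]])  # winners stay on top
pbo_reversed = pbo(in_sample, [[3, 2, 1], [1, 2, 3]])    # winners drop to last
```

When the in-sample winner also wins out of sample in every split the estimate is 0; when it always finishes last the estimate is 1.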

Why purging and embargo matter

If your strategy uses labels that look forward — for example, the label of bar t is "did price rise by 1% in the next 24 hours" — then training on bar t and testing on bar t+1 leaks information. The training set already knows what happens in the test window. Purging removes those overlapping training observations; embargo adds a buffer to handle autocorrelation in returns.
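To make the leak concrete, here is a small sketch of the overlap test behind purging. It assumes hourly bars, a label for bar t built from bars t+1 through t+label_horizon, and a test block spanning bars 1000 to 1199; the function name and all numbers are illustrative.

```python
def label_overlaps_test(t, label_horizon, test_start, test_end):
    """True if the label window of training bar t touches the test block."""
    first, last = t + 1, t + label_horizon
    return first <= test_end and last >= test_start


# Training bars before the test block that would have to be purged
# with a 24-bar (24-hour) label horizon.
purged = [t for t in range(1000) if label_overlaps_test(t, 24, 1000, 1199)]
```

Here purged holds bars 976 through 999: the 24 training bars whose labels already depend on prices inside the test window.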

For plain price-only strategies (no forward-looking labels), purging is less critical and embargo can be small. For ML strategies with multi-bar label horizons, purging is essential and embargo can be a few percent of total history. The platform's CPCV implementation auto-detects label horizons from the strategy definition and applies appropriate purging.

When to use CPCV vs walk-forward

Walk-forward is faster (1 fit vs 45). Use it for early iteration when you're still tuning the strategy structure.
Walk-forward is intuitive — one OOS curve maps directly to how the strategy would have been deployed historically. Use it for client-facing reports.
CPCV catches more overfitting and gives confidence intervals. Use it before deploying capital.
CPCV is required for marketplace publication on Noon Barbari — we use the PBO output to decide whether a strategy can be listed.

Running CPCV on the platform

rules.yaml — strategy with CPCV validation block

strategy:
  name: ema_cross_v3
  indicators:
    - { id: fast, kind: EMA, period: { sweep: [10, 14, 21] } }
    - { id: slow, kind: EMA, period: { sweep: [50, 100] } }
  rules:
    entry: { type: cross_above, left: fast, right: slow }
    exit:  { type: cross_below, left: fast, right: slow }
  validation:
    method: cpcv
    n_groups: 10
    k_test:   2
    embargo_pct: 1.0
    label_horizon_bars: 0       # no forward-looking labels
    metrics: [sharpe, cagr, max_drawdown, pbo]
    require:
      mean_sharpe: { min: 0.6 }
      pbo:         { max: 0.5 }
      worst_split_drawdown: { max: 0.30 }

The require: block makes the validation fail-closed: if mean OOS Sharpe drops below 0.6, or PBO exceeds 50%, or the worst single split has a >30% drawdown, the strategy is rejected for marketplace publication and flagged in the run results. The numbers are baseline defaults; tighten them to match your own bar.
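The fail-closed behaviour can be illustrated with a short Python sketch. This is not the platform's code: check_requirements is a hypothetical function, the dictionary mirrors the require: block above, and the result values are made up. The point is that a missing or non-numeric metric counts as a failure rather than a pass.

```python
def check_requirements(results, require):
    """Return a list of failure messages; an empty list means the run passes."""
    failures = []
    for metric, bounds in require.items():
        value = results.get(metric)
        if value is None:
            failures.append(f"{metric}: missing")  # fail closed
            continue
        # "not >=" and "not <=" also reject NaN, which plain "<" and ">" would let through.
        if "min" in bounds and not value >= bounds["min"]:
            failures.append(f"{metric}: {value} < min {bounds['min']}")
        if "max" in bounds and not value <= bounds["max"]:
            failures.append(f"{metric}: {value} > max {bounds['max']}")
    return failures


require = {
    "mean_sharpe": {"min": 0.6},
    "pbo": {"max": 0.5},
    "worst_split_drawdown": {"max": 0.30},
}
results = {"mean_sharpe": 0.7, "pbo": 0.55}  # made-up run; drawdown not reported
failures = check_requirements(results, require)
```

In this made-up run the strategy is rejected twice over: PBO exceeds its ceiling, and the drawdown metric is absent.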

Limitations

Three real ones. First, CPCV is computationally heavy: 45 backtests per parameter combination, on top of any parameter sweep. A run that takes 30 seconds with walk-forward can take 20 minutes with CPCV. Second, the choice of N and k affects the result. The defaults (N=10, k=2) are reasonable for crypto with 2+ years of data; with less data, use fewer groups so that each test block still holds enough bars to be informative. Third, CPCV measures generalisation across the history you have. It cannot measure generalisation to a future regime that's structurally different from anything in your sample.

Next steps

CPCV is most useful as a check on top of walk-forward results — start with WF for iteration, finish with CPCV before deployment. And the deflated Sharpe ratio post covers the complementary multiple-testing correction that López de Prado proposed alongside CPCV.

Try it on your own data

Every concept above is implemented in the platform. Backtest, walk-forward, paper trading, then go live: the same rule set at every stage.
