Here is an uncomfortable fact. If you backtest 100 random strategies on the same crypto data, the best of them will usually show a respectable Sharpe ratio, not because it has an edge but because you ran 100 trials and kept the luckiest. This is the multiple-testing problem, and it is one of the main reasons retail backtests fail live. The deflated Sharpe ratio is the statistic that prices that luck back out.
Why your Sharpe is inflated before you even start
A plain Sharpe ratio is computed as if the strategy in front of you were the only one you ever tried. It almost never is. Every parameter you swept, every indicator combination you compared, every variant you discarded — each was a trial. The Sharpe you finally report is the maximum over all of them, and the maximum of many noisy numbers is biased upward. The more you searched, the more inflated it is.
Two strategies can show the same backtested Sharpe of 1.8 and mean completely different things. If one came from a single test and the other is the best of 500, they are not comparable. The first might be real. The second is, statistically, mostly the residue of searching.
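The upward bias of "best of many" is easy to see in a small simulation. The sketch below is illustrative only: `best_of_n_sharpe` is a hypothetical helper, the 2% daily volatility is an assumed figure, and annualizing with 365 days assumes a market that trades every day. Every simulated strategy has a true Sharpe of zero, so whatever the winner shows is luck.

```python
import random
import statistics

def best_of_n_sharpe(n_trials, n_days=365, seed=0):
    """Annualized Sharpe of the luckiest of n_trials strategies with zero true edge."""
    rng = random.Random(seed)
    best = float("-inf")
    for _ in range(n_trials):
        # Pure noise: mean 0, 2% daily volatility (an assumed, illustrative figure).
        daily = [rng.gauss(0.0, 0.02) for _ in range(n_days)]
        sharpe = statistics.fmean(daily) / statistics.stdev(daily) * 365 ** 0.5
        best = max(best, sharpe)
    return best

# The reported "best" Sharpe climbs as the search widens, with no edge anywhere.
for n in (1, 10, 100, 500):
    print(f"best of {n:>3} trials: Sharpe {best_of_n_sharpe(n):.2f}")
```

With one year of daily data, a single zero-edge strategy's annualized Sharpe has a standard error of roughly 1, so the best of several hundred can land well above 2 by chance alone.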
What deflation actually does
The deflated Sharpe ratio (DSR), introduced by Bailey and López de Prado, asks a sharper question than "is the Sharpe positive?" It asks: given how many strategy variants were tried, and given that crypto returns are skewed and fat-tailed, how likely is it that the true Sharpe is above zero, rather than this being the best draw from a field of trials with no skill? It adjusts for three things at once:
- The number of independent trials — more searching means a higher bar the Sharpe must clear.
- The length of the track record — a short backtest gives a noisier estimate, so the result is discounted harder.
- Skew and kurtosis — non-normal returns make the standard Sharpe estimate less reliable, and the correction accounts for it.
The output is a probability between 0 and 1. A DSR of 0.95 means that, even after raising the bar to the Sharpe the luckiest of your trials would be expected to show with no skill at all, the statistic is 95% confident the true Sharpe clears it. A DSR of 0.4 means your impressive Sharpe is more likely an artefact of how hard you looked.
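For readers who want the formula, this is the statistic as it is usually written, following Bailey and López de Prado. The symbols are not defined elsewhere in this article, so read them as notation introduced here: the estimated Sharpe per period is \widehat{SR}, T is the number of return observations, \hat{\gamma}_3 and \hat{\gamma}_4 are the sample skewness and (non-excess) kurtosis, N is the number of independent trials, V[\widehat{SR}_n] is the variance of the Sharpe estimates across those trials, \Phi is the standard normal CDF, and \gamma \approx 0.5772 is the Euler-Mascheroni constant.

```latex
\widehat{PSR}(SR^{*}) = \Phi\!\left(
  \frac{(\widehat{SR} - SR^{*})\,\sqrt{T-1}}
       {\sqrt{1 - \hat{\gamma}_3\,\widehat{SR} + \frac{\hat{\gamma}_4 - 1}{4}\,\widehat{SR}^{2}}}
\right)

SR_0 = \sqrt{V[\widehat{SR}_n]}\,
  \left( (1-\gamma)\,\Phi^{-1}\!\left[1 - \frac{1}{N}\right]
       + \gamma\,\Phi^{-1}\!\left[1 - \frac{1}{N e}\right] \right)

\widehat{DSR} = \widehat{PSR}(SR_0)
```

Each of the three adjustments has its own term: N raises the threshold SR_0, T scales the numerator, and skewness and kurtosis widen or narrow the denominator.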
Its cousins — PSR and PBO
DSR sits in a small family of honesty checks. The probabilistic Sharpe ratio (PSR) is the simpler relative: it asks whether a single Sharpe beats a benchmark Sharpe given the track-record length and the return distribution, without the multiple-testing term. DSR is essentially PSR with the trial-count correction added.
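The relationship between the two can be sketched in a few lines of standard-library Python. This is a minimal sketch rather than a reference implementation: the function names are hypothetical, Sharpe ratios are per period (not annualized), and `sharpe_variance`, the variance of the per-period Sharpe estimates across your trials, is an input you have to supply from your own search log.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()
EULER_GAMMA = 0.5772156649015329

def sample_stats(returns):
    """Per-period Sharpe, skewness, and (non-excess) kurtosis of a return series."""
    t = len(returns)
    mu = sum(returns) / t
    m2 = sum((r - mu) ** 2 for r in returns) / t
    m3 = sum((r - mu) ** 3 for r in returns) / t
    m4 = sum((r - mu) ** 4 for r in returns) / t
    sharpe = mu / math.sqrt(m2 * t / (t - 1))  # sample standard deviation
    return sharpe, m3 / m2 ** 1.5, m4 / m2 ** 2

def psr(returns, benchmark_sharpe=0.0):
    """Confidence that the true per-period Sharpe exceeds benchmark_sharpe."""
    sr, skew, kurt = sample_stats(returns)
    spread = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    z = (sr - benchmark_sharpe) * math.sqrt(len(returns) - 1) / spread
    return STD_NORMAL.cdf(z)

def expected_max_sharpe(n_trials, sharpe_variance):
    """Sharpe the luckiest of n_trials zero-skill trials is expected to show."""
    if n_trials < 2:
        return 0.0  # a single trial involves no selection
    a = STD_NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = STD_NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(sharpe_variance) * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)

def deflated_sharpe(returns, n_trials, sharpe_variance):
    """PSR measured against the expected best-of-n_trials Sharpe instead of zero."""
    return psr(returns, expected_max_sharpe(n_trials, sharpe_variance))

# Illustrative data only: two years of simulated daily returns.
rng = random.Random(7)
daily = [rng.gauss(0.001, 0.02) for _ in range(730)]
for n in (1, 10, 500):
    print(n, round(deflated_sharpe(daily, n, sharpe_variance=0.001), 3))
```

With one trial the threshold is zero and DSR collapses to plain PSR. As the trial count grows, the same return series earns a lower score.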
The probability of backtest overfitting (PBO) comes at the same problem from a different direction. It uses combinatorial splits of your data to estimate how often the configuration that looked best in-sample underperforms the median out-of-sample. A PBO above 0.5 means your selection process is, more often than not, picking configurations that finish in the bottom half out-of-sample. Read DSR and PBO together: one prices the luck in your headline number, the other audits the process that produced it.
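The combinatorial idea behind PBO can be sketched as follows. This is a simplified illustration, not the platform's implementation: `pbo` and `sharpe` are hypothetical helpers, the data is cut into eight equal blocks (an assumed choice), and every way of assigning half the blocks to in-sample is tried.

```python
import random
from itertools import combinations
from statistics import fmean, stdev

def sharpe(returns):
    sd = stdev(returns)
    return fmean(returns) / sd if sd > 0 else 0.0

def pbo(strategy_returns, n_blocks=8):
    """Share of splits where the in-sample winner lands in the bottom half out-of-sample.

    strategy_returns: list of equal-length return lists, one per configuration.
    """
    block_len = len(strategy_returns[0]) // n_blocks
    blocks = [range(b * block_len, (b + 1) * block_len) for b in range(n_blocks)]
    bottom_half = 0
    n_splits = 0
    for in_ids in combinations(range(n_blocks), n_blocks // 2):
        in_rows = [t for b in in_ids for t in blocks[b]]
        out_rows = [t for b in range(n_blocks) if b not in in_ids for t in blocks[b]]
        in_sr = [sharpe([r[t] for t in in_rows]) for r in strategy_returns]
        out_sr = [sharpe([r[t] for t in out_rows]) for r in strategy_returns]
        winner = max(range(len(in_sr)), key=in_sr.__getitem__)
        # Relative out-of-sample rank of the in-sample winner, strictly inside (0, 1).
        rank = sum(s <= out_sr[winner] for s in out_sr) / (len(out_sr) + 1)
        bottom_half += rank <= 0.5
        n_splits += 1
    return bottom_half / n_splits

# Illustrative data only: ten configurations with no edge at all.
rng = random.Random(11)
noise = [[rng.gauss(0.0, 0.02) for _ in range(240)] for _ in range(10)]
print("PBO, ten zero-edge configurations:", round(pbo(noise), 2))
```

If one configuration has a genuine, persistent edge, it wins in-sample and stays near the top out-of-sample, so PBO falls toward zero. When every configuration is noise, the in-sample winner has no reason to repeat.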
How to use it without fooling yourself
- Count trials honestly — every parameter combination and every discarded variant is a trial. The number is larger than you think.
- Treat DSR as a gate, not a score to maximize. A strategy below ~0.95 is not ready, regardless of how good the raw Sharpe looks.
- Never optimize toward DSR itself — the moment you do, it becomes one more trial and stops being a check.
- Pair it with out-of-sample evidence. DSR corrects a number; walk-forward and Monte Carlo test the strategy. You want both.
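The first two rules are simple enough to write down as code. The sketch below is illustrative: the parameter names in `grid` are made up, `count_trials` and `passes_gate` are hypothetical helpers, and the 0.95 threshold is the one suggested above. Treat the raw count as an upper bound, since neighbouring parameter sets are correlated and DSR is defined in terms of independent trials.

```python
import math

def count_trials(param_grid, discarded_variants=0):
    """Grid size plus any variants tried and thrown away outside the grid."""
    grid_size = math.prod(len(values) for values in param_grid.values())
    return grid_size + discarded_variants

def passes_gate(dsr, threshold=0.95):
    """A pass/fail check, deliberately not something to optimize."""
    return dsr >= threshold

# Hypothetical sweep: four parameters with three values each.
grid = {
    "fast_ma": [5, 10, 20],
    "slow_ma": [50, 100, 200],
    "stop_pct": [0.01, 0.02, 0.05],
    "timeframe": ["1h", "4h", "1d"],
}
n_trials = count_trials(grid, discarded_variants=12)
print(n_trials)  # 3 * 3 * 3 * 3 + 12 = 93
```

A modest-looking sweep of four parameters already amounts to dozens of trials before counting the ideas you abandoned along the way.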
In Noon Barbari
Noon Barbari reports deflated Sharpe, probabilistic Sharpe, and PBO as part of its standard evaluation suite — alongside Sortino, Calmar, max drawdown, and drawdown duration — so the multiple-testing correction is in front of you, not buried in a paper. Pair it with the walk-forward optimizer and a Monte Carlo run for the full picture. The free tier lets you build and backtest one strategy with no time limit, so you can see your own deflated Sharpe before you trust the headline one.
The lesson is cheap to learn here and expensive to learn live: a high Sharpe is a claim, not a conclusion. The deflated Sharpe ratio is how you check the receipt.
Try it on your own data
Every concept above is implemented in the platform. Backtest, walk-forward, paper-trade, then promote to live — same rule set, all stages.