Avoiding Overfitting in Strategy Backtests

Why a great backtest can mean nothing — and how to tell the difference between a real edge and a curve-fitted illusion.

What is overfitting?

A strategy is overfitted when its parameters have been tuned so precisely to historical data that the strategy is essentially memorizing the past rather than learning a generalizable rule. It looks exceptional in backtest and performs poorly — or randomly — in live trading.

Overfitting is not always intentional. It happens naturally whenever you:

  • Scan many parameter combinations and pick the one with the highest Sharpe
  • Modify the strategy after seeing the backtest results, then retest on the same data
  • Add conditions specifically to avoid losing periods you can see in the chart
  • Use a short backtest period where a lucky parameter combination outperforms by chance

The core problem: Every backtest dataset contains both real patterns and noise. A complex enough strategy can always fit the noise. The noise will not repeat in the future.

The parameter scanning trap

Suppose you scan 100 parameter combinations. Even if your strategy has zero edge, about half of them will show a positive Sharpe purely by chance, and roughly 5 out of 100 will look statistically significant at a 5% significance level. If you pick the best one and report it as your strategy's performance, you're reporting noise as signal.

This problem gets worse the more parameters you scan. A strategy with 2 independent parameters on a 20×20 grid has 400 combinations. A strategy with 4 parameters on a 10×10×10×10 grid has 10,000. Each additional dimension multiplies the number of combinations you test, and with it the chance of stumbling onto a lucky peak.
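The effect is easy to reproduce with a small simulation. The sketch below is illustrative only: the function names, the 1% daily volatility, and the three-year length are assumptions, and each "combination" is drawn independently, whereas neighboring cells in a real scan are correlated. Every return series has zero mean by construction, so any Sharpe it shows is noise.

```python
import random
import statistics

def annualized_sharpe(daily_returns, periods_per_year=252):
    """Annualized Sharpe ratio of a return series (risk-free rate assumed 0)."""
    mean = statistics.fmean(daily_returns)
    stdev = statistics.stdev(daily_returns)
    return mean / stdev * periods_per_year ** 0.5

def simulate_scan(n_combinations, n_days=756, seed=0):
    """Sharpe of each 'parameter combination' when all of them are pure noise."""
    rng = random.Random(seed)
    sharpes = []
    for _ in range(n_combinations):
        # Zero-mean daily returns: there is no edge to find.
        returns = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
        sharpes.append(annualized_sharpe(returns))
    return sharpes

for n in (100, 400):
    sharpes = simulate_scan(n)
    positive = sum(s > 0 for s in sharpes)
    print(f"{n} combinations: {positive} positive, "
          f"median {statistics.median(sharpes):.2f}, best {max(sharpes):.2f}")
```

The median stays near zero while the best cell does not, and because both runs share a seed, the best of 400 can only match or exceed the best of 100.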

Rule of thumb: For every free parameter in your strategy, you need roughly 10× more trades in your backtest to make the results statistically meaningful. With two parameters, you already need a large number of trades. With four, the backtest length required is usually unrealistic.

The peak vs the plateau

The single most important practical technique for avoiding overfitting is to not pick the peak.

Look at a Sharpe heatmap from a parameter scan. The absolute best cell might show Sharpe 1.8. One step to the left: 0.4. One step up: 0.3. That peak is fragile — it exists because the exact combination of parameters happened to fit a few specific market events in your training data.

Instead, find the plateau: the region where Sharpe is consistently high across many neighboring parameter combinations. If Sharpe is 1.2 across a 3×3 neighborhood of parameter values, that's a robust signal — the edge exists across a range of parameters, not just at one specific point.

# BlaveClaw's find_plateau: picks the neighborhood with the
# highest average Sharpe in a 3×3 window around each cell —
# not the single maximum cell.
best_idx = argmax(neighborhood_average_sharpe)

This is why BlaveClaw's scan.py uses find_plateau instead of simply taking the best Sharpe. A parameter set sitting atop a smooth plateau is far more likely to generalize than one sitting at a sharp peak surrounded by poor performance.
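As a rough illustration of the idea, and not BlaveClaw's actual code, a plateau search over a 2-D Sharpe grid might look like the sketch below. The function names, the example grid, and the handling of edge cells (cells outside the grid are skipped) are all assumptions made for this example.

```python
def neighborhood_mean(grid, row, col):
    """Mean Sharpe of the 3x3 window centered on (row, col).

    Assumption: cells outside the grid are skipped, so edge and corner
    cells average over fewer than nine values.
    """
    values = [
        grid[r][c]
        for r in range(max(0, row - 1), min(len(grid), row + 2))
        for c in range(max(0, col - 1), min(len(grid[0]), col + 2))
    ]
    return sum(values) / len(values)

def all_cells(grid):
    return [(r, c) for r in range(len(grid)) for c in range(len(grid[0]))]

def find_peak(grid):
    """(row, col) of the single highest Sharpe cell."""
    return max(all_cells(grid), key=lambda rc: grid[rc[0]][rc[1]])

def find_plateau_sketch(grid):
    """(row, col) of the cell with the highest 3x3 neighborhood mean."""
    return max(all_cells(grid), key=lambda rc: neighborhood_mean(grid, *rc))

# Hypothetical scan result: a 3x3 plateau of 1.2 and an isolated 1.8 peak.
grid = [
    [0.3, 0.2, 0.3, 0.2, 0.3, 0.1],
    [0.2, 1.2, 1.2, 1.2, 0.2, 0.3],
    [0.3, 1.2, 1.2, 1.2, 0.3, 0.3],
    [0.2, 1.2, 1.2, 1.2, 0.4, 1.8],
    [0.3, 0.2, 0.3, 0.2, 0.3, 0.2],
]
print("peak:", find_peak(grid))
print("plateau:", find_plateau_sketch(grid))
```

On this grid the peak search lands on the isolated 1.8 cell, while the plateau search lands on the center of the 1.2 block. Skipping out-of-grid cells can favor corners that sit next to a few high values, so a stricter variant could require a full nine-cell window.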

In-sample vs out-of-sample

The only valid way to measure a strategy's real performance is to test it on data it has never seen. If you optimize parameters using data from 2020–2023 and then evaluate on 2024, the 2024 performance is a genuine out-of-sample estimate.

If you optimize on 2020–2024 and report 2020–2024 performance, that's in-sample. Even if you didn't intend to overfit, the optimization process implicitly used the 2020–2024 outcomes to select the parameters. The reported Sharpe includes the benefit of hindsight.

Method | Valid? | Notes
Optimize on full history, report full history performance | ✗ | In-sample: overfitting guaranteed
Optimize on first 70%, report last 30% performance | ✓ | Simple train/test split: valid but limited
Walk-forward: optimize on rolling window, report next period | ✓✓ | Most realistic: same method as live trading
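The walk-forward row can be sketched as a generator of index windows. This is a minimal illustration under assumptions: the function name, the bar counts, and the choice to roll forward by exactly one test period are not taken from BlaveClaw.

```python
def walk_forward_windows(n_bars, train_size, test_size):
    """Yield (train_range, test_range) pairs of bar indices.

    Each test range starts where its train range ends, and the window
    then rolls forward by test_size bars, so no test bar is ever used
    to choose the parameters it is evaluated with.
    """
    start = 0
    while start + train_size + test_size <= n_bars:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size

for train, test in walk_forward_windows(n_bars=1000, train_size=500, test_size=100):
    # Optimize parameters on `train`, then record performance on `test` only.
    print(f"train {train.start}-{train.stop - 1}, test {test.start}-{test.stop - 1}")
```

Stitching the test segments together gives a performance record in which every bar was traded with parameters chosen before that bar was seen.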

Scan ranges that prevent extremes

Another common overfitting trap is scanning across the full range of possible parameter values, including extreme ones that happen to produce high Sharpe on one unusual historical period.

BlaveClaw uses a different approach for threshold-based strategies: derive the scan range from the indicator's own distribution. Entry candidates come from the upper half of the indicator's historical distribution (p50–p90); exit candidates from the lower half (p10–p50). This keeps both thresholds grounded in observed data ranges, avoiding obviously unrealistic extremes.

The practical effect: you're not testing an entry threshold of 5.0 when the indicator has only reached that level twice in history. You're scanning ranges where there's enough historical data to compute meaningful statistics.
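A percentile-derived scan range can be sketched with the standard library. The function name and the 10-percentile step between candidates are assumptions for illustration; only the p50–p90 and p10–p50 bounds come from the description above.

```python
import statistics

def threshold_candidates(indicator_values):
    """Return (entry_candidates, exit_candidates) from decile cut points."""
    # n=10 yields the nine cut points p10, p20, ..., p90.
    deciles = statistics.quantiles(indicator_values, n=10)
    entry = deciles[4:]   # p50, p60, p70, p80, p90
    exit_ = deciles[:5]   # p10, p20, p30, p40, p50
    return entry, exit_

# Hypothetical indicator history: the integers 1 through 99.
history = list(range(1, 100))
entry, exit_ = threshold_candidates(history)
print("entry candidates:", entry)
print("exit candidates:", exit_)
```

Because every candidate is a quantile of the observed history, each threshold has a known share of past observations on either side of it, which is what keeps the scan away from rarely reached extremes.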

Signs your strategy is overfitted

1. Sharp peak on the Sharpe heatmap. The best parameter combination is surrounded by much lower values. One parameter step in any direction drops Sharpe by 0.5+. The edge is not robust; it's local.

2. Backtest Sharpe doesn't survive out-of-sample. Strategy shows Sharpe 2.0 on the training period, 0.2 on a held-out test period. The larger the gap, the more overfitted the strategy is.

3. Very few trades. A strategy with 12 trades in 3 years of backtest has almost no statistical power. A few lucky trades can produce a Sharpe of 2.0+ purely by chance. Require at least 30–50 complete trades before treating backtest metrics as meaningful.

4. Suspiciously round parameters. If your best parameters are SMA(50) / SMA(200) or RSI(14) or exactly 1.0 / −1.0, these are often the result of scanning and getting lucky at a well-known "default" value. They may work for the wrong reasons.

5. Strategy only works in one market regime. If backtest performance is entirely driven by one bull or bear period, and flat or negative in all others, the strategy may have learned that specific regime rather than a general rule.
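Three of these signs (sharp peak, out-of-sample gap, trade count) are mechanical enough to check in code. The sketch below is a hypothetical helper, not part of BlaveClaw. The 0.5 drop and the 30-trade minimum come from the list above, while the out-of-sample ratio of 0.5 is an assumed cutoff that only makes sense when the in-sample Sharpe is positive.

```python
def overfitting_flags(grid, best, in_sample, out_of_sample, n_trades,
                      peak_drop=0.5, min_trades=30, min_oos_ratio=0.5):
    """Return a list of warning strings for one chosen parameter cell."""
    row, col = best
    # Sharpe values of the in-grid cells adjacent to the chosen cell.
    neighbors = [
        grid[r][c]
        for r in range(max(0, row - 1), min(len(grid), row + 2))
        for c in range(max(0, col - 1), min(len(grid[0]), col + 2))
        if (r, c) != (row, col)
    ]
    flags = []
    # Sharp peak: even the best neighbor sits at least peak_drop lower.
    if neighbors and grid[row][col] - max(neighbors) >= peak_drop:
        flags.append("sharp peak: every neighbor is far below the best cell")
    # Assumed cutoff: out-of-sample keeps less than half the in-sample Sharpe.
    if out_of_sample < min_oos_ratio * in_sample:
        flags.append("out-of-sample Sharpe is far below in-sample Sharpe")
    if n_trades < min_trades:
        flags.append("too few trades for meaningful statistics")
    return flags

fragile = [[0.3, 0.4, 0.2], [0.4, 1.8, 0.3], [0.2, 0.3, 0.1]]
for flag in overfitting_flags(fragile, best=(1, 1), in_sample=2.0,
                              out_of_sample=0.2, n_trades=12):
    print("warning:", flag)
```

Round parameters and regime dependence are left out because they need judgment about the strategy and the market period rather than a numeric threshold.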

How many parameters is too many?

There is no universal cutoff, but a useful principle: every additional free parameter should have a prior reason to exist in the strategy logic. Don't add a parameter because it lets you fit the data better — add it because it represents a genuine dimension of the market mechanism you're exploiting.

A strategy with two parameters that represent meaningful concepts (a fast signal window and a slow confirmation window, for example) is more defensible than a six-parameter strategy where several of the parameters were added to "tune out" specific losing periods visible in the chart.
