Why individual strategy backtests aren't enough — and how BlaveClaw's walk-forward portfolio simulation validates the whole picture.
Suppose you've built three strategies, each with a respectable Sharpe ratio. You combine them in a portfolio. Is the portfolio better than any single strategy? You can't know without simulating how they interact.
Three things that only appear at the portfolio level:
BlaveClaw's management_backtest.py runs a walk-forward simulation — the same method the live manager uses, but replayed on historical data. It is the only valid way to estimate portfolio-level performance.
1. The first `lookback` days (default: 365) form the initial calibration window. Performance reporting starts only after this window, so every reported day is strictly out-of-sample.
2. Each day, the optimizer finds the weight vector w that maximizes the ratio of slope to volatility of the combined equity curve over the past 365 days. Constraints: weights sum to 1 and are all ≥ 0 (no shorting strategies).
3. The weights just computed are applied to the next day's actual strategy returns. The optimizer never sees the day it is predicting, which is what makes the test out-of-sample.
4. The window slides forward one day, and each day gets fresh weights based on the most recent 365-day history.
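The walk-forward loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `management_backtest.py`: it assumes returns arrive as a NumPy array with one row per day and one column per strategy, and `walk_forward` and its `optimize` hook are hypothetical names standing in for the real optimizer.

```python
import numpy as np

def walk_forward(returns, optimize, lookback=365):
    """Replay daily re-optimization over a (days x strategies) return matrix.

    `optimize` is a hypothetical hook: it receives only the trailing `lookback`
    rows and must return a weight vector for the next day.
    """
    returns = np.asarray(returns, dtype=float)
    n_days = returns.shape[0]
    managed, weights = [], []
    for t in range(lookback, n_days):
        window = returns[t - lookback:t]           # rows t-lookback .. t-1; day t is excluded
        w = np.asarray(optimize(window), dtype=float)
        managed.append(float(returns[t] @ w))      # apply weights to the unseen day t
        weights.append(w)
    return np.array(managed), np.array(weights)
```

With 400 days of history and the default 365-day lookback, the loop produces 35 out-of-sample days; the first 365 are used only for calibration.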
To know whether dynamic weight optimization actually adds value, the management backtest generates 1,000 random portfolios using the same strategies but with static random weights (sampled from a Dirichlet distribution). The key output metric is:
This benchmark matters because a manager that "beats random" in backtesting has a stronger case for doing so live. One that doesn't adds no value over arbitrary static weights, which usually means it is underperforming or relying on overfitted weights.
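A sketch of the random-portfolio benchmark, under several assumptions: weights come from a flat Dirichlet distribution (all concentration parameters equal to 1, our guess), the comparison statistic is an annualized Sharpe ratio with a zero risk-free rate and 365 periods per year, and all function names are hypothetical. Pass only the out-of-sample rows so the random portfolios cover the same days as the managed one.

```python
import numpy as np

def sharpe(daily, periods_per_year=365):
    # Annualized Sharpe, zero risk-free rate; 365 assumes returns on every calendar day
    daily = np.asarray(daily, dtype=float)
    sd = daily.std(ddof=1)
    return float(daily.mean() / sd * np.sqrt(periods_per_year)) if sd > 0 else 0.0

def random_benchmark(returns, n=1000, seed=0):
    """Sharpe ratios of n static portfolios with flat-Dirichlet weights."""
    returns = np.asarray(returns, dtype=float)
    rng = np.random.default_rng(seed)
    W = rng.dirichlet(np.ones(returns.shape[1]), size=n)  # each row sums to 1, all >= 0
    port = returns @ W.T                                   # (days x n) daily returns
    return np.array([sharpe(port[:, j]) for j in range(n)]), W

def percentile_rank(value, sample):
    # Percentage of random portfolios that the given Sharpe ratio beats
    return 100.0 * np.mean(np.asarray(sample) < value)
```

Feeding the managed portfolio's Sharpe ratio into `percentile_rank` gives the share of random portfolios it beats; a value near 50 means dynamic allocation is indistinguishable from picking static weights at random.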
The output chart, `manager/pnl.png`, has three panels:
| Panel | What it shows | What to look for |
|---|---|---|
| Top: Cumulative Return | Managed portfolio (green) vs. p5–p95 band of 1,000 random portfolios (grey) | Managed line should trend above the random median. Consistent outperformance sustained for more than 2 years is meaningful. |
| Middle: Drawdown | Managed portfolio's drawdown from equity peak | MDD should be tolerable. Deep drawdowns that recover slowly suggest low diversification or overly correlated strategies. |
| Bottom: Weight History | Each strategy's allocation % over time | Stable weights → strategies have consistent relative performance. Rapidly switching weights → the optimizer is chasing short-term noise. |
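The drawdown panel and the grey random band each reduce to a few array operations. A minimal sketch, assuming compounded returns with equity starting at 1.0 (the actual chart may use additive cumulative returns instead) and hypothetical function names:

```python
import numpy as np

def drawdown(daily):
    """Drawdown from the running equity peak, as a fraction (0 at a new high)."""
    r = np.asarray(daily, dtype=float)
    equity = np.concatenate([[1.0], np.cumprod(1.0 + r)])  # equity starts at 1.0
    peak = np.maximum.accumulate(equity)
    return (equity / peak - 1.0)[1:]                        # one value per return day

def random_band(random_daily, lo=5, hi=95):
    """Per-day p_lo / median / p_hi of cumulative return across random portfolios.

    `random_daily` has shape (days, n_portfolios); the result has shape (3, days).
    """
    cum = np.cumprod(1.0 + np.asarray(random_daily, dtype=float), axis=0) - 1.0
    return np.percentile(cum, [lo, 50, hi], axis=1)
```

The maximum drawdown (MDD) is then `drawdown(r).min()`, a negative fraction such as -0.5 for a 50% peak-to-trough loss.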
The optimizer maximizes this objective for the combined portfolio:
$$
\frac{\operatorname{slope}\big(\operatorname{cumsum}(R \cdot w)\big)}{\operatorname{volatility}(R \cdot w)} \qquad \text{(both computed over the last 365 days)}
$$
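Here R is the matrix of daily strategy returns (one column per strategy) and w is the weight vector, so R · w is the portfolio's daily return series. The objective is similar to a Sharpe ratio, but it uses the linear slope of the cumulative return curve rather than the arithmetic mean of daily returns. Because daily returns are noisy, the slope is the more stable measure: it captures how consistently the curve trends, rather than the average magnitude of each day's return.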
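A minimal sketch of this objective, assuming the slope is an ordinary least-squares line fitted to the cumulative return curve and the volatility is the sample standard deviation of daily portfolio returns. Neither detail is confirmed by the source, and `slope_over_vol` is a hypothetical name. Annualizing would multiply the ratio by a constant (slope by 365, volatility by the square root of 365), so it would not change which weights win.

```python
import numpy as np

def slope_over_vol(returns, w):
    """OLS slope of the cumulative portfolio return divided by daily volatility."""
    port = np.asarray(returns, dtype=float) @ np.asarray(w, dtype=float)  # R · w, one value per day
    curve = np.cumsum(port)                      # cumulative return curve
    days = np.arange(len(curve))
    slope = np.polyfit(days, curve, 1)[0]        # least-squares slope per day
    vol = port.std(ddof=1)                       # sample std of daily returns
    return slope / vol if vol > 0 else 0.0
```

For daily returns of 1, 1, 1, 3 the cumulative curve is 1, 2, 3, 6, whose fitted slope is 1.6; the sample standard deviation is 1, so the objective is 1.6.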
The constraint is sum(w) = 1, w ≥ 0. This means all capital is allocated across the strategies, and no strategy can receive negative weight (no shorting strategies against each other). The optimizer runs with 10 random restarts to avoid local optima.
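The constrained search with restarts can be sketched with SciPy's SLSQP solver, which supports an equality constraint and bounds. This is one plausible approach, not necessarily what `manager.py` does: the Dirichlet starting points, the equal-weight fallback when no restart converges, and the name `optimize_weights` are all assumptions. The objective is passed in as a callable, for example the slope/volatility ratio evaluated on the current 365-day window.

```python
import numpy as np
from scipy.optimize import minimize

def optimize_weights(objective, n_strategies, restarts=10, seed=0):
    """Maximize objective(w) s.t. sum(w) == 1 and w >= 0, keeping the best restart."""
    rng = np.random.default_rng(seed)
    constraints = [{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}]
    bounds = [(0.0, 1.0)] * n_strategies
    best_w = np.full(n_strategies, 1.0 / n_strategies)   # fallback: equal weight
    best_val = -np.inf
    for _ in range(restarts):
        x0 = rng.dirichlet(np.ones(n_strategies))         # random feasible start
        res = minimize(lambda w: -objective(w), x0, method="SLSQP",
                       bounds=bounds, constraints=constraints)
        if res.success and -res.fun > best_val:
            best_w, best_val = res.x, -res.fun
    best_w = np.clip(best_w, 0.0, None)                   # remove tiny negative round-off
    return best_w / best_w.sum()
```

Keeping the best of several random starts is a cheap guard against a single run settling in a poor local optimum; the cost is roughly `restarts` times one solve per day.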
The manager has a separate concern from weight allocation: how much total leverage to apply to the portfolio.
After optimizing weights, manager.py computes the portfolio's realized annualized volatility and sets leverage so that the portfolio's volatility matches a target:

`leverage = target_vol / portfolio_ann_vol`

The default target is 30% annual volatility. If the portfolio's realized vol is 15%, leverage = 2x. If it is 40%, leverage = 0.75x (de-leveraged).
`--target-vol 0.10`. At the default 30% target vol, expect drawdowns in the −40% to −60% range in bad periods.
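The volatility-targeting rule can be sketched as follows. It assumes daily returns annualized by the square root of 365 (calendar days, consistent with the 365-day lookback; the factor the real code uses is not stated), and `target_leverage` is a hypothetical name.

```python
import numpy as np

def target_leverage(daily_returns, target_vol=0.30, periods_per_year=365):
    """Leverage that scales realized annualized volatility to target_vol."""
    daily = np.asarray(daily_returns, dtype=float)
    ann_vol = daily.std(ddof=1) * np.sqrt(periods_per_year)  # square-root-of-time annualization
    if ann_vol == 0:
        raise ValueError("realized volatility is zero; leverage is undefined")
    return target_vol / ann_vol
```

A return series whose annualized volatility works out to 15% yields 2.0, and one at 40% yields 0.75, matching the examples above.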
Run it before going live whenever:
```bash
python3 manager/management_backtest.py --lookback 365 --random-n 1000
```
Run manager.py separately to update live weights after the backtest confirms the portfolio is sound:
```bash
python3 manager/manager.py --lookback 365 --account 10000 --target-vol 0.30
```
| Warning sign | What it likely means |
|---|---|
| Managed Sharpe < random median | Dynamic allocation is hurting, not helping. Consider equal-weight allocation instead. |
| Weights oscillate wildly (bottom panel) | Strategies have similar edge, so the optimizer is fitting noise. Use a longer lookback or add more strategies. |
| One strategy receives >80% weight consistently | The other strategies are not contributing. Evaluate whether they're worth running at all. |
| MDD exceeds 2× individual strategy MDD | Strategies are drawing down together. They may be too correlated to provide diversification. |
| OOS period < 180 days | Not enough data — results are statistically unreliable. Run individual strategy backtests longer first. |
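Most rows of this table can be checked automatically after a backtest. The sketch below is hypothetical: drawdowns are passed as positive fractions, "consistently" is interpreted as more than 80% weight on at least 90% of out-of-sample days (the 90% is our assumption), and the MDD comparison uses the worst individual strategy. The oscillating-weights row is left out because the table gives no numeric threshold for it.

```python
import numpy as np

def warning_signs(managed_sharpe, random_sharpes, weights, managed_mdd,
                  strategy_mdds, oos_days, dominance_share=0.9):
    """Return the warning-table rows that apply. Drawdowns are positive fractions.

    `weights` has shape (oos_days, n_strategies). `dominance_share` is a
    hypothetical reading of "consistently" (share of days above 80%).
    """
    flags = []
    if managed_sharpe < np.median(random_sharpes):
        flags.append("Managed Sharpe < random median")
    above_80 = (np.asarray(weights, dtype=float) > 0.80).mean(axis=0)  # share of days, per strategy
    if np.any(above_80 >= dominance_share):
        flags.append("One strategy receives >80% weight consistently")
    if managed_mdd > 2.0 * max(strategy_mdds):   # compared against the worst single strategy
        flags.append("MDD exceeds 2x individual strategy MDD")
    if oos_days < 180:
        flags.append("OOS period < 180 days")
    return flags
```

An empty list means none of the numeric warning signs fired; the bottom panel still needs a visual check for oscillating weights.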