You ran a backtest. The equity curve goes up. The profit factor is above 1. You're ready to trade it live, right?

Not so fast. A positive backtest is the beginning of the evaluation process, not the end. Most strategies with good-looking results are the product of curve fitting, optimization bias, or market conditions that won't repeat. And the standard metrics — net profit, win rate, profit factor — can't tell you the difference.

This is Part 2 of our NinjaTrader backtesting series. Part 1 covered how to generate backtest results. This article tackles the harder question: how do you know if those results actually mean anything?


Why Standard Metrics Fail

A strategy showing $30,000 in profit with a 67% win rate and a 1.25 profit factor could be either of these:

  • A robust edge that will continue performing in live trading, or
  • A curve-fit artifact that happened to look good on historical data but has zero predictive value going forward.

Net profit, win rate, and profit factor can't distinguish between the two. You need deeper analysis. The full article walks through every metric using two real strategies — one that scores 100/100 on robustness, and one that scores 0/100.

A Preview: The Curve-Fit Killer

One of the most powerful tests is deceptively simple: split your backtest in half and compare the two halves.

A robust strategy should perform reasonably well in both halves. If only the second half is profitable, that's the classic curve-fit signature — the parameters were optimized on recent data and the “backtest” is really just showing you the optimization period.

Here's the question that makes this test devastating: “If you'd gone live at the halfway point, would you have stayed live?”

Strategy A (100/100 robustness): First half PF 1.32, second half PF 1.27. Midpoint equity: +$17,993. You'd have stayed live. ✓

Strategy B (0/100 robustness): First half PF 0.93, second half PF 1.80. Midpoint equity: -$29,301. You'd have pulled the plug. ✗
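The split-half check is easy to script yourself. Here is a minimal Python sketch, assuming `pnl` is a chronologically ordered list of per-trade profit/loss values (for example, from a Strategy Analyzer trade export); the function names and sample numbers are illustrative, not part of any NinjaTrader API:

```python
def profit_factor(pnl):
    """Gross profit divided by gross loss (inf if there are no losing trades)."""
    gross_profit = sum(p for p in pnl if p > 0)
    gross_loss = -sum(p for p in pnl if p < 0)
    return gross_profit / gross_loss if gross_loss > 0 else float("inf")

def split_half_check(pnl):
    """Compare the two halves of a chronologically ordered trade list."""
    mid = len(pnl) // 2
    first, second = pnl[:mid], pnl[mid:]
    return {
        "first_half_pf": profit_factor(first),
        "second_half_pf": profit_factor(second),
        # Equity at the halfway point: would you have stayed live?
        "midpoint_equity": sum(first),
    }

# Hypothetical per-trade PnL series, for illustration only
trades = [120, -80, 95, -60, 150, -90, 70, -40, 110, -75]
print(split_half_check(trades))
```

If the two profit factors diverge sharply, or the midpoint equity is deep underwater, treat the full-period result with suspicion.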

Strategy B has twice as many trades as Strategy A. More data should mean more confidence — but volume without edge is meaningless. Strategy B traded 7,203 times and still couldn't produce a statistically significant positive expectancy (p = 0.715). A coin flip with a commission would produce similar results.
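A significance test like the one behind that p = 0.715 can be sketched in a few lines. This example computes a t-statistic for the null hypothesis that mean trade PnL is zero, using a standard-normal approximation that is reasonable at thousands of trades; it illustrates the idea and is not necessarily the exact test the analyzer runs:

```python
import math
from statistics import mean, stdev

def expectancy_p_value(pnl):
    """t-statistic and two-sided p-value for H0: mean trade PnL == 0.
    Approximates the t distribution with a standard normal, which is
    reasonable once the trade count reaches the thousands."""
    n = len(pnl)
    t = mean(pnl) / (stdev(pnl) / math.sqrt(n))
    # Two-sided tail probability under the standard normal
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))
    return t, p

# Hypothetical series: 60 winners of +2, 40 losers of -1
t, p = expectancy_p_value([2] * 60 + [-1] * 40)
print(f"t = {t:.2f}, p = {p:.4f}")
```

A large p-value means the observed expectancy is entirely consistent with a strategy that has no edge at all, no matter how many trades produced it.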

Read the Full Analysis

The complete article covers every metric in the robustness framework with side-by-side comparisons:

  • Equity curve R² — measuring consistency, not just profitability
  • System Quality Number (SQN) — Van Tharp's system quality metric
  • Stability analysis — the midpoint reality check and curve-fit detection
  • Monte Carlo simulation — 5,000 alternate trade sequences and probability of profit
  • Edge detection — p-values, confidence intervals, and statistical significance
  • Noise testing — does the strategy survive real-world slippage?
  • E-Ratio — are your entries actually capturing directional movement?
  • Risk of ruin — critical for prop firm traders with hard drawdown limits
  • Red flags checklist — profit concentration, optimization sensitivity, and more
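To make the Monte Carlo bullet concrete: one common variant bootstraps the trade list (resampling with replacement) to generate alternate equity paths, then measures how often the strategy still finishes profitable and how bad the drawdowns get. A sketch under those assumptions, with invented names, not the Aeromir implementation:

```python
import random

def monte_carlo_profit_probability(pnl, runs=5000, seed=1):
    """Bootstrap the trade list `runs` times (resampling with replacement)
    and report the fraction of alternate sequences that finish profitable,
    plus the worst peak-to-trough drawdown seen across all runs."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    profitable = 0
    worst_drawdown = 0.0
    for _ in range(runs):
        sample = rng.choices(pnl, k=len(pnl))
        equity = peak = drawdown = 0.0
        for p in sample:
            equity += p
            peak = max(peak, equity)
            drawdown = max(drawdown, peak - equity)
        if equity > 0:
            profitable += 1
        worst_drawdown = max(worst_drawdown, drawdown)
    return profitable / runs, worst_drawdown

# Hypothetical per-trade PnL series, for illustration only
prob, max_dd = monte_carlo_profit_probability([120, -80, 95, -60, 150, -90])
print(f"P(profit) = {prob:.3f}, worst drawdown = {max_dd:.0f}")
```

The worst-case drawdown across resampled sequences is especially relevant for the prop-firm scenario in the risk-of-ruin bullet: a hard drawdown limit has to survive the unlucky orderings, not just the historical one.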

The article also introduces the Aeromir Robustness Analyzer — upload your Strategy Analyzer trade export and it runs every test automatically, producing a score from 0 to 100.

Read the full article →