ML for Trading: Lesson 4
Model Evaluation for Time Series
Learn why standard cross-validation is wrong for time series and how to properly evaluate ML trading models.
Evaluating Time Series Models
Standard machine learning cross-validation breaks on time series because it ignores the order of observations. Here's the right way.
Why Standard k-Fold is Wrong
python
# WRONG: standard k-fold with shuffling
from sklearn.model_selection import KFold

# X is assumed to be a pandas DataFrame sorted by date
kf = KFold(n_splits=5, shuffle=True)  # This shuffles time!
for train_idx, test_idx in kf.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    # Problem: X_train can include rows from 2024
    # while X_test includes rows from 2020.
    # We're training on the future and testing on the past!
    # Total nonsense for time series.
    # Even with shuffle=False, every fold except the last
    # trains on rows that come after its test rows.
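Shuffling is not the only problem. As a quick check, here is a sketch on a toy 100-row array where the row index stands in for the date (the helper `count_leaky_folds` is a made-up name for illustration). It counts how many folds train on rows that come after the earliest test row:

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data: 100 rows in chronological order (row index stands in for the date).
X = np.arange(100).reshape(-1, 1)


def count_leaky_folds(splitter, X):
    """Count folds whose training set contains rows later than the earliest test row."""
    leaky = 0
    for train_idx, test_idx in splitter.split(X):
        if train_idx.max() > test_idx.min():
            leaky += 1
    return leaky


leaky_unshuffled = count_leaky_folds(KFold(n_splits=5), X)
leaky_shuffled = count_leaky_folds(KFold(n_splits=5, shuffle=True, random_state=0), X)
print(f"Leaky folds without shuffling: {leaky_unshuffled} of 5")
print(f"Leaky folds with shuffling:    {leaky_shuffled} of 5")
```

On this toy data the unshuffled splitter leaks in 4 of 5 folds: only the last fold, whose test block sits at the very end, trains purely on the past.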
TimeSeriesSplit: The Right Way
python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
# Illustrative splits (expanding window, equal-sized test blocks):
# Split 1:
#   train: 2000-2004, test: 2005-2009
# Split 2:
#   train: 2000-2009, test: 2010-2014
# Split 3:
#   train: 2000-2014, test: 2015-2019
# ... expanding window
for train_idx, test_idx in tscv.split(X):
    X_train = X.iloc[train_idx]
    X_test = X.iloc[test_idx]
    # X_train is ALWAYS before X_test chronologically
    # (provided X is sorted by date).
    # No lookahead bias!
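To see the expanding window concretely, the sketch below assumes 30 yearly observations labelled 2000 to 2029 (made-up labels, chosen so the folds line up with the split comments above) and records the year range of each fold:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Assumed toy data: one observation per year, 2000..2029.
years = np.arange(2000, 2030)
X = years.reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)
folds = []
for train_idx, test_idx in tscv.split(X):
    train_years, test_years = years[train_idx], years[test_idx]
    # Every training year precedes every test year.
    assert train_years.max() < test_years.min()
    folds.append(
        (int(train_years.min()), int(train_years.max()),
         int(test_years.min()), int(test_years.max()))
    )

for i, (tr_lo, tr_hi, te_lo, te_hi) in enumerate(folds, start=1):
    print(f"Split {i}: train {tr_lo}-{tr_hi}, test {te_lo}-{te_hi}")
```

The training window grows by one test block per split, while the test block always sits immediately after it.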
Purged k-Fold (Advanced)
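When predictions overlap in time (for example, each one is built from a multi-day window), observations on either side of the train/test boundary share information. You need to "purge" the observations near that boundary to avoid leakage.
python
# If each prediction uses 20 days of data,
# leave at least 20 days between the end of training and the start of testing.
# (Otherwise the windows overlap, and information from the test period leaks into training.)
# This is complex but important, especially for high-frequency strategies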
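One simple way to approximate purging in a walk-forward setup is to leave a gap between each training window and its test window. The sketch below assumes a 20-row overlap and scikit-learn 0.24 or later, where `TimeSeriesSplit` accepts a `gap` argument. It is a simplification, not a full purged k-fold implementation:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

LOOKBACK = 20  # assumed: each prediction uses 20 rows of data
X = np.arange(500).reshape(-1, 1)  # toy data, rows in chronological order

# gap=LOOKBACK excludes the 20 rows immediately before each test window
# from the corresponding training window.
tscv = TimeSeriesSplit(n_splits=5, gap=LOOKBACK)
gaps = []
for train_idx, test_idx in tscv.split(X):
    # Number of unused rows between the last training row and the first test row.
    gaps.append(int(test_idx.min() - train_idx.max() - 1))

print(gaps)
```

Each fold ends up with exactly `LOOKBACK` unused rows between training and testing, at the cost of slightly less training data.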
Metrics That Matter for Trading
| Metric | Calculation | Trading Meaning |
|---|---|---|
| Directional Accuracy | % of times model direction matches actual direction | Core metric - can it pick winners? |
| Sharpe Ratio | (Mean return - risk-free rate) / standard deviation of returns | Risk-adjusted return when the model's signals are applied to a portfolio |
| Prediction Recall | % of big up moves we catch | Do we get the important moves? |
| Max Drawdown | Biggest peak-to-trough loss | Can your account survive it? |
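The four metrics in the table can be sketched in plain Python as follows. The function names, the 252-period annualization, and the sample numbers are all assumptions for illustration. A prediction counts as "up" only when it is strictly positive:

```python
import math
import statistics


def directional_accuracy(predicted, actual):
    """Share of periods where predicted and actual returns point the same way."""
    hits = sum(1 for p, a in zip(predicted, actual) if (p > 0) == (a > 0))
    return hits / len(actual)


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from per-period returns (risk_free is per period)."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(periods_per_year)


def big_move_recall(predicted, actual, threshold):
    """Share of actual up moves larger than threshold that were predicted as up."""
    big = [(p, a) for p, a in zip(predicted, actual) if a > threshold]
    return sum(1 for p, _ in big if p > 0) / len(big)


def max_drawdown(returns):
    """Largest peak-to-trough decline of the compounded equity curve, as a fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


# Made-up numbers, for illustration only.
actual = [0.02, -0.01, 0.03, -0.02, 0.01]
predicted = [0.01, 0.01, 0.02, -0.01, -0.01]

# Go long when the model predicts up, short otherwise.
strategy_returns = [a if p > 0 else -a for p, a in zip(predicted, actual)]

print(f"Directional accuracy: {directional_accuracy(predicted, actual):.2f}")
print(f"Sharpe ratio:         {sharpe_ratio(strategy_returns):.2f}")
print(f"Big-move recall:      {big_move_recall(predicted, actual, threshold=0.005):.2f}")
print(f"Max drawdown:         {max_drawdown(strategy_returns):.2%}")
```

Accuracy and recall are computed on the predictions themselves, while Sharpe ratio and max drawdown are computed on the returns of a strategy that trades on them.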
The Regime Test
Strategies that work in 2010-2015 (bull market) might fail in 2018 (correction) or 2020 (crash). Test across different market regimes.
python
# Calculate model accuracy by market condition
# (get_vix_data and calculate_accuracy are placeholder helpers you supply)
df['VIX'] = get_vix_data()  # Volatility index, aligned to df's dates
# Split by regime (rows with VIX between 20 and 25 fall in neither group)
calm_periods = df[df['VIX'] < 20]
stressed_periods = df[df['VIX'] > 25]
accuracy_calm = calculate_accuracy(model, calm_periods)
accuracy_stressed = calculate_accuracy(model, stressed_periods)
print(f'Accuracy in calm markets: {accuracy_calm:.3f}')
print(f'Accuracy in stressed markets: {accuracy_stressed:.3f}')
# If stressed accuracy is much worse, your model is regime-dependent and needs work
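For a version that runs without any data source, here is a self-contained sketch of the same idea. The function name, the extra "neutral" bucket for VIX between the two thresholds, and the sample observations are assumptions for illustration. Reporting the number of observations per regime helps show when an accuracy figure rests on too few data points:

```python
def accuracy_by_regime(vix, predicted_up, actual_up, calm_below=20.0, stressed_above=25.0):
    """Directional accuracy per VIX regime; returns {regime: (accuracy, n_obs)}."""
    buckets = {"calm": [], "neutral": [], "stressed": []}
    for v, p, a in zip(vix, predicted_up, actual_up):
        if v < calm_below:
            regime = "calm"
        elif v > stressed_above:
            regime = "stressed"
        else:
            regime = "neutral"  # VIX between the two thresholds
        buckets[regime].append(p == a)
    return {
        name: (sum(hits) / len(hits) if hits else float("nan"), len(hits))
        for name, hits in buckets.items()
    }


# Made-up observations, for illustration only.
vix = [12, 15, 18, 22, 24, 28, 35, 40]
predicted_up = [True, True, False, True, False, True, False, True]
actual_up = [True, True, False, False, False, False, True, True]

result = accuracy_by_regime(vix, predicted_up, actual_up)
for regime, (accuracy, n_obs) in result.items():
    print(f"{regime:>8}: accuracy {accuracy:.3f} over {n_obs} observations")
```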
Out-of-Sample is Sacred
Never touch your out-of-sample test set until final evaluation. NEVER retrain your model based on test set performance. That's cheating: every time you tune against the test set, it stops being out-of-sample. Use a validation set for tuning and the test set for final evaluation only.
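A minimal sketch of a chronological three-way split, assuming the rows are already sorted by date. The 60/20/20 proportions and the function name are illustrative choices, not a recommendation:

```python
def chronological_split(rows, train_frac=0.6, val_frac=0.2):
    """Split time-ordered rows into train / validation / test without shuffling."""
    n = len(rows)
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


# Toy data: 1000 rows in chronological order (the value stands in for the date).
rows = list(range(1000))
train, val, test = chronological_split(rows)

# Tune on `val` as often as needed; evaluate on `test` once, at the very end.
print(len(train), len(val), len(test))
```

Because the slices are taken in order, the validation period sits entirely after the training period, and the test period entirely after both.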
Key Takeaways
- Standard k-fold cross-validation is wrong for time series: it mixes future data into the training set
- Time-series split (walk-forward): train on past, test on future
- Never look ahead in time
- Metrics that matter: directional accuracy, Sharpe ratio, max drawdown