
Model Evaluation for Time Series

Learn why standard cross-validation is wrong for time series and how to properly evaluate ML trading models.


Evaluating Time Series Models

Standard machine learning cross-validation techniques break on time series data because they ignore chronological order. Here's the right way.

Why Standard k-Fold is Wrong

python
# WRONG: standard k-fold with shuffling
from sklearn.model_selection import KFold

# X is assumed to be a pandas DataFrame sorted by date
kf = KFold(n_splits=5, shuffle=True)  # This shuffles time!
for train_idx, test_idx in kf.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]

    # Problem: X_train can include data from 2024
    # while X_test includes data from 2020.
    # We're training on the future and testing on the past,
    # which is meaningless for time series.
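
To make the leak concrete, here is a minimal sketch that checks each shuffled fold for training dates that fall after the start of the test set. The date index and the single random feature are synthetic stand-ins, not data from this lesson.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical data: 1,000 business days of one random feature.
dates = pd.bdate_range("2020-01-01", periods=1000)
X = pd.DataFrame({"feature": np.random.default_rng(0).normal(size=1000)}, index=dates)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
leaky_folds = 0
for train_idx, test_idx in kf.split(X):
    train_dates = X.index[train_idx]
    test_dates = X.index[test_idx]
    # A fold leaks if the model trains on any date later than the first test date.
    if train_dates.max() > test_dates.min():
        leaky_folds += 1

print(f"{leaky_folds} of {kf.get_n_splits()} folds train on data from after the test period")
```

Setting shuffle=False does not fix this on its own: every fold except the last still trains on dates that come after its test block.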

TimeSeriesSplit: The Right Way

python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
splits = list(tscv.split(X))

# Illustrative expanding-window splits (actual boundaries depend on your data):
# Split 1:
# train: 2000-2004, test: 2005-2009
# Split 2:
# train: 2000-2009, test: 2010-2014
# Split 3:
# train: 2000-2014, test: 2015-2019
# ... the training window keeps expanding

for train_idx, test_idx in tscv.split(X):
    X_train = X.iloc[train_idx]
    X_test = X.iloc[test_idx]

    # X_train is ALWAYS before X_test chronologically
    # No lookahead bias!
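
As a rough end-to-end sketch, the loop below fits a model on each training window and scores it on the following test window. The features, the target, and the choice of LogisticRegression are all assumptions made for illustration; any estimator with fit and score would work the same way.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(42)
dates = pd.bdate_range("2015-01-01", periods=1500)
X = pd.DataFrame(rng.normal(size=(1500, 3)), index=dates, columns=["f1", "f2", "f3"])
# Hypothetical target: 1 for an "up" day, weakly related to f1 plus noise.
y = (X["f1"] + rng.normal(scale=2.0, size=1500) > 0).astype(int)

tscv = TimeSeriesSplit(n_splits=5)
fold_scores = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]

    # Sanity check: every training date precedes every test date.
    assert X_train.index.max() < X_test.index.min()

    model = LogisticRegression().fit(X_train, y_train)
    score = model.score(X_test, y_test)  # classification accuracy on the test window
    fold_scores.append(score)
    print(f"Fold {fold}: train through {X_train.index.max().date()}, "
          f"test {X_test.index.min().date()} to {X_test.index.max().date()}, "
          f"accuracy {score:.3f}")

print(f"Mean out-of-sample accuracy: {np.mean(fold_scores):.3f}")
```

Looking at the per-fold scores, not just the mean, shows whether performance is stable over time or driven by one lucky period.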

Purged k-Fold (Advanced)

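When samples overlap in time (for example, each one is built from a 20-day window), samples on either side of the train/test boundary share information. You need to "purge" the samples near that boundary to avoid leakage.

python
# If each sample's features or label span 20 days of data,
# leave a gap of at least 20 days between the end of training and the start of testing.
# Otherwise samples on both sides of the boundary are built from the same days,
# which leaks information from the test period into training.

# This is more complex to implement, but it matters whenever samples overlap,
# including in high-frequency strategies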
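
A simple approximation of this idea (not a full purged k-fold implementation) is the gap parameter of scikit-learn's TimeSeriesSplit, which drops a fixed number of samples between each training window and its test window. The 20-day horizon and the placeholder feature array below are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

LABEL_HORIZON = 20  # assumption: each sample spans 20 days
n_samples = 1000
X = np.arange(n_samples).reshape(-1, 1)  # placeholder features; row i stands for day i

tscv = TimeSeriesSplit(n_splits=5, gap=LABEL_HORIZON)
gaps = []
for train_idx, test_idx in tscv.split(X):
    # Count the rows skipped between the last training row and the first test row.
    skipped = test_idx[0] - train_idx[-1] - 1
    gaps.append(skipped)
    print(f"train: 0..{train_idx[-1]}, test: {test_idx[0]}..{test_idx[-1]}, skipped: {skipped}")
```

The gap counts rows, so it equals 20 days only if the data has exactly one row per day.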

Metrics That Matter for Trading

Metric | Calculation | Trading Meaning
Directional Accuracy | % of times model direction matches actual direction | Core metric - can it pick winners?
Sharpe Ratio | (Return - Rf) / Volatility | Risk-adjusted return when applied to portfolio
Prediction Recall | % of big up moves we catch | Do we get the important moves?
Max Drawdown | Biggest peak-to-trough loss | Can your account survive it?
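
The four metrics in the table can be sketched in a few lines of NumPy. The function names, the 2% "big move" threshold, the 252-period annualization, and the sample numbers are assumptions chosen for illustration, not definitions from this lesson.

```python
import numpy as np

def directional_accuracy(predicted, actual):
    """Share of periods where the predicted sign matches the realized sign."""
    return float(np.mean(np.sign(predicted) == np.sign(actual)))

def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from per-period returns and an annual risk-free rate."""
    excess = np.asarray(returns) - risk_free / periods_per_year
    return float(np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1))

def big_move_recall(predicted, actual, threshold=0.02):
    """Share of large up moves (actual > threshold) that were predicted as up."""
    big_up = np.asarray(actual) > threshold
    if not big_up.any():
        return float("nan")
    return float(np.mean(np.asarray(predicted)[big_up] > 0))

def max_drawdown(returns):
    """Largest peak-to-trough decline of the compounded equity curve (a negative number)."""
    equity = np.concatenate(([1.0], np.cumprod(1.0 + np.asarray(returns))))
    running_peak = np.maximum.accumulate(equity)
    return float(np.min(equity / running_peak - 1.0))

# Tiny hand-made example (hypothetical numbers)
predicted = np.array([0.01, -0.02, 0.03, 0.01, -0.01])
actual = np.array([0.02, -0.01, 0.03, -0.04, 0.01])
strategy_returns = np.sign(predicted) * actual  # long when predicting up, short otherwise

print(f"Directional accuracy: {directional_accuracy(predicted, actual):.2f}")
print(f"Big-move recall:      {big_move_recall(predicted, actual, threshold=0.015):.2f}")
print(f"Max drawdown:         {max_drawdown(strategy_returns):.4f}")
print(f"Sharpe ratio:         {sharpe_ratio(strategy_returns):.2f}")
```

Directional accuracy and recall are computed on predictions, while Sharpe ratio and max drawdown are computed on the returns of a strategy that trades those predictions, so they also depend on position sizing and costs, which this sketch ignores.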

The Regime Test

Strategies that work in 2010-2015 (bull market) might fail in 2018 (correction) or 2020 (crash). Test across different market regimes.

python
# Calculate model accuracy by market condition
# get_vix_data() and calculate_accuracy() are placeholder helpers you supply
df['VIX'] = get_vix_data()  # Get volatility index, aligned to df's dates

# Split by regime (days with VIX between 20 and 25 are left out of both groups)
calm_periods = df[df['VIX'] < 20]
stressed_periods = df[df['VIX'] > 25]

accuracy_calm = calculate_accuracy(model, calm_periods)
accuracy_stressed = calculate_accuracy(model, stressed_periods)

print(f'Accuracy in calm markets: {accuracy_calm:.3f}')
print(f'Accuracy in stressed markets: {accuracy_stressed:.3f}')

# If stressed accuracy is much worse, the model is not robust across regimes
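
For a version that runs without any external helpers, the sketch below groups prediction accuracy by VIX regime using pandas. All three columns are synthetic placeholders, and the extra "transition" bucket for VIX between 20 and 25 is an assumption added so that no rows are dropped.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 500
df = pd.DataFrame({
    "VIX": rng.uniform(10, 40, size=n),          # synthetic stand-in for real VIX data
    "predicted_up": rng.integers(0, 2, size=n),  # synthetic model predictions (1 = up)
    "actual_up": rng.integers(0, 2, size=n),     # synthetic realized direction
})

# Label each row with a regime using the 20 and 25 cut-offs.
df["regime"] = pd.cut(df["VIX"], bins=[0, 20, 25, np.inf],
                      labels=["calm", "transition", "stressed"])
df["correct"] = df["predicted_up"] == df["actual_up"]

# Accuracy and sample count per regime.
summary = df.groupby("regime", observed=True)["correct"].agg(accuracy="mean", n_obs="count")
print(summary)
```

Reporting the sample count next to each accuracy helps here: a regime with only a handful of observations gives a noisy estimate either way.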

Out-of-Sample is Sacred

Never touch your out-of-sample test set until the final evaluation. NEVER retrain or retune your model based on test set performance: once you do, the test set no longer measures out-of-sample performance. Use a validation set for tuning and the test set for final evaluation only.
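
One way to enforce this is to carve the data into three chronological blocks up front. The 60/20/20 proportions and the synthetic DataFrame below are assumptions for illustration.

```python
import numpy as np
import pandas as pd

dates = pd.bdate_range("2012-01-02", periods=2000)
df = pd.DataFrame({"feature": np.random.default_rng(1).normal(size=2000)}, index=dates)

# Assumed proportions: 60% train, 20% validation, final 20% held out.
train_end = len(df) * 6 // 10
val_end = len(df) * 8 // 10

train = df.iloc[:train_end]              # fit models here
validation = df.iloc[train_end:val_end]  # tune hyperparameters and compare models here
test = df.iloc[val_end:]                 # evaluate once, at the very end

# The three blocks must be in strict chronological order with no overlap.
assert train.index.max() < validation.index.min()
assert validation.index.max() < test.index.min()
print(len(train), len(validation), len(test))
```

Keeping the test block as the most recent data means the final evaluation resembles live trading as closely as a backtest can.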

Key Takeaways
  • Standard k-fold cross-validation is wrong for time series: it mixes past and future data across folds
  • Time-series split (walk-forward): train on past, test on future
  • Never look ahead in time
  • Metrics that matter: directional accuracy, Sharpe ratio, max drawdown