AgenticTradingML — AI-Powered Autonomous Trading Platform

Evaluating Time Series Models

Standard machine learning cross-validation techniques break on time series data because they ignore the order of observations. Here's the right way.

Why Standard k-Fold is Wrong

python

# WRONG: standard k-fold with shuffling
# X is assumed to be a pandas DataFrame whose rows are in chronological order
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True)  # Shuffling destroys the time order!
for train_idx, test_idx in kf.split(X):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]

    # Problem: X_train can include rows from 2024
    # while X_test includes rows from 2020.
    # We're training on the future and testing on the past,
    # which makes the scores meaningless for time series.
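
To make the leak concrete, here is a minimal sketch that counts how many shuffled folds contain training rows that come after the start of the test period. The integer index standing in for time, the sample size, and the `random_state` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

# 100 observations in chronological order; the row index stands in for time.
n_samples = 100
X = np.arange(n_samples).reshape(-1, 1)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

leaky_folds = 0
for train_idx, test_idx in kf.split(X):
    # A fold leaks if any training row is later than the earliest test row.
    if train_idx.max() > test_idx.min():
        leaky_folds += 1

print(f"{leaky_folds} of {kf.get_n_splits()} folds train on rows from after the test period starts")
```

Note that turning shuffling off does not fix this: plain k-fold still trains on later blocks whenever the test block is not the last one.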

TimeSeriesSplit: The Right Way

python

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
splits = list(tscv.split(X))

# Split 1:
# train: 2000-2004, test: 2005-2009
# Split 2:
# train: 2000-2009, test: 2010-2014
# Split 3:
# train: 2000-2014, test: 2015-2019
# ... and so on: the training window expands, the test block stays the same length

for train_idx, test_idx in tscv.split(X):
    X_train = X.iloc[train_idx]
    X_test = X.iloc[test_idx]

    # X_train is ALWAYS before X_test chronologically
    # (as long as X is sorted by time), so no lookahead bias!
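
The ordering guarantee can be checked directly. The sketch below uses 120 synthetic rows (an assumption chosen so the blocks divide evenly) and records the first and last index of each training and test window.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 120 observations in chronological order; the row index stands in for time.
n_samples = 120
X = np.arange(n_samples).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=5)

fold_bounds = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    bounds = (int(train_idx.min()), int(train_idx.max()),
              int(test_idx.min()), int(test_idx.max()))
    fold_bounds.append(bounds)
    print(f"Fold {fold}: train rows {bounds[0]}-{bounds[1]}, test rows {bounds[2]}-{bounds[3]}")

# Every training window ends before its test window begins.
always_ordered = all(train_end < test_start
                     for _, train_end, test_start, _ in fold_bounds)
print("Train always precedes test:", always_ordered)
```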

Purged k-Fold (Advanced)

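When each prediction is built from a multi-day window, samples on either side of the train/test boundary overlap in time. You need to "purge" the observations near that boundary to avoid leakage.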

python

# If each prediction uses 20 days of data,
# don't test within 20 days after training ends
# (the windows overlap, so samples just before and just after the boundary share data and leak information)

# This is complex but important for high-frequency strategies
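
A simple approximation of the idea is to leave a gap between the end of training and the start of testing. The sketch below uses the `gap` parameter of `TimeSeriesSplit` (available in scikit-learn 0.24 and later). It is not a full purged k-fold implementation; the 20-row gap and the 300-row synthetic series are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 300 daily observations in chronological order; the row index stands in for days.
n_samples = 300
X = np.arange(n_samples).reshape(-1, 1)

window = 20  # assumed number of days each prediction draws on

# gap=window drops the last `window` rows before each test block from training.
tscv = TimeSeriesSplit(n_splits=5, gap=window)

gaps = []
for train_idx, test_idx in tscv.split(X):
    # Number of rows skipped between the last training row and the first test row.
    skipped = int(test_idx.min() - train_idx.max() - 1)
    gaps.append(skipped)
    print(f"train ends at row {train_idx.max()}, test starts at row {test_idx.min()}, "
          f"skipped {skipped} rows")
```

The trade-off is that every skipped row is data the model never trains on, so the gap should be no longer than the overlap it is meant to remove.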

Metrics That Matter for Trading

Metric	Calculation	Trading Meaning
Directional Accuracy	% of times model direction matches actual direction	Core metric - can it pick winners?
Sharpe Ratio	(Return - Rf) / Volatility, where Rf is the risk-free rate	Risk-adjusted return when applied to a portfolio
Prediction Recall	% of big up moves we catch	Do we get the important moves?
Max Drawdown	Biggest peak-to-trough loss	Can your account survive it?
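
The four metrics in the table can be sketched with NumPy as follows. The function names, the 2% threshold for a "big" up move, the 252-day annualization, and the toy return series are all assumptions for illustration; the strategy is assumed to go long or short one unit according to the sign of the prediction, with no trading costs.

```python
import numpy as np

def directional_accuracy(predicted, actual):
    """Share of periods where the predicted sign matches the realized sign."""
    return float(np.mean(np.sign(predicted) == np.sign(actual)))

def sharpe_ratio(returns, risk_free_rate=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from per-period returns and an annual risk-free rate."""
    excess = np.asarray(returns) - risk_free_rate / periods_per_year
    return float(np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1))

def recall_big_up_moves(predicted, actual, threshold=0.02):
    """Share of large up moves (actual > threshold) that the model predicted as up."""
    big_up = actual > threshold
    if not big_up.any():
        return float("nan")
    return float(np.mean(predicted[big_up] > 0))

def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve (a negative number)."""
    equity = np.concatenate(([1.0], np.cumprod(1.0 + np.asarray(returns))))
    running_peak = np.maximum.accumulate(equity)
    return float(np.min(equity / running_peak - 1.0))

# Toy data: realized returns and the model's predicted returns.
actual = np.array([0.01, -0.02, 0.03, -0.01, 0.025])
predicted = np.array([0.005, 0.01, 0.02, -0.005, -0.01])

# Long when the prediction is positive, short when it is negative.
strategy_returns = np.sign(predicted) * actual

accuracy = directional_accuracy(predicted, actual)
recall = recall_big_up_moves(predicted, actual)
sharpe = sharpe_ratio(strategy_returns)
drawdown = max_drawdown(strategy_returns)

print(f"Directional accuracy: {accuracy:.2f}")
print(f"Recall on big up moves: {recall:.2f}")
print(f"Annualized Sharpe: {sharpe:.2f}")
print(f"Max drawdown: {drawdown:.3%}")
```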

The Regime Test

Strategies that work in 2010-2015 (bull market) might fail in 2018 (correction) or 2020 (crash). Test across different market regimes.

python

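# Calculate model accuracy by market condition
# (get_vix_data and calculate_accuracy are placeholders for your own helpers)
df['VIX'] = get_vix_data()  # Get volatility index, aligned to df's dates

# Split by regime (days with VIX between 20 and 25 fall in neither group)
calm_periods = df[df['VIX'] < 20]
stressed_periods = df[df['VIX'] > 25]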

accuracy_calm = calculate_accuracy(model, calm_periods)
accuracy_stressed = calculate_accuracy(model, stressed_periods)

print(f'Accuracy in calm markets: {accuracy_calm:.3f}')
print(f'Accuracy in stressed markets: {accuracy_stressed:.3f}')

# If stressed accuracy is much worse, your model needs work
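
For a version that runs without external helpers, the sketch below labels each row with a regime and reports accuracy together with the sample count per regime. The tiny hand-made DataFrame, the column names, and the extra "elevated" bucket for VIX between 20 and 25 are assumptions for illustration.

```python
import pandas as pd

# Hand-made example: VIX level plus predicted and realized direction per day.
df = pd.DataFrame({
    "vix":          [12, 15, 18, 22, 27, 30, 35, 14],
    "predicted_up": [True, True, False, True, False, True, False, True],
    "actual_up":    [True, True, False, False, True, True, True, True],
})

# Bucket each day by volatility: [0, 20) calm, [20, 25) elevated, [25, inf) stressed.
df["regime"] = pd.cut(
    df["vix"],
    bins=[0, 20, 25, float("inf")],
    labels=["calm", "elevated", "stressed"],
    right=False,
)

df["correct"] = df["predicted_up"] == df["actual_up"]

# Mean of a boolean column is the accuracy; count shows how much data backs it.
accuracy_by_regime = df.groupby("regime", observed=True)["correct"].agg(["mean", "count"])
print(accuracy_by_regime)
```

Reporting the count next to the accuracy matters because stressed periods are usually rare, so an accuracy figure based on a handful of days says little.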

Out-of-Sample is Sacred

Never touch your out-of-sample test set until final evaluation. NEVER retrain or retune your model based on test set performance. That's cheating, because the test score then stops being an honest estimate of future performance. Use a validation set for tuning and the test set for final evaluation only.
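
One way to enforce this is to carve out the three sets chronologically before any modeling starts. The sketch below is a minimal example; the function name `chronological_split` and the 60/20/20 proportions are assumptions, not a prescribed standard.

```python
import numpy as np

def chronological_split(n_samples, val_fraction=0.2, test_fraction=0.2):
    """Split row positions into train, validation, and test blocks in time order."""
    n_test = int(n_samples * test_fraction)
    n_val = int(n_samples * val_fraction)
    val_start = n_samples - n_test - n_val
    test_start = n_samples - n_test
    idx = np.arange(n_samples)
    return idx[:val_start], idx[val_start:test_start], idx[test_start:]

# Rows are assumed to be sorted by time, oldest first.
train_idx, val_idx, test_idx = chronological_split(1000)

print(f"train: rows {train_idx[0]}-{train_idx[-1]}")
print(f"validation: rows {val_idx[0]}-{val_idx[-1]}  (tune hyperparameters here)")
print(f"test: rows {test_idx[0]}-{test_idx[-1]}  (evaluate once, at the very end)")
```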
