ML for Trading, Lesson 3

Avoiding Lookahead Bias & Data Leakage

The #1 mistake in ML trading: accidentally using future information. Learn how to avoid it.

10 minute read
4 key takeaways

Lookahead Bias: The #1 ML Trading Mistake

Many beginner ML trading models fail in live trading because they cheat in the backtest: they use information they would not have had at decision time. Here's how to avoid it.

What IS Lookahead Bias?

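Lookahead bias means a model or backtest uses information that was not available at the moment the decision had to be made. Common forms:

  • Using today's close to generate a signal, then assuming the trade fills at a price you could no longer get (today's open, or that same close)
  • Calculating indicators using future data
  • Normalizing features with statistics from the entire dataset (including future rows)
  • Using the target variable in feature creation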
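
The first item in the list above can be made concrete with a toy backtest. This is a minimal sketch, not a full backtester: the prices are made up, and the "long after an up day" signal is a hypothetical rule chosen only because it makes the timing error easy to see.

```python
import pandas as pd

# Toy closing prices (made up for illustration)
close = pd.Series([100.0, 102.0, 101.0, 105.0, 104.0, 108.0])

# Close-to-close return realized ON each day
daily_ret = close.pct_change()

# Hypothetical signal: long (1) when today's close is above yesterday's close
signal = (close > close.shift(1)).astype(int)

# WRONG: the signal needs today's close, yet it is credited with today's return
leaky_pnl = (signal * daily_ret).sum()

# RIGHT: a position decided at the close of day t earns the return of day t+1
honest_pnl = (signal.shift(1) * daily_ret).sum()

print(f"leaky: {leaky_pnl:.4f}  honest: {honest_pnl:.4f}")
```

In the leaky version the signal is 1 exactly on the days that went up, so the backtest collects every positive return by construction. Shifting the signal by one row is what turns it into a tradable rule.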

Example: The Wrong Way vs Right Way

python
# WRONG: Lookahead bias
# calculate_rsi is a placeholder; the leak appears whenever the indicator is not
# strictly causal (centered windows, smoothing or scaling over the whole series)
df['RSI'] = calculate_rsi(df['Close'], 14)  # May use rows after the current one!
df['Signal'] = df['RSI'] < 30
df['Return_5d'] = df['Close'].shift(-5) / df['Close'] - 1

# When we backtest:
# - At day 100, the RSI value can depend on prices after day 100
# - The signal therefore already contains information about the future
# - We check whether the return from day 100 to day 105 was positive
# - Any edge we measure comes from the leak, not from predictive power


# RIGHT: No lookahead
import numpy as np

df['RSI'] = np.nan
for i in range(14, len(df)):
    # RSI at row i is computed from rows 0..i only, so later rows cannot affect it
    # (assumes calculate_rsi returns a Series; we keep its last value)
    df.loc[df.index[i], 'RSI'] = calculate_rsi(df['Close'].iloc[:i + 1], 14).iloc[-1]

df['Signal'] = df['RSI'] < 30
df['Return_5d'] = df['Close'].shift(-5) / df['Close'] - 1

# When we backtest:
# - At day 100, we calculate RSI using only data up to day 100
# - Signal is based on past data only
# - Return is the future return from day 100 to day 105
# - This properly tests predictive power
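
One way to check whether an indicator leaks is to recompute it on data truncated at a given row and compare with the value obtained from the full series. The sketch below assumes a simple-moving-average RSI (`rolling_rsi`) and a deliberately leaky whole-series z-score (`zscore_full`); both functions and the `is_causal` helper are illustrative names, not part of any library.

```python
import numpy as np
import pandas as pd

def rolling_rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """Simple-moving-average RSI; each value uses only the current and earlier rows."""
    delta = close.diff()
    gain = delta.clip(lower=0).rolling(period).mean()
    loss = (-delta.clip(upper=0)).rolling(period).mean()
    rs = gain / loss
    return 100 - 100 / (1 + rs)

def zscore_full(close: pd.Series) -> pd.Series:
    """Leaky on purpose: the mean and std come from the entire series."""
    return (close - close.mean()) / close.std()

def is_causal(feature_fn, close: pd.Series, cutoff: int) -> bool:
    """True if the feature at row `cutoff` is unchanged when later rows are removed."""
    full = feature_fn(close).iloc[cutoff]
    truncated = feature_fn(close.iloc[: cutoff + 1]).iloc[-1]
    return bool(np.isclose(full, truncated, equal_nan=True))

# Synthetic random-walk prices, seeded so the run is repeatable
rng = np.random.default_rng(0)
close = pd.Series(100 + rng.normal(0, 1, 200).cumsum())

causal_ok = is_causal(rolling_rsi, close, cutoff=100)
leaky_ok = is_causal(zscore_full, close, cutoff=100)
print(causal_ok, leaky_ok)
```

If a feature passes this check at several cutoffs, computing it once over the whole series is equivalent to the row-by-row loop and far faster. If it fails, the feature depends on future rows and needs to be rewritten.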

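The .shift() Method: Your Friend

Use .shift() to control which row each value lands on. A positive shift such as .shift(1) moves a value one row later, so a feature reflects only earlier rows. A negative shift such as .shift(-5) pulls a value from 5 rows ahead, which belongs in labels only, never in features.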

python
# Create label: will price be higher in 5 days?
df['Return_5d'] = df['Close'].shift(-5) / df['Close'] - 1

# Now create features using ONLY past data
df['SMA_20'] = df['Close'].rolling(20).mean()
df['Features_available_at_t'] = df['SMA_20'].shift(1)  # SHIFT TO AVOID LOOKAHEAD

# Drop rows with no known future return or no full feature window;
# otherwise the last 5 rows would silently get Label = 0 (NaN > 0 is False)
df = df.dropna(subset=['Return_5d', 'Features_available_at_t']).copy()
df['Label'] = (df['Return_5d'] > 0).astype(int)

# Now at row 100:
# - df.loc[100, 'Features_available_at_t'] has data from row 99 and earlier
# - df.loc[100, 'Label'] reflects the return from row 100 to row 105
# - No cheating!
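
The alignment described above is easier to verify on a frame small enough to check by hand. This sketch uses made-up prices, a 3-row SMA and a 2-row label horizon instead of the lesson's 20 and 5, purely to keep the example short; the column names `Return_fwd` and `Feature` are illustrative.

```python
import pandas as pd

horizon = 2   # shorter than the lesson's 5 days so the toy frame stays small
window = 3    # shorter than the lesson's 20-day SMA for the same reason

df = pd.DataFrame({"Close": [10.0, 11.0, 12.0, 11.0, 13.0, 14.0, 12.0, 15.0]})

df["Return_fwd"] = df["Close"].shift(-horizon) / df["Close"] - 1  # label source
df["SMA"] = df["Close"].rolling(window).mean()
df["Feature"] = df["SMA"].shift(1)                                # past rows only

# Rows without a full feature window or a known future return cannot be used
usable = df.dropna(subset=["Feature", "Return_fwd"]).copy()
usable["Label"] = (usable["Return_fwd"] > 0).astype(int)

# Recompute row 4 by hand to confirm which rows each column draws on
row = 4
feature_from_past = df["Close"].iloc[row - window: row].mean()    # rows 1..3
return_from_future = df["Close"].iloc[row + horizon] / df["Close"].iloc[row] - 1

print(usable[["Feature", "Return_fwd", "Label"]])
```

Only rows 3 to 5 survive: earlier rows lack a full feature window, and the last two rows have no future price to label against.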

Survivorship Bias

Many listed companies eventually go bankrupt or are delisted. If you backtest only on stocks that still trade today, you exclude the biggest losses the strategy could have taken.

  • Enron, Blockbuster, Lehman Brothers: all were well-known listed companies that ended in bankruptcy
  • Your backtest looks great because you ignored the -100% returns
  • Solution: Use a point-in-time universe, meaning the stocks that were tradable on each historical date, including ones that are delisted now
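
A point-in-time universe can be sketched as a filter over listing and delisting dates. The table below is entirely hypothetical (tickers, dates and column names are invented), and real data vendors structure this differently, so treat it as an outline of the idea rather than a recipe.

```python
import pandas as pd

# Hypothetical listing table: one row per stock with the dates it traded.
# `delisted` is NaT for stocks that are still listed.
listings = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "listed": pd.to_datetime(["1995-01-03", "2000-06-01", "2012-03-15"]),
    "delisted": pd.to_datetime(["2003-07-18", None, None]),
})

def universe_on(date: str, listings: pd.DataFrame) -> list:
    """Tickers that were tradable on `date`, including ones delisted later."""
    d = pd.Timestamp(date)
    active = (listings["listed"] <= d) & (
        listings["delisted"].isna() | (listings["delisted"] > d)
    )
    return listings.loc[active, "ticker"].tolist()

# Survivorship-biased universe: only what is still listed today
survivors_only = listings.loc[listings["delisted"].isna(), "ticker"].tolist()

# Point-in-time universe for a backtest date in early 2001
universe_2001 = universe_on("2001-01-02", listings)

print(survivors_only, universe_2001)
```

The survivor list drops AAA, which a trader in 2001 could have bought, and includes CCC, which did not exist yet. The point-in-time filter fixes both problems.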

Normalization Mistake: Fit on Training, Not All Data

python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# WRONG: Fit scaler on entire dataset (includes test data!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Fit on ALL data - LOOKAHEAD!
X_train, X_test = train_test_split(X_scaled, shuffle=False)

# Model sees statistics from test set in training


# RIGHT: Fit scaler on training data only
# shuffle=False keeps the split chronological; the default random shuffle
# would mix future rows into the training set
X_train, X_test = train_test_split(X, shuffle=False)
scaler = StandardScaler()
scaler.fit(X_train)  # Fit ONLY on training data
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # Transform test with training stats

# Test data is truly held out
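
For features that are normalized row by row inside a time series, a related option is to scale each value with statistics from strictly earlier rows. This is a different technique from the train-only scaler above, shown here as a sketch; the function name `expanding_zscore` and the `min_periods` default are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def expanding_zscore(x: pd.Series, min_periods: int = 20) -> pd.Series:
    """Z-score each value using the mean and std of strictly earlier rows."""
    past_mean = x.expanding(min_periods=min_periods).mean().shift(1)
    past_std = x.expanding(min_periods=min_periods).std().shift(1)
    return (x - past_mean) / past_std

# Synthetic random-walk feature, seeded so the run is repeatable
rng = np.random.default_rng(42)
x = pd.Series(rng.normal(0, 1, 300).cumsum())

z_full = expanding_zscore(x)
z_truncated = expanding_zscore(x.iloc[:151])  # pretend rows after 150 do not exist yet

# The value at row 150 is the same whether or not later rows are present
print(z_full.iloc[150], z_truncated.iloc[150])
```

Because the statistics are shifted by one row, the value at row t never depends on row t+1 or later, at the cost of leaving the first `min_periods` rows undefined.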

These Mistakes are Insidious

Lookahead bias is easy to introduce accidentally. The backtest looks amazing because the model is effectively reading the answers, and then live trading loses money because the model was never predictive in the first place. Always be paranoid about data leakage.

Key Takeaways
  • Lookahead bias: Using information not available at decision time
  • Data leakage: Information from the target or the test set leaking into features or training
  • Survivorship bias: Only testing on stocks that survived
  • Simple rule: Use only data that was available at the time of the prediction