ML for Trading: Lesson 1
Feature Engineering: Creating Predictive Inputs
Learn how to create features from price data that machine learning models can use to predict price movements.
Feature Engineering: The Art of Prep
Machine learning models are only as good as the data you feed them. Feature engineering is the process of creating useful input variables from raw price data.
Price-Based Features
- Returns: (P_t - P_{t-1}) / P_{t-1} (simple percentage change)
- Log Returns: ln(P_t / P_{t-1}) (mathematically cleaner for modeling)
- Rolling Mean: Average price over last N periods (trend)
- Rolling Std: Price volatility over last N periods
- Price vs. SMA: How far price is from its average (reversion signal)
- High/Low Range: (High - Low) / Close (intraday volatility)
python
import pandas as pd
import numpy as np
df = pd.read_csv('SPY.csv')
# Create features
df['Return'] = df['Close'].pct_change()
df['Log_Return'] = np.log(df['Close'] / df['Close'].shift(1))
df['SMA_20'] = df['Close'].rolling(20).mean()
df['STD_20'] = df['Close'].rolling(20).std()
df['Price_vs_SMA'] = (df['Close'] - df['SMA_20']) / df['SMA_20']
df['Range'] = (df['High'] - df['Low']) / df['Close']
Technical Indicator Features
- RSI: Momentum oscillator bounded between 0 and 100
- MACD: Trend-following indicator
- ATR: Volatility measure
- OBV: On-Balance Volume (accumulation/distribution)
- Bollinger Bands: Overbought/oversold zones
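The indicators in this list can be computed with plain pandas. The sketch below covers RSI, MACD, and ATR under a few assumptions: the function names are illustrative, the periods (14 for RSI and ATR, 12/26/9 for MACD) are the commonly quoted defaults rather than values from this lesson, RSI uses Wilder-style exponential smoothing, and ATR uses a simple rolling mean of the true range. The synthetic price series only stands in for real OHLC data.

```python
import numpy as np
import pandas as pd

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """RSI with Wilder-style smoothing (alpha = 1 / period)."""
    delta = close.diff()
    gain = delta.clip(lower=0)       # positive moves only
    loss = -delta.clip(upper=0)      # negative moves, as positive numbers
    avg_gain = gain.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

def macd(close: pd.Series, fast: int = 12, slow: int = 26, signal: int = 9) -> pd.DataFrame:
    """MACD line (fast EMA minus slow EMA) and its signal line."""
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    macd_line = ema_fast - ema_slow
    signal_line = macd_line.ewm(span=signal, adjust=False).mean()
    return pd.DataFrame({'MACD': macd_line, 'MACD_signal': signal_line})

def atr(high: pd.Series, low: pd.Series, close: pd.Series, period: int = 14) -> pd.Series:
    """Average true range, here a simple rolling mean of the true range."""
    prev_close = close.shift(1)
    true_range = pd.concat(
        [high - low, (high - prev_close).abs(), (low - prev_close).abs()],
        axis=1,
    ).max(axis=1)
    return true_range.rolling(period).mean()

# Synthetic OHLC data standing in for a real price file (assumption)
rng = np.random.default_rng(0)
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 200))))
df = pd.DataFrame({
    'High': close * (1 + rng.uniform(0, 0.01, 200)),
    'Low': close * (1 - rng.uniform(0, 0.01, 200)),
    'Close': close,
})

df['RSI'] = rsi(df['Close'])
df = df.join(macd(df['Close']))
df['ATR'] = atr(df['High'], df['Low'], df['Close'])
```

The first rows of each indicator are NaN (or unreliable for the EMAs) until enough history has accumulated, so they are usually dropped before training.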
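Important: Scale-sensitive models (linear models, SVMs, neural networks) work best when all features share a comparable scale, either standardized to mean 0 and standard deviation 1 or min-max scaled to [0, 1] or [-1, 1], so the model doesn't overweight large-magnitude features. Fit the scaler on the training period only; fitting it on the full dataset leaks future information into the features.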
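python
from sklearn.preprocessing import StandardScaler
# Assumes df already has 'RSI', 'MACD', and 'ATR' columns
cols = ['RSI', 'MACD', 'ATR']
# Fit on the training portion only so later data does not leak into the scaling
split = int(len(df) * 0.8)  # chronological 80/20 split (illustrative)
scaler = StandardScaler()
scaler.fit(df[cols].iloc[:split])
# Standardize every row with the training mean and std
features_scaled = scaler.transform(df[cols])
df['RSI_scaled'] = features_scaled[:, 0]
df['MACD_scaled'] = features_scaled[:, 1]
df['ATR_scaled'] = features_scaled[:, 2]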
Volume Features
- Volume Ratio: Today's volume / 20-day average
- Volume Trend: Is volume increasing or decreasing?
- VWAP Deviation: How far from volume-weighted average price
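The three volume features can be sketched in pandas as follows. This assumes daily bars with `High`, `Low`, `Close`, and `Volume` columns. The function name `add_volume_features`, the 5-day short window, and the rolling VWAP built from the typical price are all illustrative choices; true VWAP is computed from intraday trades, so the daily version here is only an approximation.

```python
import numpy as np
import pandas as pd

def add_volume_features(df: pd.DataFrame, window: int = 20) -> pd.DataFrame:
    """Return a copy of df with three volume-based feature columns added."""
    out = df.copy()
    vol_avg = out['Volume'].rolling(window).mean()

    # Volume Ratio: today's volume relative to the rolling average
    out['Volume_Ratio'] = out['Volume'] / vol_avg

    # Volume Trend: short-term average over long-term average (> 1 means rising)
    out['Volume_Trend'] = out['Volume'].rolling(5).mean() / vol_avg

    # VWAP Deviation: close relative to a rolling volume-weighted typical price
    typical = (out['High'] + out['Low'] + out['Close']) / 3
    vwap = (typical * out['Volume']).rolling(window).sum() / out['Volume'].rolling(window).sum()
    out['VWAP_Dev'] = (out['Close'] - vwap) / vwap
    return out

# Small synthetic example (values are made up for illustration)
rng = np.random.default_rng(0)
close = pd.Series(100 + np.cumsum(rng.normal(0, 0.5, 60)))
bars = pd.DataFrame({
    'High': close + 0.5,
    'Low': close - 0.5,
    'Close': close,
    'Volume': rng.integers(1_000, 5_000, 60).astype(float),
})
bars = add_volume_features(bars)
```

All three features are ratios, so they are already roughly comparable across tickers with very different share volumes.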
Time Features (Seasonality)
- Day of Week: Monday, Tuesday, etc. (some studies report that returns differ by weekday)
- Month: Calendar month (e.g., the January Effect, a historically reported tendency for higher returns in January)
- Quarter End: End-of-period portfolio rebalancing effects
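Calendar features like these come straight from the date index and need no price data. A minimal sketch, assuming the DataFrame is indexed by trading date; the function name `add_time_features` and the 5-calendar-day window for flagging quarter end are illustrative choices, not values from this lesson.

```python
import pandas as pd

def add_time_features(df: pd.DataFrame, quarter_end_days: int = 5) -> pd.DataFrame:
    """Return a copy of df (indexed by date) with calendar feature columns added."""
    out = df.copy()
    idx = pd.DatetimeIndex(out.index)

    out['DayOfWeek'] = idx.dayofweek   # 0 = Monday ... 6 = Sunday
    out['Month'] = idx.month           # 1 = January ... 12 = December

    # Calendar days remaining until the end of the current quarter
    quarter_end = idx.to_period('Q').end_time.normalize()
    days_left = (quarter_end - idx).days
    out['Quarter_End'] = (days_left < quarter_end_days).astype(int)
    return out

# Example with a few hand-picked dates
dates = pd.to_datetime(['2024-03-25', '2024-03-28', '2024-04-01'])
sample = add_time_features(pd.DataFrame({'Close': [1.0, 2.0, 3.0]}, index=dates))
```

Because the calendar is known in advance, these features carry no lookahead risk. For linear models, the integer day and month columns are usually one-hot encoded rather than fed in as ordered numbers.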
Label Creation: What Are We Predicting?
- Binary Classification: Will price be higher in 5 days? (1=yes, 0=no)
- Multi-Class: Up/Down/Flat over next 5 days
- Regression: Predict the exact return (%, continuous number)
- Direction Only: The binary case stated as up vs. down; it ignores magnitude, so a +0.1% move and a +5% move get the same label
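python
# Binary label: will the close be higher 5 trading days from now?
df['Return_5d'] = df['Close'].shift(-5) / df['Close'] - 1
df['Label'] = (df['Return_5d'] > 0).astype(int)  # 1 if up, 0 if flat or down
# Feature columns created earlier in this lesson
feature_columns = ['Return', 'Log_Return', 'SMA_20', 'STD_20', 'Price_vs_SMA', 'Range']
# Lag features by one bar so each row only uses data from before the decision day
for col in feature_columns:
    df[col] = df[col].shift(1)
# Drop NaN rows: the rolling warm-up at the start and the last 5 rows,
# which have no forward return yet
df = df.dropna()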
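The binary label above covers only one of the label types listed. The sketch below builds the regression target and the three-class label from the same forward return. The function name `make_labels` and the 0.5% "flat" band are assumptions for illustration; the band should be tuned to the asset and horizon. Rows without a forward return are kept as NaN so they cannot be mistaken for "down".

```python
import numpy as np
import pandas as pd

def make_labels(close: pd.Series, horizon: int = 5, flat_band: float = 0.005) -> pd.DataFrame:
    """Forward-return labels: regression target, binary, and three-class."""
    fwd = close.shift(-horizon) / close - 1   # return from t to t + horizon
    known = fwd.notna()                        # False for the last `horizon` rows

    labels = pd.DataFrame({'Return_fwd': fwd})  # regression target

    # Binary: 1 if up, 0 otherwise; NaN where the future is not yet known
    labels['Label_binary'] = (fwd > 0).astype(float).where(known)

    # Three-class: 1 = up, 0 = flat (within +/- flat_band), -1 = down
    three = np.select([fwd > flat_band, fwd < -flat_band], [1.0, -1.0], default=0.0)
    labels['Label_3class'] = pd.Series(three, index=close.index).where(known)
    return labels

# Tiny worked example with a 2-day horizon
prices = pd.Series([100.0, 100.0, 103.0, 100.2, 95.0])
example = make_labels(prices, horizon=2)
```

In the example, the first row has a forward return of 3% (up), the second 0.2% (flat under the 0.5% band), and the third is negative (down); the last two rows have no label because their forward prices do not exist yet.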
Watch for Lookahead Bias
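CRITICAL: Build the label from forward prices (5 days ahead), but build every feature only from data available at decision time. If a feature for day t includes any information from after the decision point, even indirectly through a centered rolling window or a scaler fit on the full dataset, the model sees part of the answer and backtest results will be inflated. A feature that needs day t's close cannot drive a trade placed before that close is known, which is why lagging features by one bar is the conservative default.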
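One practical way to catch lookahead bias is a truncation test: compute the features on the full history, then again on a history cut off at some date, and confirm the overlapping rows are identical. If removing future data changes a past feature value, that feature was peeking ahead. The sketch below is a minimal version of that idea; `build_features`, `leaky_features`, and `has_lookahead` are hypothetical helper names, and the price series is synthetic.

```python
import numpy as np
import pandas as pd

def build_features(close: pd.Series) -> pd.DataFrame:
    """Causal features: each row depends only on current and past prices."""
    feats = pd.DataFrame(index=close.index)
    feats['Return'] = close.pct_change()
    feats['SMA_20'] = close.rolling(20).mean()
    feats['Price_vs_SMA'] = (close - feats['SMA_20']) / feats['SMA_20']
    return feats

def leaky_features(close: pd.Series) -> pd.DataFrame:
    """Same features plus one deliberately wrong column that uses tomorrow's return."""
    feats = build_features(close)
    feats['Next_Return'] = close.pct_change().shift(-1)  # looks one bar ahead
    return feats

def has_lookahead(feature_fn, close: pd.Series, cutoff: int) -> bool:
    """True if any of the first `cutoff` feature rows change when later data is removed."""
    full = feature_fn(close).iloc[:cutoff]
    truncated = feature_fn(close.iloc[:cutoff])
    return not np.allclose(full.to_numpy(), truncated.to_numpy(), equal_nan=True)

# Synthetic price series for the demonstration
rng = np.random.default_rng(1)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 120))))

print(has_lookahead(build_features, prices, cutoff=100))   # causal features: False
print(has_lookahead(leaky_features, prices, cutoff=100))   # leaky feature: True
```

The test only checks the feature pipeline, so the label column (which is supposed to look ahead) should be left out of the function being tested.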
Key Takeaways
- Features are the inputs to ML models: bad features mean bad predictions
- Price returns and volatility are foundational features
- Technical indicators work as features but must be scaled properly
- More features aren't automatically better; quality and diversity matter