5 min read
Train-Test Split
Learn to properly split data for machine learning
Why Split Data?
We need to test our model on data it has never seen.
- Training data: Model learns from this
- Test data: We check how well it learned
If we test on training data, we can't tell whether the model actually learned general patterns or just memorized the examples!
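A quick illustration of why this matters. This sketch uses synthetic data with purely random labels, so there is nothing real to learn, and a 1-nearest-neighbor classifier, which can memorize its training set perfectly:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Synthetic data with random labels: there is no real pattern to learn
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = rng.randint(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 1-NN memorizes every training point
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)

print(f"Train accuracy: {model.score(X_train, y_train):.0%}")  # perfect -- memorization
print(f"Test accuracy:  {model.score(X_test, y_test):.0%}")    # roughly chance level
```

The training score looks perfect, but the test score reveals the model learned nothing generalizable. Only held-out data exposes this.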
Basic Split
code.py
from sklearn.model_selection import train_test_split
import numpy as np
# Sample data
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2
)
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
Common Split Ratios
| Split | Train | Test | When to Use |
|---|---|---|---|
| 80/20 | 80% | 20% | Most common |
| 70/30 | 70% | 30% | Small datasets |
| 90/10 | 90% | 10% | Large datasets |
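The ratios above map directly to the test_size argument. A minimal sketch with 100 synthetic samples:

```python
from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# Each test_size corresponds to a row in the table above
for test_size in (0.2, 0.3, 0.1):
    X_train, X_test, _, _ = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    print(f"test_size={test_size}: {len(X_train)} train / {len(X_test)} test")
# test_size=0.2: 80 train / 20 test
# test_size=0.3: 70 train / 30 test
# test_size=0.1: 90 train / 10 test
```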
Random State
For reproducible results:
code.py
# Without random_state: different split each time
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# With random_state: same split every time
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
Stratified Split
For classification, keep class proportions:
code.py
# Imbalanced data: 90 class 0, 10 class 1
y = np.array([0]*90 + [1]*10)
X = np.random.randn(100, 2)
# A regular split can skew the 90/10 ratio; stratify=y preserves it in both sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"Train class distribution: {np.bincount(y_train)}")
print(f"Test class distribution: {np.bincount(y_test)}")
Train-Validation-Test Split
For model tuning, use 3 sets:
code.py
# First split: separate test set
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Second split: separate validation from training
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=42 # 0.25 x 0.8 = 0.2
)
print(f"Train: {len(X_train)}") # 60%
print(f"Validation: {len(X_val)}") # 20%
print(f"Test: {len(X_test)}")  # 20%
Purpose:
- Train: Learn patterns
- Validation: Tune hyperparameters
- Test: Final evaluation (use only once!)
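Here is a sketch of how the three sets work together in practice: pick the hyperparameter that scores best on validation, then touch the test set exactly once at the end. The data is synthetic and the candidate C values are just examples:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=42)

# 60/20/20 split, as above
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

# Tune on the validation set -- the test set stays untouched
best_C, best_score = None, -1.0
for C in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_C, best_score = C, score

# Final evaluation: use the test set exactly once
final = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print(f"Best C: {best_C}")
print(f"Validation accuracy: {best_score:.1%}")
print(f"Test accuracy: {final.score(X_test, y_test):.1%}")
```

If you instead tuned on the test set, its score would no longer be an honest estimate of performance on new data.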
Shuffle
Data is shuffled by default. Turn off for time series:
code.py
# Time series: don't shuffle!
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, shuffle=False
)
With Pandas DataFrames
code.py
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.DataFrame({
'feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'feature2': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
'target': [0, 0, 0, 1, 1, 1, 0, 1, 1, 0]
})
# Separate features and target
X = df[['feature1', 'feature2']]
y = df['target']
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
Data Leakage Warning!
Never let test data influence training:
code.py
# WRONG: Scaling before split
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Uses ALL data!
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
# CORRECT: Scale after split
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit on train only
X_test_scaled = scaler.transform(X_test)  # Just transform test
Complete Example
code.py
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
import numpy as np
# Load data
data = load_breast_cancer()
X = data.data
y = data.target
print(f"Total samples: {len(X)}")
print(f"Features: {X.shape[1]}")
print(f"Class distribution: {np.bincount(y)}")
# Stratified split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"\nTrain samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Train class dist: {np.bincount(y_train)}")
print(f"Test class dist: {np.bincount(y_test)}")
# Scale (fit on train only!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train and evaluate
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
train_acc = model.score(X_train_scaled, y_train)
test_acc = model.score(X_test_scaled, y_test)
print(f"\nTraining accuracy: {train_acc:.1%}")
print(f"Test accuracy: {test_acc:.1%}")
Key Points
- Always split before any preprocessing
- Use test_size=0.2 for 80/20 split
- Use random_state for reproducibility
- Use stratify for classification
- Use shuffle=False for time series
- Train-Val-Test for hyperparameter tuning
- Avoid data leakage!
What's Next?
Learn about Linear Regression.