5 min read
Train-Test Split
Learn to properly split data for machine learning
Why Split Data?
We need to test our model on data it has never seen.
- Training data: Model learns from this
- Test data: We check how well it learned
If we test on training data, we can't tell whether the model actually learned general patterns or just memorized the examples!
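A quick illustration of why this matters. This sketch uses synthetic data with purely random labels, so there is nothing real to learn, and a 1-nearest-neighbor classifier, which can memorize its training set perfectly:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import numpy as np

# Synthetic data with random labels: there is no real pattern to learn
rng = np.random.RandomState(0)
X = rng.randn(200, 5)
y = rng.randint(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 1-NN memorizes every training point
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)

print(f"Train accuracy: {model.score(X_train, y_train):.0%}")  # perfect -- memorization
print(f"Test accuracy:  {model.score(X_test, y_test):.0%}")    # roughly chance level
```

The training score looks perfect, but the test score reveals the model learned nothing generalizable. Only held-out data exposes this.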
Basic Split
code.py
from sklearn.model_selection import train_test_split
import numpy as np
# Sample data
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8], [9], [10]])
y = np.array([2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2
)
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
Common Split Ratios
| Split | Train | Test | When to Use |
|---|---|---|---|
| 80/20 | 80% | 20% | Most common |
| 70/30 | 70% | 30% | Small datasets |
| 90/10 | 90% | 10% | Large datasets |
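The ratios above map directly to the test_size argument. A minimal sketch with 100 synthetic samples:

```python
from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# Each test_size corresponds to a row in the table above
for test_size in (0.2, 0.3, 0.1):
    X_train, X_test, _, _ = train_test_split(
        X, y, test_size=test_size, random_state=42
    )
    print(f"test_size={test_size}: {len(X_train)} train / {len(X_test)} test")
# test_size=0.2: 80 train / 20 test
# test_size=0.3: 70 train / 30 test
# test_size=0.1: 90 train / 10 test
```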
Random State
For reproducible results:
code.py
# Without random_state: different split each time
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# With random_state: same split every time
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
Stratified Split
For classification, keep class proportions:
code.py
# Imbalanced data: 90 class 0, 10 class 1
y = np.array([0]*90 + [1]*10)
X = np.random.randn(100, 2)
# A regular split can skew the 90/10 ratio; stratify=y preserves it in both sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"Train class distribution: {np.bincount(y_train)}")
print(f"Test class distribution: {np.bincount(y_test)}")
Train-Validation-Test Split
For model tuning, use 3 sets:
code.py
# First split: separate test set
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Second split: separate validation from training
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=42 # 0.25 x 0.8 = 0.2
)
print(f"Train: {len(X_train)}") # 60%
print(f"Validation: {len(X_val)}") # 20%
print(f"Test: {len(X_test)}")  # 20%
Purpose:
- Train: Learn patterns
- Validation: Tune hyperparameters
- Test: Final evaluation (use only once!)
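Here is a sketch of how the three sets work together in practice: pick the hyperparameter that scores best on validation, then touch the test set exactly once at the end. The data is synthetic and the candidate C values are just examples:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=42)

# 60/20/20 split, as above
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

# Tune on the validation set -- the test set stays untouched
best_C, best_score = None, -1.0
for C in (0.01, 0.1, 1.0, 10.0):
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_val, y_val)
    if score > best_score:
        best_C, best_score = C, score

# Final evaluation: use the test set exactly once
final = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print(f"Best C: {best_C}")
print(f"Validation accuracy: {best_score:.1%}")
print(f"Test accuracy: {final.score(X_test, y_test):.1%}")
```

If you instead tuned on the test set, its score would no longer be an honest estimate of performance on new data.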
Shuffle
Data is shuffled by default. Turn off for time series:
code.py
# Time series: don't shuffle!
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, shuffle=False
)
With Pandas DataFrames
code.py
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.DataFrame({
'feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
'feature2': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
'target': [0, 0, 0, 1, 1, 1, 0, 1, 1, 0]
})
# Separate features and target
X = df[['feature1', 'feature2']]
y = df['target']
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
print(f"X_train shape: {X_train.shape}")
print(f"y_train shape: {y_train.shape}")
Data Leakage Warning!
Never let test data influence training:
code.py
# WRONG: Scaling before split
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # Uses ALL data!
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y)
# CORRECT: Scale after split
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Fit on train only
X_test_scaled = scaler.transform(X_test)  # Just transform test
Complete Example
code.py
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
import numpy as np
# Load data
data = load_breast_cancer()
X = data.data
y = data.target
print(f"Total samples: {len(X)}")
print(f"Features: {X.shape[1]}")
print(f"Class distribution: {np.bincount(y)}")
# Stratified split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"\nTrain samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Train class dist: {np.bincount(y_train)}")
print(f"Test class dist: {np.bincount(y_test)}")
# Scale (fit on train only!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train and evaluate
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)
train_acc = model.score(X_train_scaled, y_train)
test_acc = model.score(X_test_scaled, y_test)
print(f"\nTraining accuracy: {train_acc:.1%}")
print(f"Test accuracy: {test_acc:.1%}")
Key Points
- Always split before any preprocessing
- Use test_size=0.2 for 80/20 split
- Use random_state for reproducibility
- Use stratify for classification
- Use shuffle=False for time series
- Train-Val-Test for hyperparameter tuning
- Avoid data leakage!
What's Next?
Learn about Linear Regression.