5 min read
Cross-Validation
Learn to validate models more reliably
Why Cross-Validation?
A single train-test split can give lucky (or unlucky) results.
Cross-validation tests the model on multiple splits for a more reliable evaluation.
K-Fold Cross-Validation
Split the data into K parts and train K times, holding out a different part as the test set each time:
| Fold | Part 1 | Part 2 | Part 3 | Part 4 | Part 5 |
|---|---|---|---|---|---|
| 1 | Test | Train | Train | Train | Train |
| 2 | Train | Test | Train | Train | Train |
| 3 | Train | Train | Test | Train | Train |
| 4 | Train | Train | Train | Test | Train |
| 5 | Train | Train | Train | Train | Test |
Every sample is tested once!
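You can see this concretely by iterating over the splits yourself. A minimal sketch (the 10-sample toy array is just for illustration):
code.py
from sklearn.model_selection import KFold
import numpy as np

X_toy = np.arange(10).reshape(-1, 1)  # 10 samples, 1 feature
kfold = KFold(n_splits=5)

# Each sample index appears in exactly one test fold
for fold, (train_idx, test_idx) in enumerate(kfold.split(X_toy), start=1):
    print(f"Fold {fold}: test indices = {test_idx}")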
Basic Usage
code.py
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
# Load data
iris = load_iris()
X, y = iris.data, iris.target
# Model
model = LogisticRegression(max_iter=1000)
# 5-fold cross-validation
scores = cross_val_score(model, X, y, cv=5)
print(f"Scores: {scores}")
print(f"Mean: {scores.mean():.3f}")
print(f"Std: {scores.std():.3f}")Different Metrics
code.py
# Accuracy (default)
scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
# F1 Score
scores = cross_val_score(model, X, y, cv=5, scoring='f1_weighted')
# ROC AUC (one-vs-rest, works for multiclass)
scores = cross_val_score(model, X, y, cv=5, scoring='roc_auc_ovr')
# For regression (requires a continuous target, unlike the iris class labels above)
from sklearn.linear_model import LinearRegression
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='neg_mean_squared_error')
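One caveat: the 'neg_*' scorers return negated errors so that higher is always better; flip the sign to recover the raw error. A quick sketch on synthetic regression data (make_regression is just a stand-in here, since the iris labels above are class indices rather than a true regression target):
code.py
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data, purely for illustration
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)

scores = cross_val_score(LinearRegression(), X_reg, y_reg, cv=5,
                         scoring='neg_mean_squared_error')
print(f"MSE: {-scores.mean():.3f}")  # negate to get the actual MSE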
KFold Object
More control over splits:
code.py
from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kfold)
print(f"Mean: {scores.mean():.3f}")Stratified K-Fold
Preserves class proportions in each fold (for classification):
code.py
from sklearn.model_selection import StratifiedKFold
skfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skfold)
print(f"Mean: {scores.mean():.3f}")Use for imbalanced datasets!
Leave-One-Out (LOO)
Each sample serves as the test set exactly once:
code.py
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
scores = cross_val_score(model, X, y, cv=loo)
print(f"Scores: {len(scores)} (one per sample)")
print(f"Mean: {scores.mean():.3f}")Use for very small datasets
Cross-Validate with Multiple Metrics
code.py
from sklearn.model_selection import cross_validate
scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']
results = cross_validate(model, X, y, cv=5, scoring=scoring)
for metric in scoring:
    key = f'test_{metric}'
    print(f"{metric}: {results[key].mean():.3f} (+/- {results[key].std():.3f})")
Get Predictions
code.py
from sklearn.model_selection import cross_val_predict
# Get predictions for all samples
y_pred = cross_val_predict(model, X, y, cv=5)
# Now you can use these for confusion matrix, etc.
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y, y_pred))
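The same out-of-fold predictions work with any label-based metric; for example, a per-class report:
code.py
from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 from the out-of-fold predictions
print(classification_report(y, y_pred, target_names=iris.target_names))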
Nested Cross-Validation
For hyperparameter tuning + evaluation:
code.py
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC
# Inner CV: Find best parameters
param_grid = {'C': [0.1, 1, 10], 'kernel': ['rbf', 'linear']}
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
grid_search = GridSearchCV(SVC(), param_grid, cv=inner_cv)
# Outer CV: Evaluate model with tuning
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(grid_search, X, y, cv=outer_cv)
print(f"Nested CV Score: {scores.mean():.3f} (+/- {scores.std():.3f})")Time Series Split
Time Series Split
For time-based data (no shuffling!):
code.py
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    print(f"Train: {len(train_idx)}, Test: {len(test_idx)}")
Complete Example
code.py
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
import numpy as np
# Load data
data = load_breast_cancer()
X, y = data.data, data.target
# Create pipeline (scaling + model)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000))
])
# Stratified 5-fold CV
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Multiple metrics
scoring = {
    'accuracy': 'accuracy',
    'precision': 'precision',
    'recall': 'recall',
    'f1': 'f1',
    'auc': 'roc_auc'
}
# Cross-validate
results = cross_validate(pipeline, X, y, cv=cv, scoring=scoring)
# Print results
print("=== Cross-Validation Results ===")
print(f"Samples: {len(X)}, Features: {X.shape[1]}")
print(f"\n5-Fold CV Scores:")
for metric in scoring:
    scores = results[f'test_{metric}']
    print(f"  {metric:10}: {scores.mean():.3f} (+/- {scores.std():.3f})")
Choosing K
| K | Pros | Cons |
|---|---|---|
| 5 | Fast; good bias-variance balance (most common choice) | Less thorough than larger K |
| 10 | More reliable estimate | Slower |
| LOO | Uses almost all data for training | Very slow; high-variance estimate |
Key Points
- Cross-validation gives more reliable estimates than a single split
- Use K=5 or 10 for most cases
- Use StratifiedKFold for classification
- Use TimeSeriesSplit for time data
- Report mean ± std of scores
- Use cross_val_predict for confusion matrix
- Nested CV for hyperparameter tuning
What's Next?
Learn about Feature Engineering basics.