5 min read
Linear Regression
Learn to predict numbers with Linear Regression
What is Linear Regression?
Predicting a number by finding the best straight line through data.
y = mx + b
- y: What we predict (target)
- x: What we use to predict (feature)
- m: Slope (how much y changes when x changes)
- b: Intercept (y when x is 0)
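Under the hood, "fitting" means choosing the m and b that minimize the squared errors between the line and the data. As a minimal sketch (plain NumPy, using the study-hours data from the example below), the closed-form least-squares solution looks like this:
code.py
import numpy as np
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([50, 60, 70, 80, 90], dtype=float)
# Least squares: m = cov(x, y) / var(x), b = mean(y) - m * mean(x)
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()
print(f"m = {m:.1f}, b = {b:.1f}")  # m = 10.0, b = 40.0 for this data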
Simple Example
code.py
from sklearn.linear_model import LinearRegression
import numpy as np
# Data: Study hours vs exam score
hours = np.array([[1], [2], [3], [4], [5]])
scores = np.array([50, 60, 70, 80, 90])
# Create and train model
model = LinearRegression()
model.fit(hours, scores)
# Predict
new_hours = np.array([[6]])
predicted_score = model.predict(new_hours)
print(f"6 hours of study → {predicted_score[0]:.0f} score")Understanding the Model
code.py
print(f"Slope (m): {model.coef_[0]:.2f}")
print(f"Intercept (b): {model.intercept_:.2f}")
# Formula: score = 10 * hours + 40
Interpretation: Each extra hour of study adds 10 points!
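As a quick check, you can reproduce a prediction by hand from the fitted parameters (reusing the model trained above):
code.py
# Manual prediction with y = m * x + b
manual = model.coef_[0] * 6 + model.intercept_
print(f"Manual: {manual:.0f}, model: {model.predict([[6]])[0]:.0f}")  # Both 100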
Multiple Features
code.py
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# House price prediction
df = pd.DataFrame({
'sqft': [1000, 1500, 1200, 1800, 2000, 2500],
'bedrooms': [2, 3, 2, 3, 4, 4],
'age': [20, 10, 15, 5, 8, 3],
'price': [200000, 300000, 250000, 350000, 400000, 500000]
})
X = df[['sqft', 'bedrooms', 'age']]
y = df['price']
# Split and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
# Coefficients
for feature, coef in zip(X.columns, model.coef_):
print(f"{feature}: {coef:.2f}")Model Evaluation
R² Score
How much of the target's variance the model explains (at most 1.0):
code.py
from sklearn.metrics import r2_score
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.2f}")
# Or simply:
print(f"R² Score: {model.score(X_test, y_test):.2f}")- 1.0: Perfect prediction
- 0.0: No better than predicting mean
- < 0: Worse than predicting mean
Mean Squared Error (MSE)
The average squared difference between actual and predicted values:
code.py
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
print(f"MSE: {mse:.2f}")Root Mean Squared Error (RMSE)
The square root of MSE, in the same units as the target:
code.py
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:.2f}") # Same units as priceMean Absolute Error (MAE)
The average absolute difference between actual and predicted values; less sensitive to outliers than MSE:
code.py
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
print(f"MAE: {mae:.2f}")Visualizing Predictions
code.py
import matplotlib.pyplot as plt
# Simple linear regression visualization
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 4.2, 5.8, 8.1, 9.9])
model = LinearRegression()
model.fit(X, y)
# Plot
plt.scatter(X, y, color='blue', label='Actual')
plt.plot(X, model.predict(X), color='red', label='Predicted')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.title('Linear Regression')
plt.show()
Residuals
Difference between actual and predicted:
code.py
y_pred = model.predict(X)
residuals = y - y_pred
# Residuals should be random, centered at 0
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Residual Plot')
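# Quick numeric check (a sketch): with an intercept term, training residuals
# average to (numerically) zero, so a clearly nonzero mean signals a problem
print(f"Mean residual: {residuals.mean():.2e}")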
plt.show()
Feature Scaling
Standardizing features puts all coefficients on the same scale, so their magnitudes can be compared:
code.py
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = LinearRegression()
model.fit(X_train_scaled, y_train)
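# Illustrative sketch, assuming X_train/X_test are the house-price features
# from earlier: standardized inputs make coefficient magnitudes comparable
for feature, coef in zip(['sqft', 'bedrooms', 'age'], model.coef_):
    print(f"{feature}: {coef:.2f}")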
# Now coefficients are comparable
Complete Example
code.py
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.datasets import fetch_california_housing
import numpy as np
# Load California housing dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target
print(f"Features: {housing.feature_names}")
print(f"Samples: {X.shape[0]}")
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
print(f"\n=== Model Evaluation ===")
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.3f}")
# Coefficients (features are unscaled, so magnitudes aren't directly comparable)
print(f"\n=== Feature Coefficients ===")
for name, coef in zip(housing.feature_names, model.coef_):
print(f"{name}: {coef:.4f}")Key Points
- Linear Regression predicts continuous numbers
- R² measures how much variance the model explains
- RMSE and MAE measure prediction error
- Check residuals for patterns
- Use multiple features for better predictions
- Coefficients show each feature's influence (scale features first to compare them)
What's Next?
Learn about Logistic Regression for classification.