5 min read
Linear Regression
Learn to predict numbers with Linear Regression
What is Linear Regression?
Predicting a number by finding the best straight line through data.
y = mx + b
- y: What we predict (target)
- x: What we use to predict (feature)
- m: Slope (how much y changes when x changes)
- b: Intercept (y when x is 0)
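Under the hood, "fitting" means choosing the m and b that minimize the squared errors between the line and the data. As a minimal sketch (plain NumPy, using the study-hours data from the example below), the closed-form least-squares solution looks like this:
code.py
import numpy as np
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([50, 60, 70, 80, 90], dtype=float)
# Least squares: m = cov(x, y) / var(x), b = mean(y) - m * mean(x)
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()
print(f"m = {m:.1f}, b = {b:.1f}")  # m = 10.0, b = 40.0 for this data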
Simple Example
code.py
from sklearn.linear_model import LinearRegression
import numpy as np
# Data: Study hours vs exam score
hours = np.array([[1], [2], [3], [4], [5]])
scores = np.array([50, 60, 70, 80, 90])
# Create and train model
model = LinearRegression()
model.fit(hours, scores)
# Predict
new_hours = np.array([[6]])
predicted_score = model.predict(new_hours)
print(f"6 hours of study → {predicted_score[0]:.0f} score")Understanding the Model
code.py
print(f"Slope (m): {model.coef_[0]:.2f}")
print(f"Intercept (b): {model.intercept_:.2f}")
# Formula: score = 10 * hours + 40
Interpretation: Each extra hour of study adds 10 points!
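As a quick check, you can reproduce a prediction by hand from the fitted parameters (reusing the model trained above):
code.py
# Manual prediction with y = m * x + b
manual = model.coef_[0] * 6 + model.intercept_
print(f"Manual: {manual:.0f}, model: {model.predict([[6]])[0]:.0f}")  # Both 100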
Multiple Features
code.py
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
# House price prediction
df = pd.DataFrame({
'sqft': [1000, 1500, 1200, 1800, 2000, 2500],
'bedrooms': [2, 3, 2, 3, 4, 4],
'age': [20, 10, 15, 5, 8, 3],
'price': [200000, 300000, 250000, 350000, 400000, 500000]
})
X = df[['sqft', 'bedrooms', 'age']]
y = df['price']
# Split and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
# Coefficients
for feature, coef in zip(X.columns, model.coef_):
print(f"{feature}: {coef:.2f}")Model Evaluation
R² Score
How much of the target's variance the model explains (at most 1.0):
code.py
from sklearn.metrics import r2_score
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.2f}")
# Or simply:
print(f"R² Score: {model.score(X_test, y_test):.2f}")- 1.0: Perfect prediction
- 0.0: No better than predicting mean
- < 0: Worse than predicting mean
Mean Squared Error (MSE)
The average squared difference between actual and predicted values:
code.py
from sklearn.metrics import mean_squared_error
mse = mean_squared_error(y_test, y_pred)
print(f"MSE: {mse:.2f}")Root Mean Squared Error (RMSE)
The square root of MSE, in the same units as the target:
code.py
rmse = np.sqrt(mse)
print(f"RMSE: {rmse:.2f}") # Same units as priceMean Absolute Error (MAE)
The average absolute difference between actual and predicted values; less sensitive to outliers than MSE:
code.py
from sklearn.metrics import mean_absolute_error
mae = mean_absolute_error(y_test, y_pred)
print(f"MAE: {mae:.2f}")Visualizing Predictions
code.py
import matplotlib.pyplot as plt
# Simple linear regression visualization
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 4.2, 5.8, 8.1, 9.9])
model = LinearRegression()
model.fit(X, y)
# Plot
plt.scatter(X, y, color='blue', label='Actual')
plt.plot(X, model.predict(X), color='red', label='Predicted')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.title('Linear Regression')
plt.show()
Residuals
Difference between actual and predicted:
code.py
y_pred = model.predict(X)
residuals = y - y_pred
# Residuals should be random, centered at 0
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Residual Plot')
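# Quick numeric check (a sketch): with an intercept term, training residuals
# average to (numerically) zero, so a clearly nonzero mean signals a problem
print(f"Mean residual: {residuals.mean():.2e}")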
plt.show()
Feature Scaling
Standardizing features puts all coefficients on the same scale, so their magnitudes can be compared:
code.py
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = LinearRegression()
model.fit(X_train_scaled, y_train)
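# Illustrative sketch, assuming X_train/X_test are the house-price features
# from earlier: standardized inputs make coefficient magnitudes comparable
for feature, coef in zip(['sqft', 'bedrooms', 'age'], model.coef_):
    print(f"{feature}: {coef:.2f}")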
# Now coefficients are comparable
Complete Example
code.py
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.datasets import fetch_california_housing
import numpy as np
# Load California housing dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target
print(f"Features: {housing.feature_names}")
print(f"Samples: {X.shape[0]}")
# Split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train
model = LinearRegression()
model.fit(X_train, y_train)
# Predict
y_pred = model.predict(X_test)
# Evaluate
print(f"\n=== Model Evaluation ===")
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.3f}")
# Coefficients (features are unscaled, so magnitudes aren't directly comparable)
print(f"\n=== Feature Coefficients ===")
for name, coef in zip(housing.feature_names, model.coef_):
print(f"{name}: {coef:.4f}")Key Points
- Linear Regression predicts continuous numbers
- R² measures how much variance the model explains
- RMSE and MAE measure prediction error
- Check residuals for patterns
- Use multiple features for better predictions
- Coefficients show each feature's influence (scale features first to compare them)
What's Next?
Learn about Logistic Regression for classification.