
Linear Regression

Learn to predict numbers with Linear Regression

What is Linear Regression?

Linear regression predicts a number by finding the best straight line through the data.

y = mx + b
  • y: What we predict (target)
  • x: What we use to predict (feature)
  • m: Slope (how much y changes for each one-unit increase in x)
  • b: Intercept (y when x is 0)
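
Curious where m and b come from? Here is a minimal sketch that computes them by hand from the least-squares formulas (m = covariance of x and y divided by variance of x; b = mean of y minus m times mean of x), using the study-hours data from the next example:

code.py
import numpy as np

# Study hours vs exam score (same data as the example below)
x = np.array([1, 2, 3, 4, 5])
y = np.array([50, 60, 70, 80, 90])

# Least-squares formulas: m = cov(x, y) / var(x), b = mean(y) - m * mean(x)
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()
print(f"m = {m:.1f}, b = {b:.1f}")  # m = 10.0, b = 40.0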

Simple Example

code.py
from sklearn.linear_model import LinearRegression
import numpy as np

# Data: Study hours vs exam score
hours = np.array([[1], [2], [3], [4], [5]])
scores = np.array([50, 60, 70, 80, 90])

# Create and train model
model = LinearRegression()
model.fit(hours, scores)

# Predict
new_hours = np.array([[6]])
predicted_score = model.predict(new_hours)
print(f"6 hours of study → {predicted_score[0]:.0f} score")

Understanding the Model

code.py
print(f"Slope (m): {model.coef_[0]:.2f}")
print(f"Intercept (b): {model.intercept_:.2f}")

# Formula: score = 10 * hours + 40

Interpretation: Each extra hour of study adds 10 points!
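
As a quick sanity check, you can plug the learned slope and intercept back into y = mx + b and confirm the result matches model.predict; a minimal sketch reusing the fitted model from above:

code.py
# Manual prediction: score = 10 * 6 + 40 = 100
manual = model.coef_[0] * 6 + model.intercept_
print(f"Manual: {manual:.0f}, model.predict: {model.predict([[6]])[0]:.0f}")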

Multiple Features

code.py
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# House price prediction
df = pd.DataFrame({
    'sqft': [1000, 1500, 1200, 1800, 2000, 2500],
    'bedrooms': [2, 3, 2, 3, 4, 4],
    'age': [20, 10, 15, 5, 8, 3],
    'price': [200000, 300000, 250000, 350000, 400000, 500000]
})

X = df[['sqft', 'bedrooms', 'age']]
y = df['price']

# Split and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)

# Coefficients
for feature, coef in zip(X.columns, model.coef_):
    print(f"{feature}: {coef:.2f}")
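
Once trained, the model predicts from all three features at once. Here is a minimal sketch; the new house's values are made up for illustration:

code.py
# Hypothetical new house: 1600 sqft, 3 bedrooms, 7 years old
new_house = pd.DataFrame({'sqft': [1600], 'bedrooms': [3], 'age': [7]})
print(f"Predicted price: {model.predict(new_house)[0]:,.0f}")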

Model Evaluation

R² Score

The fraction of variance in the target that the model explains:

code.py
from sklearn.metrics import r2_score

y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print(f"R² Score: {r2:.2f}")

# Or simply:
print(f"R² Score: {model.score(X_test, y_test):.2f}")
  • 1.0: Perfect prediction
  • 0.0: No better than predicting mean
  • < 0: Worse than predicting mean
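
You can also compute R² straight from its definition, 1 − SS_res / SS_tot, which makes the "predicting the mean" baseline explicit; a minimal sketch reusing y_test and y_pred from above:

code.py
import numpy as np

ss_res = np.sum((y_test - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_test - y_test.mean()) ** 2)  # variance around the mean baseline
print(f"R² (manual): {1 - ss_res / ss_tot:.2f}")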

Mean Squared Error (MSE)

The average squared difference between actual and predicted values:

code.py
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, y_pred)
print(f"MSE: {mse:.2f}")

Root Mean Squared Error (RMSE)

The square root of MSE, which brings the error back to the same units as the target:

code.py
import numpy as np

rmse = np.sqrt(mse)
print(f"RMSE: {rmse:.2f}")  # Same units as price

Mean Absolute Error (MAE)

The average absolute difference between actual and predicted values:

code.py
from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_test, y_pred)
print(f"MAE: {mae:.2f}")

Visualizing Predictions

code.py
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Simple linear regression visualization
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 4.2, 5.8, 8.1, 9.9])

model = LinearRegression()
model.fit(X, y)

# Plot
plt.scatter(X, y, color='blue', label='Actual')
plt.plot(X, model.predict(X), color='red', label='Predicted')
plt.xlabel('X')
plt.ylabel('y')
plt.legend()
plt.title('Linear Regression')
plt.show()

Residuals

Difference between actual and predicted:

code.py
y_pred = model.predict(X)
residuals = y - y_pred

# Residuals should be random, centered at 0
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.show()
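
One useful fact: for least squares with an intercept, the residuals on the training data always average to exactly zero (up to floating-point error), so a non-zero mean here would signal a bug:

code.py
# Training residuals of an OLS fit with an intercept sum to zero
print(f"Mean residual: {residuals.mean():.10f}")  # ~0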

Feature Scaling

Features measured on different scales (sqft in the thousands, bedrooms in single digits) produce coefficients that can't be compared directly. Standardizing fixes that:

code.py
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Now coefficients are comparable
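
A common way to package this, shown here as a sketch rather than part of the course code above, is a scikit-learn Pipeline: the scaler and the model are bundled together, so scaling is applied automatically at both fit and predict time.

code.py
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Scaler + model in one object; the transform step is handled internally
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)

print(f"R² Score: {pipe.score(X_test, y_test):.2f}")
print("Scaled coefficients:", pipe[-1].coef_)  # comparable across features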

Complete Example

code.py
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.datasets import fetch_california_housing
import numpy as np

# Load California housing dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target

print(f"Features: {housing.feature_names}")
print(f"Samples: {X.shape[0]}")

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
print(f"\n=== Model Evaluation ===")
print(f"R² Score: {r2_score(y_test, y_pred):.3f}")
print(f"RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}")
print(f"MAE: {mean_absolute_error(y_test, y_pred):.3f}")

# Feature importance
print(f"\n=== Feature Coefficients ===")
for name, coef in zip(housing.feature_names, model.coef_):
    print(f"{name}: {coef:.4f}")

Key Points

  • Linear Regression predicts continuous numbers
  • R² measures how well the model explains variance
  • RMSE and MAE measure prediction error
  • Check residuals for patterns
  • Use multiple features for better predictions
  • Coefficients show each feature's influence (scale features first to compare them)

What's Next?

Learn about Logistic Regression for classification.
