Module 6
12 min read

Simple Linear Regression

Build predictive models with linear regression

What You'll Learn

  • What linear regression is
  • Understanding the regression line
  • Interpreting slope and intercept
  • R-squared and model fit
  • Making predictions

Linear Regression Basics

[Figure: Linear Regression Line]

Purpose: Model relationship between two variables and make predictions

Goal: Find the best-fit line through data points

Equation: Y = β₀ + β₁X + ε

Where:

  • Y = dependent variable (what we predict)
  • X = independent variable (predictor)
  • β₀ = intercept (Y when X=0)
  • β₁ = slope (change in Y per unit of X)
  • ε = error term

The Regression Line

What it does: Minimizes the sum of squared errors (the squared vertical distances from the points to the line)

Method: Ordinary Least Squares (OLS)

Result: Best-fit line: ŷ = b₀ + b₁x

Example: Sales = 1000 + 50 × (Ad Spend)

  • Intercept: $1000 baseline sales
  • Slope: Each $1 in ads increases sales by $50
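
The slope and intercept can be computed directly from the data with the OLS formulas b₁ = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b₀ = ȳ − b₁x̄. Below is a minimal NumPy sketch of those two formulas; the small x and y arrays are made-up illustration values, not course data.

# OLS slope and intercept by hand (illustration data only)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# b1 = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# b0 = y_mean - b1 * x_mean
b0 = y.mean() - b1 * x.mean()

print(f"Best-fit line: y_hat = {b0:.2f} + {b1:.2f}x")   # y_hat = 2.20 + 0.60x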

Interpreting the Slope

Slope (b₁): Change in Y for one-unit increase in X

Examples:

Positive slope: Sales = 100 + 5(Price)
"Each $1 increase in price → $5 more in sales"

Negative slope: Demand = 1000 - 20(Price)
"Each $1 increase in price → 20 fewer units sold"

Zero slope: No relationship between X and Y

Interpreting the Intercept

Intercept (b₀): Predicted Y when X = 0

Example: Test Score = 50 + 10(Study Hours)

  • Intercept: 50 points with zero study
  • Realistic? Maybe not! (extrapolation issue)

Warning: Only meaningful if X=0 makes sense in your context

R-Squared (R²)

[Figure: R-Squared Visualization]

What it measures: How much variance in Y is explained by X

Range: 0 to 1 (or 0% to 100%)

Interpretation:

  • R² = 0.80: "80% of variance explained"
  • R² = 0.30: "30% of variance explained"

Guidelines:

  • R² > 0.7: Strong relationship
  • R² = 0.3-0.7: Moderate
  • R² < 0.3: Weak

Important: High R² doesn't mean causation!
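
R² can also be computed by hand as 1 − SSE/SST, where SSE is the sum of squared residuals and SST is the total sum of squares. A minimal sketch, reusing the fitted line ŷ = 2.2 + 0.6x from the earlier illustration data:

# R-squared = 1 - SSE/SST (illustration data and fit from the OLS sketch above)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
y_hat = 2.2 + 0.6 * x                      # predictions from the fitted line

sse = np.sum((y - y_hat) ** 2)             # unexplained (residual) variation
sst = np.sum((y - y.mean()) ** 2)          # total variation in y
r_squared = 1 - sse / sst

print(f"R-squared: {r_squared:.2f}")       # 0.60 -> 60% of variance explained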

Residuals

What they are: Actual Y - Predicted Y

Why important: Show how well model fits

Good residuals:

  • Randomly scattered
  • No pattern
  • Normally distributed

Bad residuals:

  • Curved pattern (nonlinear relationship!)
  • Increasing spread (heteroscedasticity)
  • Outliers
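
A residual plot makes these patterns visible. The sketch below plots residuals against predicted values with matplotlib, again using the illustration data from the earlier sketches; you want a random, patternless cloud around zero.

# Residual plot: actual minus predicted, against predicted values
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
y_hat = 2.2 + 0.6 * x

residuals = y - y_hat                      # actual Y - predicted Y

plt.scatter(y_hat, residuals)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Look for random scatter with no pattern")
plt.show()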

Making Predictions

Process:

  1. Fit regression: ŷ = 50 + 2x
  2. Plug in X value: x = 10
  3. Calculate: ŷ = 50 + 2(10) = 70

Example: Height = 60 + 2.5(Age)
Predict height at age 10: Height = 60 + 2.5(10) = 85 inches

Caution: Don't extrapolate beyond data range!
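
As a sketch, the three steps above fit naturally into a small helper function for the line ŷ = 50 + 2x. The observed range of 0 to 20 used below is an assumed, made-up range purely to illustrate the extrapolation warning.

# Prediction with the fitted line y_hat = 50 + 2x, plus an extrapolation check
def predict(x, b0=50, b1=2, x_min=0, x_max=20):
    # x_min/x_max are an assumed observed range, for illustration only
    if not (x_min <= x <= x_max):
        print(f"Warning: x={x} is outside the observed range [{x_min}, {x_max}] (extrapolation)")
    return b0 + b1 * x

print(predict(10))    # 50 + 2*10 = 70
print(predict(100))   # triggers the extrapolation warning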

Excel Implementation

Steps:

  1. Plot scatter chart
  2. Add trendline
  3. Display equation and R²

Formulas:

  • Slope: =SLOPE(Y_range, X_range)
  • Intercept: =INTERCEPT(Y_range, X_range)
  • R²: =RSQ(Y_range, X_range)
  • Predict: =FORECAST(new_x, Y_range, X_range)

Analysis ToolPak: Data → Data Analysis → Regression

Python Implementation

from sklearn.linear_model import LinearRegression
import numpy as np

# Data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# Fit model
model = LinearRegression()
model.fit(X, y)

# Get coefficients
print(f"Intercept: {model.intercept_}")
print(f"Slope: {model.coef_[0]}")
print(f"R-squared: {model.score(X, y)}")

# Predict for a new X value
new_value = model.predict([[6]])
print(f"Prediction for X=6: {new_value[0]}")
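
To visualize the fit (as the tip at the end of this module suggests), you can plot the data with the fitted line. A minimal matplotlib sketch, assuming the model above has already been fit:

# Plot the data and the fitted line (assumes X, y, model from above)
import matplotlib.pyplot as plt

plt.scatter(X, y, label="Data")
plt.plot(X, model.predict(X), color="red", label="Fitted line")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.show()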

Real-World Applications

Marketing: Sales vs advertising spend

Finance: Stock returns vs market returns (beta)

HR: Salary vs years of experience

Real Estate: House price vs square footage

Healthcare: Blood pressure vs age

Example: Advertising ROI

Data:

  • Ad Spend ($): 100, 200, 300, 400, 500
  • Sales ($): 500, 900, 1200, 1600, 1900

Regression: Sales = 170 + 3.5(Ad Spend), R² ≈ 0.998

Interpretation:

  • $170 baseline sales
  • Each $1 in ads → $3.50 in sales
  • Very strong fit (about 99.8% of variance explained)

Decision: ROI = $3.50 - $1 = $2.50 profit per ad dollar → Keep advertising!
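
As a quick check, the sketch below re-fits this example with scikit-learn from the data above; it should reproduce the slope, intercept, and R² reported here.

# Verify the advertising regression with scikit-learn
import numpy as np
from sklearn.linear_model import LinearRegression

ad_spend = np.array([[100], [200], [300], [400], [500]])
sales = np.array([500, 900, 1200, 1600, 1900])

model = LinearRegression().fit(ad_spend, sales)
print(f"Intercept: {model.intercept_:.0f}")              # ~170
print(f"Slope: {model.coef_[0]:.1f}")                    # ~3.5
print(f"R-squared: {model.score(ad_spend, sales):.3f}")  # ~0.998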

Correlation vs Regression

Correlation (r):

  • Measures strength of relationship
  • No prediction
  • Symmetric (r(X,Y) = r(Y,X))

Regression:

  • Predicts Y from X
  • Has equation
  • Asymmetric (different if you swap X and Y)

Relationship: r² = R² (in simple linear regression)
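
A small sketch confirming this relationship numerically on the earlier illustration data: squaring the Pearson correlation gives the same value as the model's R².

# r squared equals R-squared in simple linear regression
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

r = np.corrcoef(x, y)[0, 1]                         # Pearson correlation
model = LinearRegression().fit(x.reshape(-1, 1), y)
r2 = model.score(x.reshape(-1, 1), y)               # R-squared of the fit

print(f"r^2 = {r ** 2:.3f}, R-squared = {r2:.3f}")  # both 0.600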

Common Mistakes

1. Assuming causation Correlation ≠ Causation!

2. Extrapolating Don't predict outside data range

3. Ignoring residuals Check assumptions!

4. Using when nonlinear Curved relationship? Use different model

5. Ignoring outliers One point can change entire line

Practice Exercise

Data:

  • Years of Experience: 1, 2, 3, 4, 5
  • Salary ($1000s): 40, 45, 55, 60, 70

Tasks:

  1. Calculate slope and intercept
  2. Interpret the slope
  3. Predict salary at 6 years
  4. Calculate R²

Answers:

  1. Salary = 31.5 + 7.5(Years)
  2. Each additional year of experience → $7,500 increase in salary
  3. Salary = 31.5 + 7.5(6) = 76.5, i.e. about $76,500
  4. R² ≈ 0.99 (strong fit)
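
You can verify these answers with a few lines of scikit-learn; a minimal sketch:

# Check the practice answers
import numpy as np
from sklearn.linear_model import LinearRegression

years = np.array([[1], [2], [3], [4], [5]])
salary = np.array([40, 45, 55, 60, 70])             # in $1000s

model = LinearRegression().fit(years, salary)
print(f"Intercept: {model.intercept_:.1f}")         # ~31.5
print(f"Slope: {model.coef_[0]:.1f}")               # ~7.5
print(f"Salary at 6 years: {model.predict([[6]])[0]:.1f}")  # ~76.5 ($76,500)
print(f"R-squared: {model.score(years, salary):.2f}")       # ~0.99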

Next Steps

Learn about Model Assumptions!

Tip: Always plot your data before fitting regression!
