
E-commerce Sales Analysis Project — Complete Python Tutorial

This project covers everything recruiters look for: data cleaning, SQL, Python, customer segmentation, cohorts, and a Power BI dashboard. Clone this repo and you have a portfolio-ready project in 2 hours.

Level: Intermediate · Estimated time: 20 minutes

Project Overview

What You'll Build:

A complete e-commerce analytics project analyzing 500K+ transactions from an online retail store, answering business questions like:

  1. Customer Segmentation: Who are our best customers? (RFM analysis)
  2. Cohort Retention: How many customers return each month?
  3. Sales Trends: What drives revenue growth?
  4. Product Analysis: Which products are frequently bought together?
  5. Geographic Insights: Which countries generate most revenue?

Skills Demonstrated:

  • ✅ Data cleaning (handling nulls, duplicates, outliers)
  • ✅ SQL queries (aggregations, JOINs, window functions)
  • ✅ Python (pandas, matplotlib, seaborn)
  • ✅ RFM analysis & customer segmentation
  • ✅ Cohort retention analysis
  • ✅ Power BI dashboard (interactive visualizations)

Dataset:

Online Retail Dataset (Kaggle)

  • Size: 541,909 transactions
  • Period: Dec 2010 - Dec 2011 (12 months)
  • Features: InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, Country
  • Download: Kaggle Link

Project Structure:

```
ecommerce-analysis/
├── data/
│   ├── online_retail.csv
│   └── cleaned_data.csv
├── notebooks/
│   ├── 01_data_cleaning.ipynb
│   ├── 02_exploratory_analysis.ipynb
│   ├── 03_rfm_segmentation.ipynb
│   └── 04_cohort_analysis.ipynb
├── sql/
│   └── queries.sql
├── dashboard/
│   └── sales_dashboard.pbix
└── README.md
```

Step 1: Data Cleaning

Load and Inspect Data:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv('online_retail.csv', encoding='latin1')

# Basic inspection
print(f"Dataset shape: {df.shape}")
print(f"\nColumn types:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nFirst few rows:\n{df.head()}")
```

Output:

```
Dataset shape: (541909, 8)

Missing values:
InvoiceNo          0
StockCode          0
Description     1454
Quantity           0
InvoiceDate        0
UnitPrice          0
CustomerID    135080   ← 25% missing!
Country            0
```

Data Cleaning Steps:

```python
# 1. Remove rows with missing CustomerID (can't do customer analysis without it)
print(f"Before: {len(df)} rows")
df = df[df['CustomerID'].notna()]
print(f"After removing nulls: {len(df)} rows")

# 2. Remove canceled orders (InvoiceNo starts with 'C')
df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]
print(f"After removing cancellations: {len(df)} rows")

# 3. Remove invalid quantities and prices
df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)]
print(f"After removing invalid values: {len(df)} rows")

# 4. Convert InvoiceDate to datetime
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])

# 5. Create calculated columns
df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
df['Year'] = df['InvoiceDate'].dt.year
df['Month'] = df['InvoiceDate'].dt.month
df['YearMonth'] = df['InvoiceDate'].dt.to_period('M')

# 6. Remove outliers (orders above the 99th percentile of order value)
upper_limit = df['TotalPrice'].quantile(0.99)
df = df[df['TotalPrice'] <= upper_limit]

print(f"\nFinal dataset: {len(df)} rows")
print(f"Date range: {df['InvoiceDate'].min()} to {df['InvoiceDate'].max()}")
print(f"Unique customers: {df['CustomerID'].nunique()}")
print(f"Unique products: {df['StockCode'].nunique()}")
```

Save Cleaned Data:

```python
df.to_csv('cleaned_data.csv', index=False)
print("Cleaned data saved!")
```

Step 2: RFM Analysis & Customer Segmentation

What is RFM?

RFM stands for:

  • Recency: Days since last purchase
  • Frequency: Number of purchases
  • Monetary: Total amount spent

Calculate RFM Metrics:

```python
# Set analysis date (day after last transaction)
analysis_date = df['InvoiceDate'].max() + pd.Timedelta(days=1)

# Calculate RFM for each customer
rfm = df.groupby('CustomerID').agg({
    'InvoiceDate': lambda x: (analysis_date - x.max()).days,  # Recency
    'InvoiceNo': 'nunique',  # Frequency
    'TotalPrice': 'sum'  # Monetary
}).reset_index()

rfm.columns = ['CustomerID', 'Recency', 'Frequency', 'Monetary']

print(rfm.head())
print(f"\nRFM Summary:\n{rfm.describe()}")
```

Create RFM Scores (1-5 scale):

```python
# Score each metric (5 = best, 1 = worst)
# Recency: lower is better (recent purchase = 5)
# Frequency & Monetary: higher is better

rfm['R_Score'] = pd.qcut(rfm['Recency'], q=5, labels=[5,4,3,2,1])
rfm['F_Score'] = pd.qcut(rfm['Frequency'].rank(method='first'), q=5, labels=[1,2,3,4,5])
rfm['M_Score'] = pd.qcut(rfm['Monetary'], q=5, labels=[1,2,3,4,5])

# Combine into RFM Score
rfm['RFM_Score'] = rfm['R_Score'].astype(str) + rfm['F_Score'].astype(str) + rfm['M_Score'].astype(str)

# Simplified segment (high/medium/low)
rfm['RFM_Total'] = rfm[['R_Score', 'F_Score', 'M_Score']].astype(int).sum(axis=1)

def segment_customer(score):
    if score >= 12:
        return 'Champions'
    elif score >= 9:
        return 'Loyal'
    elif score >= 6:
        return 'Potential'
    else:
        return 'At Risk'

rfm['Segment'] = rfm['RFM_Total'].apply(segment_customer)

print(rfm['Segment'].value_counts())
print(f"\nSegment Revenue:\n{rfm.groupby('Segment')['Monetary'].agg(['count', 'sum', 'mean'])}")
```

Visualize RFM Segments:

```python
# Segment distribution
plt.figure(figsize=(10, 6))
segment_counts = rfm['Segment'].value_counts()
plt.bar(segment_counts.index, segment_counts.values, color=['#2ecc71', '#3498db', '#f39c12', '#e74c3c'])
plt.title('Customer Segmentation by RFM', fontsize=14, fontweight='bold')
plt.xlabel('Segment')
plt.ylabel('Number of Customers')
plt.xticks(rotation=45)
for i, v in enumerate(segment_counts.values):
    plt.text(i, v + 50, str(v), ha='center', fontweight='bold')
plt.tight_layout()
plt.savefig('rfm_segments.png', dpi=300)
plt.show()

# Revenue by segment
segment_revenue = rfm.groupby('Segment')['Monetary'].sum().sort_values(ascending=False)
plt.figure(figsize=(10, 6))
plt.bar(segment_revenue.index, segment_revenue.values, color=['#2ecc71', '#3498db', '#f39c12', '#e74c3c'])
plt.title('Total Revenue by Customer Segment', fontsize=14, fontweight='bold')
plt.xlabel('Segment')
plt.ylabel('Revenue (£)')
plt.xticks(rotation=45)
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout()
plt.savefig('segment_revenue.png', dpi=300)
plt.show()
```

Insight Example:

"Champions (top 15% of customers) generate 45% of total revenue despite being only 627 customers. Focus retention efforts here."



Step 3: Cohort Retention Analysis

What is Cohort Analysis?

Cohort analysis groups customers by the month of their first purchase, then tracks how each group behaves over time. It reveals how quickly retention decays month by month.

Create Cohorts:

```python
# Identify first purchase month for each customer
df['CohortMonth'] = df.groupby('CustomerID')['InvoiceDate'].transform('min').dt.to_period('M')

# Calculate months since first purchase
df['CohortIndex'] = (df['YearMonth'] - df['CohortMonth']).apply(lambda x: x.n)

# Create cohort table
cohort_data = df.groupby(['CohortMonth', 'CohortIndex'])['CustomerID'].nunique().reset_index()
cohort_pivot = cohort_data.pivot(index='CohortMonth', columns='CohortIndex', values='CustomerID')

print("Cohort Table (Customer Counts):")
print(cohort_pivot)
```

Calculate Retention Rates:

```python
# Retention rate = customers active in month N / customers in month 0
cohort_size = cohort_pivot.iloc[:, 0]
retention = cohort_pivot.divide(cohort_size, axis=0) * 100

print("\nRetention Rates (%):")
print(retention.round(1))
```

Visualize Cohort Heatmap:

```python
plt.figure(figsize=(12, 8))
sns.heatmap(retention, annot=True, fmt='.0f', cmap='RdYlGn', vmin=0, vmax=100, cbar_kws={'label': 'Retention %'})
plt.title('Cohort Retention Heatmap', fontsize=16, fontweight='bold')
plt.xlabel('Months Since First Purchase')
plt.ylabel('Cohort Month')
plt.tight_layout()
plt.savefig('cohort_retention.png', dpi=300)
plt.show()
```

Insight Example:

"Month 0 → Month 1 retention averages 35%. By Month 3, only 20% of customers return. Implement win-back campaign at 45-day mark."


Step 4: SQL Analysis (Alternative Approach)

If you have the data in a SQL database, here are equivalent queries:

Top Customers by Revenue:

```sql
-- Top 20 customers by total spend
SELECT
    CustomerID,
    COUNT(DISTINCT InvoiceNo) AS total_orders,
    SUM(Quantity) AS total_items,
    SUM(Quantity * UnitPrice) AS total_revenue
FROM online_retail
WHERE CustomerID IS NOT NULL
GROUP BY CustomerID
ORDER BY total_revenue DESC
LIMIT 20;
```

Monthly Sales Trend:

```sql
-- Revenue and orders by month
SELECT
    DATE_TRUNC('month', InvoiceDate) AS month,
    COUNT(DISTINCT InvoiceNo) AS orders,
    COUNT(DISTINCT CustomerID) AS unique_customers,
    SUM(Quantity * UnitPrice) AS revenue,
    SUM(Quantity * UnitPrice) / COUNT(DISTINCT InvoiceNo) AS avg_order_value
FROM online_retail
WHERE CustomerID IS NOT NULL
    AND Quantity > 0
    AND UnitPrice > 0
GROUP BY month
ORDER BY month;
```

Product Affinity (Market Basket):

```sql
-- Products frequently bought together
WITH product_pairs AS (
    SELECT
        a.InvoiceNo,
        a.Description AS product_a,
        b.Description AS product_b
    FROM online_retail a
    JOIN online_retail b ON a.InvoiceNo = b.InvoiceNo AND a.StockCode < b.StockCode
    WHERE a.CustomerID IS NOT NULL
)
SELECT
    product_a,
    product_b,
    COUNT(*) AS pair_count
FROM product_pairs
GROUP BY product_a, product_b
HAVING COUNT(*) > 50
ORDER BY pair_count DESC
LIMIT 20;
```
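If you don't have a database server handy, you can load the cleaned CSV into an in-memory SQLite database via `DataFrame.to_sql` and run most of these queries from Python. A sketch with a few hypothetical rows standing in for `cleaned_data.csv` (note SQLite lacks `DATE_TRUNC`; the monthly query would need `strftime('%Y-%m', InvoiceDate)` instead):

```python
import sqlite3
import pandas as pd

# Hypothetical rows standing in for cleaned_data.csv
df = pd.DataFrame({
    'InvoiceNo':  ['536365', '536365', '536366'],
    'CustomerID': [17850, 17850, 13047],
    'Quantity':   [6, 8, 2],
    'UnitPrice':  [2.55, 3.39, 4.25],
})

conn = sqlite3.connect(':memory:')          # or 'retail.db' to persist
df.to_sql('online_retail', conn, index=False)

# Same shape as the "top customers" query above
top = pd.read_sql("""
    SELECT CustomerID,
           COUNT(DISTINCT InvoiceNo) AS total_orders,
           SUM(Quantity * UnitPrice) AS total_revenue
    FROM online_retail
    WHERE CustomerID IS NOT NULL
    GROUP BY CustomerID
    ORDER BY total_revenue DESC
    LIMIT 20
""", conn)
print(top)
```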

Step 5: Power BI Dashboard

Dashboard Components:

  1. KPI Cards (top row):

    • Total Revenue: £9.8M
    • Total Orders: 25,900
    • Unique Customers: 4,372
    • Avg Order Value: £378
  2. Sales Trend (line chart):

    • X-axis: Month
    • Y-axis: Revenue
    • Show trend line
  3. Top Countries (bar chart):

    • UK: £8.2M (84%)
    • Germany: £0.9M (9%)
    • France: £0.4M (4%)
  4. Customer Segments (pie chart):

    • Champions: 45% revenue
    • Loyal: 30%
    • Potential: 18%
    • At Risk: 7%
  5. Cohort Retention (matrix visual):

    • Import retention table from Python
    • Conditional formatting (green → red)
  6. Filters (slicer):

    • Date range
    • Country
    • Customer segment
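Before wiring up the DAX measures, it helps to cross-check the KPI card values in pandas so the dashboard numbers match the notebook. A sketch with hypothetical transactions in place of the cleaned `df`:

```python
import pandas as pd

# Hypothetical cleaned transactions -- replace with the df from Step 1
df = pd.DataFrame({
    'InvoiceNo':  ['A1', 'A1', 'A2', 'A3'],
    'CustomerID': [1, 1, 2, 2],
    'TotalPrice': [10.0, 5.0, 20.0, 7.5],
})

# Same definitions as the four KPI cards
kpis = {
    'Total Revenue':    df['TotalPrice'].sum(),
    'Total Orders':     df['InvoiceNo'].nunique(),
    'Unique Customers': df['CustomerID'].nunique(),
}
kpis['Avg Order Value'] = kpis['Total Revenue'] / kpis['Total Orders']
print(kpis)
```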

Power BI Steps:

  1. Import Data: Get Data → Text/CSV → cleaned_data.csv
  2. Create Measures:
```dax
Total Revenue = SUM(online_retail[TotalPrice])
Total Orders = DISTINCTCOUNT(online_retail[InvoiceNo])
Unique Customers = DISTINCTCOUNT(online_retail[CustomerID])
Avg Order Value = [Total Revenue] / [Total Orders]
```
  3. Build Visuals: Drag fields onto the canvas
  4. Format: Apply a theme, adjust colors
  5. Publish: Save as .pbix and publish to Power BI Service, or share screenshots in your portfolio

Extension Challenges

Take this project further:

1. Advanced Analytics:

  • Predict Customer Churn: Build logistic regression model (scikit-learn) to predict which customers won't return
  • Product Recommendation: Implement collaborative filtering for "customers who bought X also bought Y"
  • Lifetime Value (LTV): Calculate expected revenue per customer based on cohort data
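For the LTV challenge, one common starting heuristic (an assumption, not the only definition) is average order value × purchase frequency × expected customer lifespan, with the lifespan estimated from how quickly the cohort retention curve decays. A sketch with hypothetical per-customer totals:

```python
import pandas as pd

# Hypothetical per-customer totals -- in the project these come from the rfm table
rfm = pd.DataFrame({
    'CustomerID': [1, 2, 3, 4],
    'Frequency':  [5, 2, 8, 1],               # orders per customer over the period
    'Monetary':   [500.0, 120.0, 900.0, 40.0],
})

# Heuristic: LTV ≈ avg order value x avg purchase frequency x expected lifespan
avg_order_value = rfm['Monetary'].sum() / rfm['Frequency'].sum()
avg_frequency = rfm['Frequency'].mean()
lifespan_years = 3                             # assumption; derive from cohort decay
ltv = avg_order_value * avg_frequency * lifespan_years
print(f"Estimated LTV: £{ltv:,.2f}")
```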

2. Additional Analyses:

  • Seasonality: Does revenue spike during holidays? (Christmas in UK)
  • Day-of-week patterns: Which days have highest sales?
  • Product categories: Create product taxonomy, analyze category trends
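Both seasonality questions reduce to a `groupby` on components of `InvoiceDate`. A sketch with hypothetical transactions in place of the cleaned `df`:

```python
import pandas as pd

# Hypothetical transactions -- replace with the cleaned df from Step 1
df = pd.DataFrame({
    'InvoiceDate': pd.to_datetime(['2011-11-07', '2011-11-08', '2011-12-05', '2011-12-06']),
    'TotalPrice':  [100.0, 80.0, 250.0, 300.0],
})

# Monthly seasonality: does December (holiday season) spike?
monthly = df.groupby(df['InvoiceDate'].dt.to_period('M'))['TotalPrice'].sum()
print(monthly)

# Day-of-week pattern: which weekday sells the most?
by_day = df.groupby(df['InvoiceDate'].dt.day_name())['TotalPrice'].sum().sort_values(ascending=False)
print(by_day)
```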

3. Data Engineering:

  • Automate ETL: Create Python script that cleans data automatically
  • Schedule dashboard refresh: Set up daily data refresh in Power BI Service
  • Database setup: Load data into PostgreSQL, practice SQL queries
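The ETL challenge mostly means wrapping the Step 1 cleaning logic into one reusable function you can schedule. A sketch that mirrors the earlier cleaning steps (the sample rows below are hypothetical):

```python
import pandas as pd

def clean_retail_data(df: pd.DataFrame, outlier_q: float = 0.99) -> pd.DataFrame:
    """Replicate the Step 1 cleaning pipeline as a single reusable function."""
    df = df[df['CustomerID'].notna()]                              # drop anonymous rows
    df = df[~df['InvoiceNo'].astype(str).str.startswith('C')]      # drop cancellations
    df = df[(df['Quantity'] > 0) & (df['UnitPrice'] > 0)].copy()   # drop invalid values
    df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
    df['TotalPrice'] = df['Quantity'] * df['UnitPrice']
    return df[df['TotalPrice'] <= df['TotalPrice'].quantile(outlier_q)]

# Hypothetical raw rows to demonstrate the function
raw = pd.DataFrame({
    'InvoiceNo':   ['536365', 'C536379', '536380'],
    'CustomerID':  [17850.0, 14527.0, None],
    'Quantity':    [6, -1, 4],
    'UnitPrice':   [2.55, 27.5, 4.25],
    'InvoiceDate': ['2010-12-01 08:26', '2010-12-01 09:41', '2010-12-01 09:41'],
})
cleaned = clean_retail_data(raw)
print(cleaned)
```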

4. Presentation:

  • Write blog post: "5 Insights from Analyzing 500K E-commerce Transactions"
  • Create video walkthrough: 5-minute YouTube demo of your analysis
  • LinkedIn post: Share 3 key insights with visualizations
