What is Cohort Analysis: Track User Behavior Over Time?

Master cohort analysis to track user retention, lifetime value, and behavior patterns over time. Learn SQL queries, Python implementation, and real examples from Flipkart, PhonePe.

Is Cohort Analysis: Track User Behavior Over Time suitable for beginners?

This topic is designed for Intermediate level learners. It takes approximately 12 min to complete and includes 10 interactive quizzes to test your understanding.

How long does it take to learn Cohort Analysis: Track User Behavior Over Time?

You can complete this topic in about 12 min. The topic is part 58 of undefined in our comprehensive Data Analytics Learning Path.

Cohort Analysis Tutorial — Retention, LTV, Churn | DataPath

👥

What is Cohort Analysis?

Cohort analysis groups users who share a common characteristic (like signup month) and tracks their behavior over time to measure retention, engagement, or revenue patterns.

How It Works

1. Define Cohort (Group users by shared trait):

Cohort = "Users who signed up in January 2026"
→ 10,000 users joined Jan 1-31

2. Track Behavior Over Time:

Month 0 (Jan): 10,000 users signed up (100% present)
Month 1 (Feb): 4,500 users returned (45% retention)
Month 2 (Mar): 3,200 users returned (32% retention)
Month 3 (Apr): 2,700 users returned (27% retention)

3. Repeat for Each Cohort:

Jan cohort: 45% → 32% → 27% retention
Feb cohort: 48% → 35% → 30% retention (better!)
Mar cohort: 52% → 38% → 33% retention (even better!)

Insight: Retention improving over time (product getting stickier)

Why Cohort Analysis Matters

Without Cohorts (Overall metrics):

Overall active users: 50,000 in Jan → 55,000 in Feb → 60,000 in Mar
Conclusion: "Growth is good!" ✓

Reality check with cohorts:
- Jan cohort: 10K → 4.5K → 3.2K (losing 68% of users)
- New signups masking churn (acquiring 15K/month, churning 8K/month)
→ Leaky bucket problem (can't sustain growth)

With Cohorts (Time-based groups):

Cohort retention analysis:
Jan cohort: 45% Month 1 retention
Feb cohort: 48% Month 1 retention (after product fix)
Mar cohort: 52% Month 1 retention (fix worked!)

Insight: Product improvements increased retention 7% (45% → 52%)
→ Cohorts prove causality (improvement worked)

Benefits:

Measure product improvements — Did feature X increase retention? (compare cohorts before/after launch)
Predict LTV — If Month 1 retention is 50%, Month 6 is ~25% (extrapolate)
Identify churn patterns — When do users drop off? (Week 1? Month 3?)
Segment analysis — Do paid users retain better than free users?
Budget allocation — High-retention cohorts → Invest more in similar channels

Real Example: Flipkart Cohort Analysis

Goal: Measure retention of users acquired from different marketing channels.

Cohorts: Group users by acquisition month + channel:

Jan 2026 — Google Ads: 5,000 users
Jan 2026 — Facebook Ads: 3,000 users
Jan 2026 — Organic: 2,000 users

Retention Tracking (% who made ≥1 purchase in each subsequent month):

Channel       | Month 0 | Month 1 | Month 2 | Month 3 | LTV (3 months)
Google Ads    | 100%    | 35%     | 22%     | 18%     | ₹1,250
Facebook Ads  | 100%    | 42%     | 28%     | 23%     | ₹1,680
Organic       | 100%    | 58%     | 45%     | 38%     | ₹2,450

Insights:

Organic users retain 2× better than paid ads (58% vs 35-42% Month 1)
Facebook > Google for retention (42% vs 35% Month 1)
Organic LTV is 2× paid (₹2,450 vs ₹1,250-1,680)

Action:

Reallocate budget to Facebook (better ROI than Google)
Invest in organic channels (SEO, referrals) — higher LTV
Study organic users: Why do they stick around? (replicate behavior for paid users)

Think of it this way...

Cohort analysis is like tracking graduating classes in school. You group students by year (Class of 2020, 2021, 2022), then track: How many got jobs within 6 months? 1 year? 5 years? This reveals if career services improved over time (2022 cohort better outcomes than 2020 = improvement).

📊

Types of Cohorts

1. Time-Based Cohorts (Most Common)

Group users by when they first interacted with your product.

Examples:

Signup month cohort: Users who signed up in Jan 2026 vs Feb 2026
First purchase cohort: Users who made first purchase in Q1 vs Q2
App install cohort: Users who installed app in Week 1 vs Week 2

Use Case: Measure retention, track product improvements over time.

SQL Example (Signup month cohorts):

query.sqlSQL

SELECT
  DATE_TRUNC('month', signup_date) AS cohort_month,
  COUNT(DISTINCT user_id) AS cohort_size
FROM users
WHERE signup_date >= '2026-01-01'
GROUP BY cohort_month
ORDER BY cohort_month;

2. Behavior-Based Cohorts

Group users by what they did (action or event).

Examples:

Feature adoption cohort: Users who used feature X vs users who didn't
Purchase frequency cohort: Users who bought 1× vs 5+ times in first month
Engagement level cohort: Users with high engagement (10+ sessions/week) vs low (1-2 sessions)

Use Case: Segment users by behavior to identify power users vs casual users.

SQL Example (Purchase frequency cohorts):

query.sqlSQL

WITH first_month_purchases AS (
  SELECT
    user_id,
    COUNT(*) AS purchase_count_month_0
  FROM orders
  WHERE order_date BETWEEN signup_date AND signup_date + INTERVAL '30 days'
  GROUP BY user_id
)
SELECT
  CASE
    WHEN purchase_count_month_0 = 1 THEN '1 purchase'
    WHEN purchase_count_month_0 BETWEEN 2 AND 4 THEN '2-4 purchases'
    WHEN purchase_count_month_0 >= 5 THEN '5+ purchases'
  END AS cohort,
  COUNT(*) AS users
FROM first_month_purchases
GROUP BY cohort;

3. Acquisition Channel Cohorts

Group users by how they found your product.

Examples:

Google Ads vs Facebook Ads vs Organic
Referral vs Direct vs Email campaign
iOS app vs Android app vs Web

Use Case: Compare retention/LTV across channels to optimize marketing spend.

SQL Example:

query.sqlSQL

SELECT
  acquisition_channel,
  COUNT(DISTINCT user_id) AS cohort_size,
  AVG(lifetime_value) AS avg_ltv
FROM users
WHERE signup_date >= '2026-01-01'
GROUP BY acquisition_channel
ORDER BY avg_ltv DESC;

4. Demographic Cohorts

Group users by who they are (attributes).

Examples:

Age groups (18-24, 25-34, 35-44)
Geography (Tier 1 cities vs Tier 2 vs Tier 3)
Device type (iOS vs Android)

Use Case: Personalize product for different user segments.

Example — Zomato Geography Cohorts:

Tier 1 cities (Mumbai, Delhi, Bangalore):
- 30-day retention: 55%
- Avg order frequency: 6 orders/month

Tier 2 cities (Pune, Jaipur, Chandigarh):
- 30-day retention: 48%
- Avg order frequency: 4 orders/month

Insight: Tier 1 users order 50% more often (6 vs 4) → Focus retention efforts on Tier 2

⚠️ CheckpointQuiz error: Missing or invalid options array

📋

Building Retention Cohort Tables

Retention cohort table shows % of each cohort that returned in subsequent time periods.

Structure

| Cohort | Month 0 | Month 1 | Month 2 | Month 3 | Month 6 | Month 12 | |--------|---------|---------|---------|---------|---------|----------| | Jan 2026 | 100% | 45% | 32% | 27% | 18% | 12% | | Feb 2026 | 100% | 48% | 35% | 30% | 21% | - | | Mar 2026 | 100% | 52% | 38% | 33% | - | - | | Apr 2026 | 100% | 55% | 42% | - | - | - |

Reading the table:

Jan 2026 cohort: 45% returned in Month 1, 32% in Month 2, etc.
Diagonal pattern: Newer cohorts have less time elapsed (can't measure Month 12 for Apr cohort yet)
Improving trend: Month 1 retention increased from 45% → 55% (product getting stickier)

SQL Query — Retention Cohort Table

Schema:

query.sqlSQL

users (user_id, signup_date)
events (user_id, event_date, event_type)

Query (Monthly retention cohorts):

query.sqlSQL

WITH cohorts AS (
  -- Define cohort (signup month)
  SELECT
    user_id,
    DATE_TRUNC('month', signup_date) AS cohort_month
  FROM users
),
user_activity AS (
  -- Get user activity by month
  SELECT
    user_id,
    DATE_TRUNC('month', event_date) AS activity_month
  FROM events
  WHERE event_type = 'purchase'  -- or 'login', 'session_start', etc.
  GROUP BY user_id, activity_month
),
cohort_activity AS (
  -- Join cohorts with activity, calculate months since signup
  SELECT
    c.cohort_month,
    c.user_id,
    a.activity_month,
    DATEDIFF('month', c.cohort_month, a.activity_month) AS months_since_signup
  FROM cohorts c
  LEFT JOIN user_activity a ON c.user_id = a.user_id
),
cohort_sizes AS (
  -- Count cohort size (Month 0)
  SELECT
    cohort_month,
    COUNT(DISTINCT user_id) AS cohort_size
  FROM cohorts
  GROUP BY cohort_month
),
retention_by_month AS (
  -- Count active users per cohort per month
  SELECT
    cohort_month,
    months_since_signup,
    COUNT(DISTINCT user_id) AS active_users
  FROM cohort_activity
  WHERE activity_month IS NOT NULL
  GROUP BY cohort_month, months_since_signup
)
-- Calculate retention rate
SELECT
  r.cohort_month,
  r.months_since_signup,
  r.active_users,
  cs.cohort_size,
  ROUND(100.0 * r.active_users / cs.cohort_size, 1) AS retention_pct
FROM retention_by_month r
JOIN cohort_sizes cs ON r.cohort_month = cs.cohort_month
ORDER BY r.cohort_month, r.months_since_signup;

Output:

cohort_month | months_since_signup | active_users | cohort_size | retention_pct
2026-01-01   | 0                   | 10000        | 10000       | 100.0
2026-01-01   | 1                   | 4500         | 10000       | 45.0
2026-01-01   | 2                   | 3200         | 10000       | 32.0
2026-02-01   | 0                   | 12000        | 12000       | 100.0
2026-02-01   | 1                   | 5760         | 12000       | 48.0
...

Python Visualization — Retention Heatmap

code.pyPython

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load SQL results
df = pd.read_sql(query, connection)

# Pivot to cohort table format
cohort_table = df.pivot_table(
    index='cohort_month',
    columns='months_since_signup',
    values='retention_pct'
)

# Heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(
    cohort_table,
    annot=True,         # Show percentages in cells
    fmt='.0f',          # Format as integers
    cmap='RdYlGn',      # Red (low) to Green (high)
    vmin=0,
    vmax=100,
    cbar_kws={'label': 'Retention %'}
)
plt.title('Monthly Retention Cohort Analysis')
plt.xlabel('Months Since Signup')
plt.ylabel('Cohort (Signup Month)')
plt.tight_layout()
plt.show()

Output: Heatmap with green cells (high retention) and red cells (low retention).

Interpreting Retention Curves

Healthy Retention Curve (Logarithmic decay):

Month 0: 100%
Month 1: 40-60%  (sharp drop — expected)
Month 2: 30-45%  (slower drop)
Month 3: 25-40%  (flattening)
Month 6: 20-35%  (stable)
Month 12: 15-30% (plateau)

Pattern: Sharp initial drop, then flatten (users who stay past Month 3 are sticky)

Unhealthy Retention Curve (Linear decay):

Month 0: 100%
Month 1: 30%
Month 2: 20%
Month 3: 10%
Month 6: 5%
Month 12: 2%

Pattern: Continuous linear decay (never plateaus) — leaky bucket, no product-market fit

Info

Key Metric: If Month 6 retention ≥ 20%, you have product-market fit. If < 10%, users aren't finding long-term value (high churn).

🚀

Advanced Cohort Analysis Techniques

1. Cumulative Cohort Revenue (LTV Projection)

Track cumulative revenue per cohort to project lifetime value.

SQL Query:

query.sqlSQL

WITH cohort_revenue AS (
  SELECT
    DATE_TRUNC('month', u.signup_date) AS cohort_month,
    DATEDIFF('month', DATE_TRUNC('month', u.signup_date),
             DATE_TRUNC('month', o.order_date)) AS months_since_signup,
    SUM(o.order_value) AS revenue
  FROM users u
  JOIN orders o ON u.user_id = o.user_id
  GROUP BY cohort_month, months_since_signup
),
cumulative AS (
  SELECT
    cohort_month,
    months_since_signup,
    SUM(revenue) OVER (
      PARTITION BY cohort_month
      ORDER BY months_since_signup
    ) AS cumulative_revenue
  FROM cohort_revenue
)
SELECT
  cohort_month,
  months_since_signup,
  cumulative_revenue /
    (SELECT COUNT(*) FROM users WHERE DATE_TRUNC('month', signup_date) = cohort_month)
    AS revenue_per_user
FROM cumulative;

Output:

Jan 2026 cohort:
Month 0: ₹450 per user
Month 1: ₹780 per user (cumulative)
Month 2: ₹980 per user
Month 6: ₹1,650 per user
→ Projected 12-month LTV: ~₹2,000

Use Case: If Month 6 LTV is ₹1,650 and CAC is ₹1,200, LTV/CAC = 1.4 (marginal). Need to improve retention or reduce CAC.

2. Cohort Churn Analysis

Measure when users churn (time to churn).

SQL Query (Days to last activity):

query.sqlSQL

WITH last_activity AS (
  SELECT
    user_id,
    MAX(event_date) AS last_active_date
  FROM events
  GROUP BY user_id
),
churn_time AS (
  SELECT
    u.user_id,
    DATE_TRUNC('month', u.signup_date) AS cohort_month,
    DATEDIFF('day', u.signup_date, la.last_active_date) AS days_to_churn
  FROM users u
  JOIN last_activity la ON u.user_id = la.user_id
  WHERE la.last_active_date < CURRENT_DATE - INTERVAL '30 days'  -- Inactive for 30+ days = churned
)
SELECT
  cohort_month,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY days_to_churn) AS median_days_to_churn,
  AVG(days_to_churn) AS avg_days_to_churn
FROM churn_time
GROUP BY cohort_month
ORDER BY cohort_month;

Output:

Jan cohort: Median churn at 18 days, Avg 35 days
Feb cohort: Median churn at 24 days, Avg 42 days (improving!)

Insight: Product fixes delayed churn by 6 days (18 → 24 median)

3. Segmented Cohort Analysis

Compare cohorts across segments (channel, device, plan tier).

Example — Swiggy: Paid vs Free Delivery Pass Users:

query.sqlSQL

WITH cohorts AS (
  SELECT
    user_id,
    DATE_TRUNC('month', signup_date) AS cohort_month,
    CASE WHEN has_delivery_pass THEN 'Paid Pass' ELSE 'Free' END AS segment
  FROM users
),
retention AS (
  -- (Same retention logic as before, grouped by segment)
  ...
)
SELECT
  cohort_month,
  segment,
  months_since_signup,
  retention_pct
FROM retention
ORDER BY cohort_month, segment, months_since_signup;

Output:

Paid Pass users: 75% Month 1 retention, 60% Month 3
Free users: 35% Month 1 retention, 20% Month 3

Insight: Paid Pass users retain 2× better (invest in converting free → paid)

4. Cohort A/B Testing

Compare cohorts exposed to product changes vs control.

Example — PhonePe: New Onboarding Flow:

Cohorts:
- Control (Jan 1-15): Old onboarding (10K users)
- Treatment (Jan 16-31): New onboarding (10K users)

Results:
Control: 40% Month 1 retention
Treatment: 48% Month 1 retention

Conclusion: New onboarding increased retention 20% (40% → 48%) — Deploy to all users

⚠️ FinalQuiz error: Missing or invalid questions array

⚠️ SummarySection error: Missing or invalid items array

Received: {"hasItems":false,"isArray":false}