
Exploratory Data Analysis — The Analyst's Playbook

EDA is where insights hide. Skip it, and you'll miss patterns, outliers, and relationships that make the difference between average analysis and excellent analysis.

📚 Intermediate · ⏱️ 13 min · 10 quizzes

🔍 What is Exploratory Data Analysis?

Exploratory Data Analysis (EDA) is the process of investigating datasets to understand their structure, spot patterns, detect anomalies, and generate hypotheses — before formal modeling or reporting.

Why EDA Matters

1. Prevents Garbage In, Garbage Out. You can't build good models or reports on bad data. EDA catches data quality issues early: missing values, duplicates, wrong data types, outliers.

2. Reveals Hidden Patterns. A summary statistic like "mean = 2500" hides the story. EDA shows: Is the data skewed? Bimodal? Are there subgroups? Unexpected correlations?

3. Guides Analysis Strategy. Discovering that 60% of values fall in one category changes your approach. Finding strong correlations suggests relationships to investigate. EDA informs what questions to ask.

4. Builds Intuition. The more time you spend with data, the better you understand its quirks, limitations, and potential. EDA is how analysts develop "data sense".

The EDA Mindset

Don't approach data with a fixed hypothesis. Start open-minded:

  • "What patterns exist here?"
  • "What's unusual or surprising?"
  • "What relationships can I see?"
  • "What questions should I ask?"

Later, you'll confirm hypotheses. EDA is the discovery phase.

Think of it this way...

EDA is like a detective examining a crime scene. You don't jump to conclusions — you observe, note anomalies, look for patterns, and let evidence guide your investigation. Only then do you form theories.

📋 The Systematic EDA Workflow

EDA isn't random — follow this systematic approach every time.

1. First Look — High-Level Overview

Python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load data
df = pd.read_csv('swiggy_orders.csv')

# Basic info
print(df.shape)  # (100000, 12) — 100k orders, 12 features
print(df.columns)
print(df.dtypes)
print(df.head(10))
print(df.tail(10))  # Check for sorting, data loading issues

What to look for:

  • How many rows and columns?
  • What are the column names? Do they make sense?
  • Are data types correct? (dates as datetime, numbers as int/float)
  • First/last rows — any obvious issues?
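
A frequent first-look finding is columns loaded with the wrong dtype: dates and numbers read in as strings. A minimal sketch of the fix, using a toy frame rather than the Swiggy file:

```python
import pandas as pd

# Toy frame: CSV loaders often leave dates and numbers as strings
df = pd.DataFrame({
    'order_date': ['2024-01-05', '2024-01-06'],
    'order_amount': ['499', '1250'],
})

# Fix the dtypes before any analysis
df['order_date'] = pd.to_datetime(df['order_date'])
df['order_amount'] = pd.to_numeric(df['order_amount'])

print(df.dtypes)  # datetime64 and an integer dtype, not object
```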

2. Data Quality Check

Python
# Missing values
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
print(pd.DataFrame({'Missing': missing, 'Percentage': missing_pct}).sort_values('Missing', ascending=False))

# Duplicates
print(f"\nDuplicate rows: {df.duplicated().sum()}")

# Unique values per column
print("\nUnique values:")
for col in df.columns:
    print(f"  {col}: {df[col].nunique()} unique values")

# Sample values for categorical columns
for col in df.select_dtypes(include='object').columns:
    print(f"\n{col} — sample values:")
    print(df[col].value_counts().head())

What to look for:

  • High missing value percentages (>20%)
  • Unexpected duplicates
  • Columns with only 1 unique value (drop them — no information)
  • Categorical columns with too many categories (need grouping?)
  • Strange values (typos, placeholder text like "Unknown", "N/A")
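
Two of these checks are easy to script. A sketch on a toy frame (the column names are invented for illustration): drop constant columns, and convert placeholder strings into real NaN so they show up in missing-value counts:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'source': ['app', 'app', 'app', 'app'],         # constant: carries no information
    'city': ['Mumbai', 'Unknown', 'Delhi', 'N/A'],  # contains placeholder text
})

# Drop columns with a single unique value
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=constant_cols)

# Turn common placeholder strings into real NaN
df = df.replace(['Unknown', 'N/A', 'n/a', '-', '?'], np.nan)

print(constant_cols)                    # ['source']
print(int(df['city'].isnull().sum()))   # 2
```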

3. Univariate Analysis — One Variable at a Time

Understand each variable in isolation before looking at relationships.

For Numeric Variables:

Python
# Statistical summary
print(df.describe())

# Detailed stats for key variables
for col in ['order_amount', 'delivery_time', 'rating']:
    print(f"\n{col}:")
    print(f"  Mean: {df[col].mean():.2f}")
    print(f"  Median: {df[col].median():.2f}")
    print(f"  Std Dev: {df[col].std():.2f}")
    print(f"  Min: {df[col].min():.2f}")
    print(f"  Max: {df[col].max():.2f}")
    print(f"  25th percentile: {df[col].quantile(0.25):.2f}")
    print(f"  75th percentile: {df[col].quantile(0.75):.2f}")

# Visualize distributions
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

sns.histplot(df['order_amount'], bins=50, kde=True, ax=axes[0, 0])
axes[0, 0].set_title('Order Amount Distribution')

sns.boxplot(y=df['order_amount'], ax=axes[0, 1])
axes[0, 1].set_title('Order Amount — Boxplot (Outliers)')

sns.histplot(df['delivery_time'], bins=30, kde=True, ax=axes[1, 0], color='coral')
axes[1, 0].set_title('Delivery Time Distribution')

sns.boxplot(y=df['delivery_time'], ax=axes[1, 1], color='coral')
axes[1, 1].set_title('Delivery Time — Boxplot')

plt.tight_layout()
plt.show()

What to look for:

  • Shape: Normal? Skewed? Bimodal (two peaks)?
  • Central tendency: Mean vs median (if far apart, data is skewed)
  • Spread: Large standard deviation = high variability
  • Outliers: Box plot shows values far from IQR
  • Range: Min/max make business sense? (negative amounts? 200-minute delivery?)
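
The mean-vs-median check can be quantified with the skewness statistic. A toy example with one extreme order:

```python
import pandas as pd

# Toy right-skewed sample: one huge order drags the mean up
s = pd.Series([100, 105, 110, 120, 130, 2000])

print(f"Mean: {s.mean():.1f}")      # pulled far above typical values
print(f"Median: {s.median():.1f}")  # robust to the outlier
print(f"Skewness: {s.skew():.2f}")  # > 0 means a long right tail
```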

For Categorical Variables:

Python
# Frequency counts
print(df['city'].value_counts())
print(df['order_status'].value_counts(normalize=True) * 100)  # As percentages

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

df['city'].value_counts().head(10).plot(kind='barh', ax=axes[0], color='steelblue')
axes[0].set_title('Top 10 Cities by Order Count')
axes[0].set_xlabel('Orders')

df['order_status'].value_counts().plot(kind='pie', autopct='%1.1f%%', ax=axes[1])
axes[1].set_title('Order Status Distribution')
axes[1].set_ylabel('')

plt.tight_layout()
plt.show()

What to look for:

  • Class imbalance: One category dominates? (90% delivered, 10% cancelled)
  • Rare categories: Categories with 1% data (group as "Other"?)
  • Unexpected values: Typos ("Mumbia" vs "Mumbai"), inconsistent formats
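
Lumping rare categories takes a few lines with value_counts and where. A sketch on toy status labels (the 2% threshold is an arbitrary choice, not a rule):

```python
import pandas as pd

s = pd.Series(['Delivered'] * 90 + ['Cancelled'] * 8 + ['Refunded', 'Lost'])

# Group categories below a 2% share into "Other"
share = s.value_counts(normalize=True)
rare = share[share < 0.02].index
s_grouped = s.where(~s.isin(rare), 'Other')

print(s_grouped.value_counts())  # Delivered, Cancelled, and a combined "Other"
```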

4. Bivariate Analysis — Relationships Between Variables

Explore how variables relate to each other.

Numeric vs Numeric:

Python
# Correlation matrix
corr = df[['order_amount', 'delivery_time', 'rating', 'distance_km']].corr()

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0, fmt='.2f', linewidths=1)
plt.title('Correlation Matrix')
plt.show()

# Scatter plots for key relationships
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

axes[0].scatter(df['delivery_time'], df['rating'], alpha=0.3, s=10)
axes[0].set_xlabel('Delivery Time (min)')
axes[0].set_ylabel('Rating')
axes[0].set_title('Delivery Time vs Rating')

axes[1].scatter(df['distance_km'], df['delivery_time'], alpha=0.3, s=10, color='coral')
axes[1].set_xlabel('Distance (km)')
axes[1].set_ylabel('Delivery Time (min)')
axes[1].set_title('Distance vs Delivery Time')

axes[2].scatter(df['order_amount'], df['rating'], alpha=0.3, s=10, color='green')
axes[2].set_xlabel('Order Amount (₹)')
axes[2].set_ylabel('Rating')
axes[2].set_title('Order Amount vs Rating')

plt.tight_layout()
plt.show()

What to look for:

  • Strong correlations (|r| > 0.7): Variables move together
  • Negative correlations: One increases, other decreases
  • Non-linear patterns: Correlation = 0 doesn't mean no relationship (might be curved)
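
To see why near-zero correlation can hide a real relationship, consider a perfectly curved one. Binning one variable and checking the other's mean per bin is one simple way to expose the shape:

```python
import numpy as np
import pandas as pd

x = pd.Series(np.linspace(-3, 3, 101))
y = x ** 2  # y is completely determined by x, but the relationship is curved

print(f"Pearson r = {x.corr(y):.3f}")  # ~0: linear correlation misses it entirely

# Mean of y per bin of x reveals the U shape a scatter plot would show
binned = y.groupby(pd.cut(x, bins=6), observed=True).mean()
print(binned.round(2))  # high at both ends, low in the middle
```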

Categorical vs Numeric:

Python
# Compare numeric variable across categories
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.boxplot(data=df, x='city', y='order_amount', ax=axes[0])
axes[0].set_xticklabels(axes[0].get_xticklabels(), rotation=45, ha='right')
axes[0].set_title('Order Amount by City')

sns.violinplot(data=df, x='order_status', y='delivery_time', ax=axes[1])
axes[1].set_title('Delivery Time by Order Status')

plt.tight_layout()
plt.show()

# Summary statistics by group
print(df.groupby('city')['order_amount'].agg(['mean', 'median', 'std', 'count']))

What to look for:

  • Differences between groups: Mumbai orders 30% higher than Delhi?
  • Overlapping distributions: If box plots heavily overlap, the difference between groups may not be meaningful
  • Sample size imbalances: Bangalore 100 orders vs Mumbai 10,000 (unequal comparison)
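
Whether an observed group gap could be mere noise is a statistics question beyond plain EDA, but a quick sanity check is a non-parametric test such as Mann-Whitney U. A sketch on simulated order amounts (not real data):

```python
import numpy as np
from scipy import stats

# Simulated order amounts for two cities, purely for illustration
rng = np.random.default_rng(42)
mumbai = rng.normal(500, 100, size=1000)
delhi = rng.normal(450, 100, size=1000)

# Mann-Whitney U: non-parametric test of whether one group tends to be larger
stat, p = stats.mannwhitneyu(mumbai, delhi, alternative='two-sided')
print(f"U = {stat:.0f}, p = {p:.3g}")
# A tiny p-value means the gap is very unlikely to be sampling noise alone
```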

Categorical vs Categorical:

Python
# Cross-tabulation
ct = pd.crosstab(df['city'], df['order_status'], normalize='index') * 100
print(ct.round(1))

# Heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(ct, annot=True, cmap='YlOrRd', fmt='.1f')
plt.title('Order Status by City (%)')
plt.xlabel('Order Status')
plt.ylabel('City')
plt.show()

What to look for:

  • Patterns: Delhi has 15% cancellation rate, Mumbai only 5% — why?
  • Independence: Are variables related or independent?
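
The independence question has a classic answer: a chi-square test on the contingency table. A sketch with toy counts (the numbers below are invented for illustration):

```python
import pandas as pd
from scipy import stats

# Toy contingency table of raw counts: city vs order status
ct = pd.DataFrame(
    {'Delivered': [850, 950], 'Cancelled': [150, 50]},
    index=['Delhi', 'Mumbai'],
)

# Chi-square test of independence: are city and status related?
chi2, p, dof, expected = stats.chi2_contingency(ct)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2g}")
# A small p-value says cancellation rate varies by city beyond chance
```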

🚀 Advanced EDA Techniques

Beyond basics, these techniques reveal deeper insights.

1. Segmentation Analysis

Break data into meaningful groups and compare.

Python
# Create customer segments
df['customer_segment'] = pd.cut(
    df.groupby('customer_id')['order_amount'].transform('sum'),
    bins=[0, 5000, 20000, np.inf],  # open-ended top bin so the biggest spenders aren't dropped as NaN
    labels=['Low-Value', 'Mid-Value', 'High-Value']
)

# Compare segments
segment_stats = df.groupby('customer_segment').agg({
    'order_amount': ['mean', 'median'],
    'rating': 'mean',
    'order_id': 'count'
})
print(segment_stats)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

segment_stats['order_amount']['mean'].plot(kind='bar', ax=axes[0], color='steelblue')
axes[0].set_title('Average Order Value by Customer Segment')
axes[0].set_ylabel('Average Order Amount (₹)')

segment_stats['rating']['mean'].plot(kind='bar', ax=axes[1], color='coral')
axes[1].set_title('Average Rating by Customer Segment')
axes[1].set_ylabel('Average Rating')

plt.tight_layout()
plt.show()

2. Time-Series Patterns

If you have date/time data, look for trends and seasonality.

Python
# Convert to datetime
df['order_date'] = pd.to_datetime(df['order_date'])

# Extract time features
df['month'] = df['order_date'].dt.month
df['day_of_week'] = df['order_date'].dt.dayofweek  # 0=Monday, 6=Sunday
df['hour'] = df['order_date'].dt.hour

# Daily trends
daily_orders = df.groupby(df['order_date'].dt.date).size()

plt.figure(figsize=(14, 5))
daily_orders.plot(color='steelblue', linewidth=1.5)
plt.title('Daily Order Volume — Time Series')
plt.xlabel('Date')
plt.ylabel('Orders')
plt.grid(True, alpha=0.3)
plt.show()

# Day of week pattern
dow_orders = df.groupby('day_of_week').size().reindex(range(7), fill_value=0)  # keep all 7 days even if one is absent
dow_labels = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

plt.figure(figsize=(10, 5))
plt.bar(range(7), dow_orders.values, tick_label=dow_labels, color='coral')
plt.title('Orders by Day of Week')
plt.xlabel('Day')
plt.ylabel('Orders')
plt.show()

# Hourly pattern
hourly_orders = df.groupby('hour').size()

plt.figure(figsize=(12, 5))
plt.plot(hourly_orders.index, hourly_orders.values, marker='o', linewidth=2, color='green')
plt.title('Orders by Hour of Day')
plt.xlabel('Hour')
plt.ylabel('Orders')
plt.xticks(range(24))
plt.grid(True, alpha=0.3)
plt.show()

What to look for:

  • Trends: Growing? Declining? Stable?
  • Seasonality: Weekly patterns (weekend spike?), monthly patterns (end-of-month?)
  • Anomalies: Sudden drops/spikes (holidays? marketing campaigns? outages?)
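
One simple way to separate trend from weekly seasonality is a 7-day rolling mean, which averages out the weekly cycle. A sketch on simulated daily volumes:

```python
import numpy as np
import pandas as pd

# Simulated daily orders: upward trend + weekend bump + noise (illustrative)
rng = np.random.default_rng(0)
dates = pd.date_range('2024-01-01', periods=90, freq='D')
daily = pd.Series(
    np.linspace(1000, 1300, 90)        # gradual upward trend
    + 150 * (dates.dayofweek >= 5)     # weekly seasonality: weekend spike
    + rng.normal(0, 30, 90),           # day-to-day noise
    index=dates,
)

# A 7-day window spans exactly one weekly cycle, so the bump averages out
smooth = daily.rolling(window=7).mean()
print(f"Week 1 avg: {smooth.iloc[6]:.0f}, week 13 avg: {smooth.iloc[-1]:.0f}")
```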

3. Outlier Detection

Python
# Z-score method
from scipy import stats

amounts = df['order_amount'].dropna()
z_scores = np.abs(stats.zscore(amounts))
outliers_z = df.loc[amounts.index[z_scores > 3]]  # align back to the original frame (dropna shortens the array)
print(f"Outliers (Z-score > 3): {len(outliers_z)}")

# IQR method
Q1 = df['order_amount'].quantile(0.25)
Q3 = df['order_amount'].quantile(0.75)
IQR = Q3 - Q1
outliers_iqr = df[(df['order_amount'] < Q1 - 1.5*IQR) | (df['order_amount'] > Q3 + 1.5*IQR)]
print(f"Outliers (IQR method): {len(outliers_iqr)}")

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(df['order_amount'], bins=100, color='steelblue', alpha=0.7)
axes[0].axvline(Q1 - 1.5*IQR, color='red', linestyle='--', label='Lower Bound')
axes[0].axvline(Q3 + 1.5*IQR, color='red', linestyle='--', label='Upper Bound')
axes[0].set_title('Order Amount Distribution with Outlier Bounds')
axes[0].legend()

sns.boxplot(y=df['order_amount'], ax=axes[1])
axes[1].set_title('Boxplot Highlighting Outliers')

plt.tight_layout()
plt.show()

# Inspect outliers
print("\nSample outliers:")
print(outliers_iqr[['order_id', 'order_amount', 'city', 'order_status']].head(10))

4. Missing Value Patterns

Missing data isn't random — patterns reveal insights.

Python
# Missing value matrix
import missingno as msno  # pip install missingno

msno.matrix(df, figsize=(12, 6))
plt.title('Missing Value Patterns')
plt.show()

# Correlation of missingness
missing_corr = df.isnull().corr()
plt.figure(figsize=(10, 8))
sns.heatmap(missing_corr, annot=True, cmap='coolwarm', center=0, fmt='.2f')
plt.title('Correlation of Missing Values')
plt.show()

What to look for:

  • Systematic missingness: Column A missing whenever B is missing (structural issue?)
  • MCAR (Missing Completely At Random): Random pattern — safe to drop or impute
  • MAR (Missing At Random): Related to other variables (model it!)
  • MNAR (Missing Not At Random): Missingness is informative (create "is_missing" flag)
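
The MNAR flag idea takes two lines: flag the missingness, then compare another variable across the flag. A sketch on a toy frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'rating': [5, np.nan, 4, np.nan, 1, 5, np.nan, 2],
    'order_amount': [500, 200, 450, 150, 400, 600, 180, 350],
})

# Flag missingness, then check whether it relates to another variable
df['rating_missing'] = df['rating'].isnull()
by_flag = df.groupby('rating_missing')['order_amount'].mean()
print(by_flag)
# If unrated orders are systematically cheaper, the missingness itself is
# informative: keep the flag rather than silently dropping or imputing rows
```
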
🛠️ EDA Tools and Libraries

Automate repetitive EDA tasks with these libraries.

pandas-profiling (ydata-profiling)

Generates comprehensive EDA reports in one line.

Python
# Install: pip install ydata-profiling
from ydata_profiling import ProfileReport

# Generate report
profile = ProfileReport(df, title="Swiggy Orders EDA", explorative=True)

# Save to HTML
profile.to_file("eda_report.html")

# Display in Jupyter
profile.to_widgets()

Report includes:

  • Overview (shape, missing values, duplicates)
  • Variable distributions (histograms, stats)
  • Correlations (heatmaps, scatter plots)
  • Missing value analysis
  • Sample data

sweetviz

Comparative analysis between datasets (train vs test, before vs after).

Python
# Install: pip install sweetviz
import sweetviz as sv

# Analyze single dataset
report = sv.analyze(df)
report.show_html("sweetviz_report.html")

# Compare two datasets
train_report = sv.compare([train_df, "Training"], [test_df, "Testing"])
train_report.show_html("train_vs_test.html")

dtale

Interactive data exploration in your browser.

Python
# Install: pip install dtale
import dtale

# Launch interactive viewer
d = dtale.show(df)
d.open_browser()

Features:

  • Filter, sort, search data interactively
  • Create visualizations with clicks
  • Export cleaned data
  • Describe, correlate, and profile columns
ℹ️ Pro Tip: Use automated tools (pandas-profiling) for initial exploration, then dive deeper with custom code for specific questions. Tools give breadth; custom analysis gives depth.

The Complete EDA Checklist

Use this checklist for every dataset:

Data Quality

  • [ ] Check shape (rows, columns)
  • [ ] Inspect first/last rows
  • [ ] Verify data types (convert if needed)
  • [ ] Identify missing values (count, percentage, pattern)
  • [ ] Find duplicates (full rows, key columns)
  • [ ] Check for placeholder values ("Unknown", -999, etc.)

Univariate Analysis

  • [ ] Summary statistics for numeric columns (mean, median, std, min, max, quartiles)
  • [ ] Distributions (histograms, KDE plots)
  • [ ] Outliers (box plots, IQR, Z-scores)
  • [ ] Frequency counts for categorical columns
  • [ ] Rare categories (< 1% of data)

Bivariate Analysis

  • [ ] Correlation matrix (numeric vs numeric)
  • [ ] Scatter plots for key relationships
  • [ ] Group comparisons (categorical vs numeric)
  • [ ] Cross-tabulations (categorical vs categorical)

Multivariate Analysis

  • [ ] Pair plots (all variables)
  • [ ] Segmentation analysis (customer groups, product categories)
  • [ ] Heatmaps (multi-way relationships)

Time Series (if applicable)

  • [ ] Time-based trends (daily, weekly, monthly)
  • [ ] Seasonality patterns
  • [ ] Anomalies and outliers over time

Documentation

  • [ ] Note assumptions and decisions
  • [ ] Flag data quality issues for stakeholders
  • [ ] List questions for domain experts
  • [ ] Document insights for further investigation
