Interview Stages: What to Expect
Typical Data Analyst Interview Process
Stage 1: HR Screening (30 min phone call)
- Background, motivation, salary expectations
- Basic technical questions ("What's your experience with SQL?")
- Outcome: 60% pass → Move to Stage 2
Stage 2: SQL/Technical Test (60-90 min)
- Live coding (write SQL queries on screen) or take-home (48-72 hours)
- 3-5 problems: Joins, aggregations, window functions
- Outcome: 40% pass → Move to Stage 3
Stage 3: Take-Home Assignment (1 week)
- Real business problem: "Analyze this dataset, create dashboard, present insights"
- Deliverable: Jupyter notebook + Tableau dashboard + presentation (PDF/PPT)
- Outcome: 50% pass → Move to Stage 4
Stage 4: Case Study Interview (60-90 min)
- Interviewer gives business problem: "Our sales dropped 20% last month. How would you investigate?"
- You: Ask clarifying questions → Propose analysis approach → Walk through methodology
- Outcome: 60% pass → Move to Stage 5
Stage 5: Behavioral + Culture Fit (30-60 min)
- "Tell me about a time you disagreed with a stakeholder"
- "Why do you want to work here?"
- Outcome: 70% pass → Offer
Total timeline: 2-6 weeks (varies by company)
Success rates (cumulative): 60% × 40% × 50% × 60% × 70% ≈ 5% overall (roughly 1 offer per 20 screened applicants). Prepare thoroughly for each stage to beat the odds.
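Multiplying the per-stage rates is a quick sanity check, a one-liner in Python:

```python
from functools import reduce

# Per-stage pass rates from the pipeline above
stage_pass_rates = [0.60, 0.40, 0.50, 0.60, 0.70]

# Chance of clearing every stage = product of the per-stage rates
overall = reduce(lambda acc, rate: acc * rate, stage_pass_rates)
print(f"{overall:.1%}")  # 5.0%
```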
SQL Interview Questions (Beginner to Advanced)
Beginner SQL Questions
Q1: Write a query to find the top 5 products by total revenue.
Schema:
orders (order_id, product_id, quantity, price, order_date)
Answer:
SELECT
product_id,
SUM(quantity * price) AS total_revenue
FROM orders
GROUP BY product_id
ORDER BY total_revenue DESC
LIMIT 5;
Follow-up: "What if you want top 5 products per category?"
WITH product_revenue AS (
SELECT
category_id,
product_id,
SUM(quantity * price) AS total_revenue
FROM orders
JOIN products USING (product_id)
GROUP BY category_id, product_id
),
ranked AS (
SELECT
category_id,
product_id,
total_revenue,
ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY total_revenue DESC) AS revenue_rank
FROM product_revenue
)
SELECT category_id, product_id, total_revenue
FROM ranked
WHERE revenue_rank <= 5;
Q2: Calculate month-over-month growth rate for revenue.
Schema:
orders (order_id, revenue, order_date)
Answer:
WITH monthly_revenue AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(revenue) AS total_revenue
FROM orders
GROUP BY month
)
SELECT
month,
total_revenue,
LAG(total_revenue) OVER (ORDER BY month) AS prev_month_revenue,
(total_revenue - LAG(total_revenue) OVER (ORDER BY month)) * 100.0 /
LAG(total_revenue) OVER (ORDER BY month) AS growth_rate
FROM monthly_revenue;
Key concept: Use the LAG() window function to access the previous row's value.
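If the interviewer lets you use pandas instead of SQL, `pct_change` collapses the LAG-and-divide pattern into a single call. The tiny DataFrame below is made-up illustrative data:

```python
import pandas as pd

# Illustrative monthly totals
monthly = pd.DataFrame({
    'month': ['2024-01', '2024-02', '2024-03'],
    'total_revenue': [1000.0, 1200.0, 1080.0],
})

# pct_change() = (current - previous) / previous, mirroring LAG() in SQL
monthly['growth_rate'] = monthly['total_revenue'].pct_change() * 100
print(monthly)  # growth_rate: NaN, 20.0, -10.0
```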
Intermediate SQL Questions
Q3: Find users who made purchases in Jan 2024 but NOT in Feb 2024 (churned users).
Schema:
orders (order_id, user_id, order_date)
Answer:
SELECT DISTINCT user_id
FROM orders
WHERE DATE_TRUNC('month', order_date) = '2024-01-01'
AND user_id NOT IN (
SELECT DISTINCT user_id
FROM orders
WHERE DATE_TRUNC('month', order_date) = '2024-02-01'
);
Better approach (avoids NOT IN's NULL pitfalls):
SELECT DISTINCT jan.user_id
FROM orders jan
LEFT JOIN orders feb
ON jan.user_id = feb.user_id
AND DATE_TRUNC('month', feb.order_date) = '2024-02-01'
WHERE DATE_TRUNC('month', jan.order_date) = '2024-01-01'
AND feb.user_id IS NULL;
Q4: Calculate cumulative sum of revenue by day.
Answer:
SELECT
order_date,
SUM(revenue) AS daily_revenue,
SUM(SUM(revenue)) OVER (ORDER BY order_date) AS cumulative_revenue
FROM orders
GROUP BY order_date
ORDER BY order_date;
Key concept: The window function SUM() OVER (ORDER BY ...) computes a running total.
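The same running total in pandas, for interviews where tooling is your choice (column names here are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    'order_date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-03'],
    'revenue': [100, 50, 200, 25],
})

# Aggregate to daily revenue, then take a cumulative sum down the rows
daily = df.groupby('order_date', as_index=False)['revenue'].sum()
daily = daily.rename(columns={'revenue': 'daily_revenue'})
daily['cumulative_revenue'] = daily['daily_revenue'].cumsum()
print(daily)  # daily: 150, 200, 25 -> cumulative: 150, 350, 375
```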
Advanced SQL Questions
Q5: Find the 2nd highest salary in each department.
Schema:
employees (employee_id, department_id, salary)
Answer:
WITH ranked_salaries AS (
SELECT
department_id,
salary,
DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank
FROM employees
)
SELECT department_id, salary
FROM ranked_salaries
WHERE salary_rank = 2;
Why DENSE_RANK() and not ROW_NUMBER()?
- If 2 employees share the highest salary (₹10L), DENSE_RANK() gives both rank 1 and the next distinct salary gets rank 2 (the true 2nd highest).
- ROW_NUMBER() assigns 1, 2, 3 even across ties, so rank 2 lands on the duplicated top salary and misses the true 2nd highest.
- (RANK() would give 1, 1, 3, skipping rank 2 entirely, so WHERE rank = 2 could return no rows.)
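The three ranking behaviors are easy to demo in pandas, whose `Series.rank` `method` argument mirrors the SQL functions (`'dense'` ≈ DENSE_RANK, `'first'` ≈ ROW_NUMBER, `'min'` ≈ RANK):

```python
import pandas as pd

# Two employees tie for the highest salary; 8L is the true 2nd highest
salaries = pd.Series([1_000_000, 1_000_000, 800_000])

dense = salaries.rank(method='dense', ascending=False)   # 1, 1, 2
row_num = salaries.rank(method='first', ascending=False) # 1, 2, 3
plain = salaries.rank(method='min', ascending=False)     # 1, 1, 3

print(salaries[dense == 2].tolist())    # [800000] -- correct 2nd highest
print(salaries[row_num == 2].tolist())  # [1000000] -- a tied top salary, wrong answer
print(salaries[plain == 2].tolist())    # [] -- RANK skipped rank 2 entirely
```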
Q6: Calculate retention rate (% of Month 0 users who returned in Month 1, Month 2, ...).
Schema:
orders (user_id, order_date)
Answer:
WITH user_cohorts AS (
SELECT
user_id,
DATE_TRUNC('month', MIN(order_date)) AS cohort_month
FROM orders
GROUP BY user_id
),
user_activity AS (
SELECT
uc.cohort_month,
DATE_TRUNC('month', o.order_date) AS activity_month,
COUNT(DISTINCT o.user_id) AS active_users
FROM user_cohorts uc
JOIN orders o ON uc.user_id = o.user_id
GROUP BY uc.cohort_month, activity_month
),
cohort_sizes AS (
SELECT
cohort_month,
COUNT(DISTINCT user_id) AS cohort_size
FROM user_cohorts
GROUP BY cohort_month
)
SELECT
ua.cohort_month,
ua.activity_month,
(EXTRACT(YEAR FROM AGE(ua.activity_month, ua.cohort_month)) * 12 + EXTRACT(MONTH FROM AGE(ua.activity_month, ua.cohort_month))) AS months_since_cohort,
ua.active_users,
cs.cohort_size,
ua.active_users * 100.0 / cs.cohort_size AS retention_rate
FROM user_activity ua
JOIN cohort_sizes cs ON ua.cohort_month = cs.cohort_month
ORDER BY ua.cohort_month, ua.activity_month;
Output:
cohort_month | activity_month | months_since_cohort | active_users | cohort_size | retention_rate
-------------|----------------|---------------------|--------------|-------------|---------------
2024-01-01 | 2024-01-01 | 0 | 1000 | 1000 | 100.0
2024-01-01 | 2024-02-01 | 1 | 600 | 1000 | 60.0
2024-01-01 | 2024-03-01 | 2 | 400 | 1000 | 40.0
Key concept: Cohort analysis using self-joins and window functions.
Q7: Find users who purchased Product A and Product B together in same order (market basket analysis).
Schema:
order_items (order_id, product_id)
Answer:
SELECT DISTINCT o1.order_id
FROM order_items o1
JOIN order_items o2
ON o1.order_id = o2.order_id
AND o1.product_id < o2.product_id
WHERE o1.product_id = 'Product_A'
AND o2.product_id = 'Product_B';
Follow-up: "Find top 10 product pairs most frequently bought together"
SELECT
o1.product_id AS product_a,
o2.product_id AS product_b,
COUNT(DISTINCT o1.order_id) AS times_bought_together
FROM order_items o1
JOIN order_items o2
ON o1.order_id = o2.order_id
AND o1.product_id < o2.product_id
GROUP BY o1.product_id, o2.product_id
ORDER BY times_bought_together DESC
LIMIT 10;
Python Interview Questions
Data Manipulation (pandas)
Q8: Remove duplicate rows from DataFrame.
Question: "You have DataFrame with 100K rows, 5% duplicates. How to remove?"
Answer:
import pandas as pd
# Sample data
df = pd.DataFrame({
'user_id': [1, 2, 2, 3, 4, 4],
'order_date': ['2024-01-01', '2024-01-02', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-04'],
'revenue': [100, 200, 200, 150, 300, 300]
})
# Remove exact duplicates (all columns match)
df_deduped = df.drop_duplicates()
# Remove duplicates based on specific columns (keep first occurrence)
df_deduped = df.drop_duplicates(subset=['user_id', 'order_date'], keep='first')
print(f"Original: {len(df)} rows, Deduped: {len(df_deduped)} rows")
Follow-up: "What if you want to keep the row with highest revenue (not first occurrence)?"
# Sort by revenue descending, then drop duplicates (keeps highest revenue row)
df_sorted = df.sort_values('revenue', ascending=False)
df_deduped = df_sorted.drop_duplicates(subset=['user_id', 'order_date'], keep='first')
Q9: Calculate rolling 7-day average of sales.
Answer:
import pandas as pd
df = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=30),
'sales': [100, 120, 110, 130, 140, 135, 145, 150, 155, 160] * 3
})
# Calculate 7-day rolling average
df['sales_7day_avg'] = df['sales'].rolling(window=7).mean()
# Alternative: Rolling with min_periods (handles first 6 days with <7 data points)
df['sales_7day_avg'] = df['sales'].rolling(window=7, min_periods=1).mean()
print(df.head(10))
Data Analysis Questions
Q10: Find top 10% of customers by revenue (whale customers).
Answer:
import pandas as pd
df = pd.DataFrame({
'customer_id': range(1, 101),
'revenue': [1000 + i*10 for i in range(100)]
})
# Calculate 90th percentile
p90 = df['revenue'].quantile(0.90)
# Filter top 10% customers
whale_customers = df[df['revenue'] >= p90]
print(f"90th percentile: ₹{p90}")
print(f"Whale customers: {len(whale_customers)} ({len(whale_customers)/len(df)*100:.0f}%)")
print(whale_customers.head())
Q11: Pivot table: Revenue by Product Category and Month.
Answer:
import pandas as pd
df = pd.DataFrame({
'order_date': pd.date_range('2024-01-01', periods=100, freq='D'),
'category': ['Electronics', 'Fashion', 'Home'] * 33 + ['Electronics'],
'revenue': [1000 + i*10 for i in range(100)]
})
# Extract month from date
df['month'] = df['order_date'].dt.to_period('M')
# Create pivot table
pivot = df.pivot_table(
values='revenue',
index='month',
columns='category',
aggfunc='sum',
fill_value=0
)
print(pivot)
Output:
category Electronics Fashion Home
month
2024-01 XXX XXX XXX
2024-02 XXX XXX XXX
2024-03 XXX XXX XXX
Statistics Interview Questions
Descriptive Statistics
Q12: Explain the difference between mean, median, and mode. When to use each?
Answer:
- Mean: Average (sum ÷ count). Use when data is normally distributed with no outliers.
  - Example: Average order value = ₹1,500 (most orders are ₹1,000-₹2,000)
- Median: Middle value when sorted. Use when data has outliers.
  - Example: Salary data (CEO earns ₹10 crore, 99 employees earn ₹5L → Mean ≈ ₹15L, Median = ₹5L). Median is more representative.
- Mode: Most frequent value. Use for categorical data.
  - Example: Most common payment method = UPI (50% of transactions)
Follow-up: "Revenue data: [100, 120, 110, 130, 5000]. Should you use mean or median?"
- Mean = (100+120+110+130+5000)/5 = ₹1,092 (distorted by the ₹5,000 outlier)
- Median = ₹120 (middle value, ignores the outlier) → Use median (better represents typical revenue)
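A two-line check with the standard library `statistics` module confirms the arithmetic:

```python
import statistics

revenue = [100, 120, 110, 130, 5000]

mean = statistics.mean(revenue)      # 5460 / 5 = 1092, dragged up by the outlier
median = statistics.median(revenue)  # middle of sorted [100, 110, 120, 130, 5000] = 120

print(mean, median)
```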
Q13: What is p-value? How do you interpret p-value = 0.03?
Answer:
P-value: Probability of observing data (or more extreme) IF null hypothesis is true.
Interpretation:
- p-value = 0.03 means: "If null hypothesis were true (no difference between groups), there's 3% chance we'd see this data by random luck."
- Since 3% < 5% (common threshold), we reject null hypothesis → Difference is statistically significant.
Example:
- Null hypothesis: "Free shipping doesn't affect conversion rate"
- A/B test: Control (no free shipping) = 2.1% conversion, Treatment (free shipping) = 2.5% conversion
- p-value = 0.03 → "Difference (2.5% vs 2.1%) is unlikely due to random chance (only 3% probability) → Free shipping DOES increase conversion (reject null hypothesis)"
Follow-up: "p-value = 0.06. Should you deploy Treatment?"
- p-value = 0.06 > 0.05 (threshold) → Not statistically significant (could be random luck)
- BUT: Consider business context. If Treatment has 20% lift in conversion (huge business impact) and p=0.06 (marginal significance), might still deploy (business judgment > statistical threshold)
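A quick simulation makes the definition concrete: generate many A/B datasets assuming the null is true (both arms share one conversion rate) and count how often a gap as large as the observed one appears by chance alone. The sample size per arm below is an assumption for illustration:

```python
import random

random.seed(42)

n = 5000                # users per arm (illustrative assumption)
null_rate = 0.023       # shared true conversion rate if the null holds
observed_gap = 0.004    # the 2.5% - 2.1% gap from the example

extreme = 0
trials = 1000
for _ in range(trials):
    # Simulate both arms from the SAME rate (i.e., under the null hypothesis)
    a = sum(random.random() < null_rate for _ in range(n)) / n
    b = sum(random.random() < null_rate for _ in range(n)) / n
    if abs(a - b) >= observed_gap:
        extreme += 1

# Share of null-worlds showing a gap at least this big ~ a two-sided p-value
print(extreme / trials)
```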
A/B Testing Questions
Q14: You run A/B test: Control = 2.0% conversion (n=10,000), Treatment = 2.2% conversion (n=10,000). Is this significant?
Answer (using two-proportion z-test):
from statsmodels.stats.proportion import proportions_ztest
# Data
n_control = 10000
n_treatment = 10000
conversions_control = 200 # 2.0% of 10,000
conversions_treatment = 220 # 2.2% of 10,000
# Two-proportion z-test (note: proportions_ztest lives in statsmodels, not scipy.stats)
z_stat, p_value = proportions_ztest(
    [conversions_treatment, conversions_control],
    [n_treatment, n_control]
)
print(f"Z-statistic: {z_stat:.2f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("Statistically significant (reject null hypothesis)")
else:
    print("Not statistically significant (fail to reject null hypothesis)")
Result: p-value ≈ 0.32 (two-sided, > 0.05) → Not statistically significant. A 10% relative lift looks promising, but this sample size can't rule out random noise.
Follow-up: "How many users needed to detect 10% lift with 95% confidence, 80% power?"
from statsmodels.stats.power import zt_ind_solve_power
from statsmodels.stats.proportion import proportion_effectsize
# Standardized effect size (Cohen's h) for 2.0% vs 2.2% -- a tiny absolute difference
effect_size = proportion_effectsize(0.022, 0.020)  # ~0.014
# Solve for sample size per group
sample_size = zt_ind_solve_power(
    effect_size=effect_size,
    alpha=0.05, # Significance level
    power=0.80  # Statistical power
)
print(f"Sample size per group: {sample_size:.0f}")
Result: roughly 80,000 users per group (~160,000 total) to detect a 10% relative lift (2.0% → 2.2%) with 80% power. Small absolute differences demand very large samples.
Case Study Interview Questions
Case study interviews test: (1) Business thinking, (2) Analytical approach, (3) Communication.
Case Study 1: Sales Drop Investigation
Question: "Our e-commerce sales dropped 20% last month. How would you investigate?"
Strong answer (structured approach):
1. Clarify the question (ask before diving in):
- Is 20% drop in revenue or number of orders?
- Is drop across all categories or specific category (Electronics, Fashion)?
- Is drop across all channels (website, app, offline) or specific channel?
- Is drop across all geographies or specific region (North India, South India)?
2. Propose hypotheses (potential root causes):
- Demand-side issues (customer behavior changed):
  - Competitor launched sale (users shifted to Amazon/Myntra)
  - Seasonality (festive season ended → organic drop)
  - External event (economic downturn, election uncertainty)
- Supply-side issues (our platform changed):
  - Website bug (checkout broken, payment gateway issues)
  - Pricing change (increased prices → reduced conversion)
  - Marketing spend reduced (fewer paid ads → less traffic)
3. Data analysis approach (how to test hypotheses):
Step 1: Segment analysis (where is drop concentrated?)
- Revenue by category: Is drop in all categories or just Electronics?
- Revenue by channel: Is drop in app or website?
- Revenue by region: Is drop in specific cities (Bangalore, Delhi)?
Step 2: Funnel analysis (where are users dropping off?)
- Homepage visits → Product views → Add to cart → Checkout → Payment → Order
- Compare last month vs previous month (where is the biggest drop?)
Step 3: Cohort analysis (is drop in new users or existing users?)
- New users: Reduced marketing → fewer acquisitions
- Existing users: Product quality issues → churn
Step 4: External factors (competitor analysis, Google Trends)
- Did competitor launch sale during same period?
- Google Trends: Search volume for "online shopping" (demand unchanged → issue is internal)
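Step 2's month-over-month funnel comparison can be sketched in pandas; the stage counts below are invented for illustration, not real data:

```python
import pandas as pd

# Hypothetical funnel counts for two months (illustrative numbers)
funnel = pd.DataFrame({
    'stage': ['Homepage', 'Product view', 'Add to cart', 'Checkout', 'Payment', 'Order'],
    'prev_month': [100000, 40000, 12000, 8000, 7000, 6500],
    'last_month': [98000, 39000, 11500, 7600, 4200, 3900],
})

# Step-to-step conversion rate into each stage, for each month
funnel['prev_rate'] = funnel['prev_month'] / funnel['prev_month'].shift(1)
funnel['last_rate'] = funnel['last_month'] / funnel['last_month'].shift(1)
funnel['rate_change'] = funnel['last_rate'] - funnel['prev_rate']

# The largest negative rate_change flags where users are newly dropping off
worst = funnel.loc[funnel['rate_change'].idxmin(), 'stage']
print(funnel[['stage', 'prev_rate', 'last_rate', 'rate_change']])
print("Biggest new drop-off at:", worst)
```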
4. Recommendation (based on findings):
- If drop due to website bug: Fix checkout flow (highest priority), recover lost orders with discount code
- If drop due to competitor sale: Launch counter-promotion (flash sale, free shipping)
- If drop due to seasonality: Accept drop (natural cycle), plan for next festive season
What interviewer looks for:
- ✅ Structured thinking (clarify → hypothesize → test → recommend)
- ✅ Business context (considers external factors, not just data)
- ✅ Actionable recommendations (specific next steps, not vague "investigate more")
Case Study 2: Product Recommendation
Question: "Design metrics to measure success of product recommendation engine."
Strong answer:
1. Primary metrics (direct impact):
- Click-through rate (CTR): % of users who click recommended products
- Target: 5-8% (baseline: 2% for non-personalized recommendations)
- Conversion rate: % of users who purchase recommended products
- Target: 1.5-2.5% (baseline: 0.8% for non-personalized)
- Revenue from recommendations: $ generated from recommended products
- Target: 30-40% of total revenue attributed to recommendations
2. Secondary metrics (user engagement):
- Average order value (AOV): Do recommendations increase basket size?
- Target: +15-20% AOV for users who click recommendations
- Session duration: Do recommendations keep users on site longer?
- Target: +10-15% session duration
3. Guardrail metrics (don't hurt user experience):
- Return rate: Are recommended products returned more often? (quality check)
- Target: Return rate ≤ overall return rate (recommendation shouldn't push low-quality products)
- Customer satisfaction (NPS): Do users like recommendations?
- Target: NPS ≥ 50 for users who interacted with recommendations
4. Long-term metrics (retention):
- Repeat purchase rate: Do recommendations drive retention?
- Target: Users who buy recommended products have 20-30% higher 30-day retention
5. Measurement approach:
- A/B test: Control (no recommendations) vs Treatment (personalized recommendations)
- Cohort analysis: Compare users who interacted with recommendations vs didn't
- Attribution: Track revenue source (organic search, recommendations, email)
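A sketch of computing the primary metrics from a raw widget event log; the table layout, column names, and numbers are assumptions for illustration, not a real schema:

```python
import pandas as pd

# Hypothetical recommendation-widget event log
events = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3, 4, 5],
    'event':   ['impression', 'click', 'impression', 'impression', 'click',
                'impression', 'impression'],
})
# Purchases of recommended products (also hypothetical)
purchases = pd.DataFrame({
    'user_id': [1, 3],
    'revenue': [1200, 800],
})

impressions = events.loc[events['event'] == 'impression', 'user_id'].nunique()
clicks = events.loc[events['event'] == 'click', 'user_id'].nunique()

ctr = clicks / impressions                 # users who clicked / users who saw recs
conversion = len(purchases) / impressions  # buyers of recommended items / users who saw recs
rec_revenue = purchases['revenue'].sum()

print(f"CTR: {ctr:.1%}, Conversion: {conversion:.1%}, Rec revenue: {rec_revenue}")
```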
What interviewer looks for:
- ✅ Balanced scorecard (not just revenue → also considers UX, retention)
- ✅ Trade-offs (maximizing CTR might hurt conversion if recommendations are irrelevant)
- ✅ Practical measurement (how to actually track metrics in production)
Behavioral Interview Questions
Behavioral questions assess: (1) Past experience, (2) Problem-solving, (3) Culture fit.
Use STAR method: Situation, Task, Action, Result
Q15: "Tell me about a time you used data to influence a business decision."
Weak answer: "I analyzed sales data and found insights. I shared them with the team and they implemented my recommendations." → Vague (no specifics: what insights? what recommendations? what impact?)
Strong answer (STAR method):
Situation: "At Flipkart, our Grocery vertical's cart abandonment rate was 45% (industry avg: 30%). Leadership wanted to reduce abandonment to increase revenue."
Task: "I was tasked with identifying top abandonment reasons and recommending solutions. Goal: Reduce abandonment to 35% within Q4 2024."
Action: "I performed funnel analysis using SQL, identifying that 40% of abandonment happened at the payment step (users clicked 'Place Order' but didn't complete payment). I hypothesized: payment gateway timeout (UPI sessions expire after 5 mins, users take 8-10 mins to complete transaction). I ran an A/B test: Control (5-min timeout) vs Treatment (auto-retry with 10-min extended window). Results: Treatment reduced abandonment from 45% → 38%."
Result: "I presented findings to VP Product with recommendation: Deploy Treatment + add 'Save for Later' button at payment step (users can resume transaction). Implementation reduced abandonment to 33% (better than 35% goal), adding ₹8 crore annual revenue. Promoted to Senior Analyst for driving measurable business impact."
Why strong:
- ✅ Quantified problem (45% abandonment, ₹X revenue at stake)
- ✅ Clear methodology (funnel analysis, hypothesis, A/B test)
- ✅ Actionable recommendation (deploy Treatment, add feature)
- ✅ Quantified impact (33% abandonment, ₹8 crore revenue)
- ✅ Shows ownership (led initiative, influenced VP decision, got promoted)
Q16: "Tell me about a time you disagreed with a stakeholder."
Strong answer:
Situation: "At Swiggy, Marketing team wanted to launch ₹50 discount coupon for all users to boost orders during slow month (Feb 2024)."
Task: "I disagreed because data showed: (1) 80% of users order without coupons (discount would cannibalize existing revenue), (2) Discount attracts low-value users (single order, then churn). I needed to convince Marketing VP to change strategy without damaging relationship."
Action: "I scheduled a 1:1 with the Marketing VP and presented the analysis: 'A blanket ₹50 discount would cost roughly ₹2 crore in coupon spend over the month, but only lift orders by ~10% (4K additional orders/day × ₹150 AOV × 30 days ≈ ₹1.8 crore revenue) → Net loss ≈ ₹0.2 crore. Instead, I proposed a targeted discount: (1) Inactive users (no order in the last 30 days) → ₹50 win-back coupon, (2) Active users → no discount (they already order without one). Estimated cost: ₹0.5 crore, estimated incremental revenue: ₹0.8 crore → Net gain ≈ ₹0.3 crore.'"
Result: "Marketing VP agreed to targeted approach. Feb campaign generated ₹0.3 crore net revenue (vs ₹0.2 crore loss with blanket discount). VP credited me in company-wide email for 'data-driven thinking.' Built trust with Marketing team, now consulted on all campaign decisions."
Why strong:
- ✅ Shows respect (didn't say "Marketing is wrong" → presented data respectfully)
- ✅ Quantified trade-offs (blanket discount loses ₹0.2 crore vs targeted gains ₹0.3 crore)
- ✅ Proposed alternative (not just criticism → offered a better solution)
- ✅ Positive outcome (stakeholder agreed, revenue increased, relationship strengthened)
Common Behavioral Questions (Quick Answers)
Q: Why do you want to work here?
- Research company (mission, products, culture)
- "I'm excited about Flipkart's mission to democratize commerce in India. I love analyzing e-commerce data (my portfolio project: Zomato analysis with 50K orders). I want to work on large-scale problems (450M users) and learn from strong analytics team."
Q: What's your greatest weakness?
- Honest but show growth: "Early in career, I focused too much on technical perfection (spending 3 days optimizing SQL query for 5% performance gain). I learned: Business impact > Technical perfection. Now I prioritize: Is this analysis driving decision (high impact)? If yes, invest time; if no, use quick-and-dirty approach (80/20 rule)."
Q: Where do you see yourself in 5 years?
- "In 2-3 years: Senior Data Analyst leading projects, mentoring junior analysts. In 4-5 years: Analytics Manager managing team of 5-8 analysts, defining analytics strategy for business unit (e.g., Grocery vertical). Long-term: Director of Analytics, shaping company-wide data culture."