Interview Stages: What to Expect
Typical Data Analyst Interview Process
Stage 1: HR Screening (30 min phone call)
- Background, motivation, salary expectations
- Basic technical questions ("What's your experience with SQL?")
- Outcome: 60% pass → Move to Stage 2
Stage 2: SQL/Technical Test (60-90 min)
- Live coding (write SQL queries on screen) or take-home (48-72 hours)
- 3-5 problems: Joins, aggregations, window functions
- Outcome: 40% pass → Move to Stage 3
Stage 3: Take-Home Assignment (1 week)
- Real business problem: "Analyze this dataset, create dashboard, present insights"
- Deliverable: Jupyter notebook + Tableau dashboard + presentation (PDF/PPT)
- Outcome: 50% pass → Move to Stage 4
Stage 4: Case Study Interview (60-90 min)
- Interviewer gives business problem: "Our sales dropped 20% last month. How would you investigate?"
- You: Ask clarifying questions → Propose analysis approach → Walk through methodology
- Outcome: 60% pass → Move to Stage 5
Stage 5: Behavioral + Culture Fit (30-60 min)
- "Tell me about a time you disagreed with a stakeholder"
- "Why do you want to work here?"
- Outcome: 70% pass → Offer
Total timeline: 2-6 weeks (varies by company)
Success rates (cumulative): 60% × 40% × 50% × 60% × 70% ≈ 5% overall (roughly 1 offer per 20 screened applicants). Prepare thoroughly for each stage to beat the odds.
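Multiplying the per-stage rates is a quick sanity check, a one-liner in Python:

```python
from functools import reduce

# Per-stage pass rates from the pipeline above
stage_pass_rates = [0.60, 0.40, 0.50, 0.60, 0.70]

# Chance of clearing every stage = product of the per-stage rates
overall = reduce(lambda acc, rate: acc * rate, stage_pass_rates)
print(f"{overall:.1%}")  # 5.0%
```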
SQL Interview Questions (Beginner to Advanced)
Beginner SQL Questions
Q1: Write a query to find the top 5 products by total revenue.
Schema:
orders (order_id, product_id, quantity, price, order_date)
Answer:
SELECT
product_id,
SUM(quantity * price) AS total_revenue
FROM orders
GROUP BY product_id
ORDER BY total_revenue DESC
LIMIT 5;
Follow-up: "What if you want top 5 products per category?"
WITH product_revenue AS (
SELECT
category_id,
product_id,
SUM(quantity * price) AS total_revenue
FROM orders
JOIN products USING (product_id)
GROUP BY category_id, product_id
),
ranked AS (
SELECT
category_id,
product_id,
total_revenue,
ROW_NUMBER() OVER (PARTITION BY category_id ORDER BY total_revenue DESC) AS revenue_rank
FROM product_revenue
)
SELECT category_id, product_id, total_revenue
FROM ranked
WHERE revenue_rank <= 5;
Q2: Calculate month-over-month growth rate for revenue.
Schema:
orders (order_id, revenue, order_date)
Answer:
WITH monthly_revenue AS (
SELECT
DATE_TRUNC('month', order_date) AS month,
SUM(revenue) AS total_revenue
FROM orders
GROUP BY month
)
SELECT
month,
total_revenue,
LAG(total_revenue) OVER (ORDER BY month) AS prev_month_revenue,
(total_revenue - LAG(total_revenue) OVER (ORDER BY month)) * 100.0 /
LAG(total_revenue) OVER (ORDER BY month) AS growth_rate
FROM monthly_revenue;
Key concept: Use the LAG() window function to access the previous row's value.
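If the interviewer lets you use pandas instead of SQL, `pct_change` collapses the LAG-and-divide pattern into a single call. The tiny DataFrame below is made-up illustrative data:

```python
import pandas as pd

# Illustrative monthly totals
monthly = pd.DataFrame({
    'month': ['2024-01', '2024-02', '2024-03'],
    'total_revenue': [1000.0, 1200.0, 1080.0],
})

# pct_change() = (current - previous) / previous, mirroring LAG() in SQL
monthly['growth_rate'] = monthly['total_revenue'].pct_change() * 100
print(monthly)  # growth_rate: NaN, 20.0, -10.0
```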
Intermediate SQL Questions
Q3: Find users who made purchases in Jan 2024 but NOT in Feb 2024 (churned users).
Schema:
orders (order_id, user_id, order_date)
Answer:
SELECT DISTINCT user_id
FROM orders
WHERE DATE_TRUNC('month', order_date) = '2024-01-01'
AND user_id NOT IN (
SELECT DISTINCT user_id
FROM orders
WHERE DATE_TRUNC('month', order_date) = '2024-02-01'
);
Better approach (avoids NOT IN's NULL pitfalls):
SELECT DISTINCT jan.user_id
FROM orders jan
LEFT JOIN orders feb
ON jan.user_id = feb.user_id
AND DATE_TRUNC('month', feb.order_date) = '2024-02-01'
WHERE DATE_TRUNC('month', jan.order_date) = '2024-01-01'
AND feb.user_id IS NULL;
Q4: Calculate cumulative sum of revenue by day.
Answer:
SELECT
order_date,
SUM(revenue) AS daily_revenue,
SUM(SUM(revenue)) OVER (ORDER BY order_date) AS cumulative_revenue
FROM orders
GROUP BY order_date
ORDER BY order_date;
Key concept: The window function SUM() OVER (ORDER BY ...) computes a running total.
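The same running total in pandas, for interviews where tooling is your choice (column names here are assumptions):

```python
import pandas as pd

df = pd.DataFrame({
    'order_date': ['2024-01-01', '2024-01-01', '2024-01-02', '2024-01-03'],
    'revenue': [100, 50, 200, 25],
})

# Aggregate to daily revenue, then take a cumulative sum down the rows
daily = df.groupby('order_date', as_index=False)['revenue'].sum()
daily = daily.rename(columns={'revenue': 'daily_revenue'})
daily['cumulative_revenue'] = daily['daily_revenue'].cumsum()
print(daily)  # daily: 150, 200, 25 -> cumulative: 150, 350, 375
```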
Advanced SQL Questions
Q5: Find the 2nd highest salary in each department.
Schema:
employees (employee_id, department_id, salary)
Answer:
WITH ranked_salaries AS (
SELECT
department_id,
salary,
DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank
FROM employees
)
SELECT department_id, salary
FROM ranked_salaries
WHERE salary_rank = 2;
Why DENSE_RANK() and not ROW_NUMBER()?
- If 2 employees share the highest salary (₹10L), DENSE_RANK() gives both rank 1 and the next distinct salary gets rank 2 (the true 2nd highest).
- ROW_NUMBER() assigns 1, 2, 3 even across ties, so rank 2 lands on the duplicated top salary and misses the true 2nd highest.
- (RANK() would give 1, 1, 3, skipping rank 2 entirely, so WHERE rank = 2 could return no rows.)
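The three ranking behaviors are easy to demo in pandas, whose `Series.rank` `method` argument mirrors the SQL functions (`'dense'` ≈ DENSE_RANK, `'first'` ≈ ROW_NUMBER, `'min'` ≈ RANK):

```python
import pandas as pd

# Two employees tie for the highest salary; 8L is the true 2nd highest
salaries = pd.Series([1_000_000, 1_000_000, 800_000])

dense = salaries.rank(method='dense', ascending=False)   # 1, 1, 2
row_num = salaries.rank(method='first', ascending=False) # 1, 2, 3
plain = salaries.rank(method='min', ascending=False)     # 1, 1, 3

print(salaries[dense == 2].tolist())    # [800000] -- correct 2nd highest
print(salaries[row_num == 2].tolist())  # [1000000] -- a tied top salary, wrong answer
print(salaries[plain == 2].tolist())    # [] -- RANK skipped rank 2 entirely
```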
Q6: Calculate retention rate (% of Month 0 users who returned in Month 1, Month 2, ...).
Schema:
orders (user_id, order_date)
Answer:
WITH user_cohorts AS (
SELECT
user_id,
DATE_TRUNC('month', MIN(order_date)) AS cohort_month
FROM orders
GROUP BY user_id
),
user_activity AS (
SELECT
uc.cohort_month,
DATE_TRUNC('month', o.order_date) AS activity_month,
COUNT(DISTINCT o.user_id) AS active_users
FROM user_cohorts uc
JOIN orders o ON uc.user_id = o.user_id
GROUP BY uc.cohort_month, activity_month
),
cohort_sizes AS (
SELECT
cohort_month,
COUNT(DISTINCT user_id) AS cohort_size
FROM user_cohorts
GROUP BY cohort_month
)
SELECT
ua.cohort_month,
ua.activity_month,
(EXTRACT(YEAR FROM AGE(ua.activity_month, ua.cohort_month)) * 12 + EXTRACT(MONTH FROM AGE(ua.activity_month, ua.cohort_month))) AS months_since_cohort,
ua.active_users,
cs.cohort_size,
ua.active_users * 100.0 / cs.cohort_size AS retention_rate
FROM user_activity ua
JOIN cohort_sizes cs ON ua.cohort_month = cs.cohort_month
ORDER BY ua.cohort_month, ua.activity_month;
Output:
cohort_month | activity_month | months_since_cohort | active_users | cohort_size | retention_rate
-------------|----------------|---------------------|--------------|-------------|---------------
2024-01-01 | 2024-01-01 | 0 | 1000 | 1000 | 100.0
2024-01-01 | 2024-02-01 | 1 | 600 | 1000 | 60.0
2024-01-01 | 2024-03-01 | 2 | 400 | 1000 | 40.0
Key concept: Cohort analysis using self-joins and window functions.
Q7: Find users who purchased Product A and Product B together in same order (market basket analysis).
Schema:
order_items (order_id, product_id)
Answer:
SELECT DISTINCT o1.order_id
FROM order_items o1
JOIN order_items o2
ON o1.order_id = o2.order_id
AND o1.product_id < o2.product_id
WHERE o1.product_id = 'Product_A'
AND o2.product_id = 'Product_B';
Follow-up: "Find top 10 product pairs most frequently bought together"
SELECT
o1.product_id AS product_a,
o2.product_id AS product_b,
COUNT(DISTINCT o1.order_id) AS times_bought_together
FROM order_items o1
JOIN order_items o2
ON o1.order_id = o2.order_id
AND o1.product_id < o2.product_id
GROUP BY o1.product_id, o2.product_id
ORDER BY times_bought_together DESC
LIMIT 10;
Python Interview Questions
Data Manipulation (pandas)
Q8: Remove duplicate rows from DataFrame.
Question: "You have DataFrame with 100K rows, 5% duplicates. How to remove?"
Answer:
import pandas as pd
# Sample data
df = pd.DataFrame({
'user_id': [1, 2, 2, 3, 4, 4],
'order_date': ['2024-01-01', '2024-01-02', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-04'],
'revenue': [100, 200, 200, 150, 300, 300]
})
# Remove exact duplicates (all columns match)
df_deduped = df.drop_duplicates()
# Remove duplicates based on specific columns (keep first occurrence)
df_deduped = df.drop_duplicates(subset=['user_id', 'order_date'], keep='first')
print(f"Original: {len(df)} rows, Deduped: {len(df_deduped)} rows")
Follow-up: "What if you want to keep the row with highest revenue (not first occurrence)?"
# Sort by revenue descending, then drop duplicates (keeps highest revenue row)
df_sorted = df.sort_values('revenue', ascending=False)
df_deduped = df_sorted.drop_duplicates(subset=['user_id', 'order_date'], keep='first')
Q9: Calculate rolling 7-day average of sales.
Answer:
import pandas as pd
df = pd.DataFrame({
'date': pd.date_range('2024-01-01', periods=30),
'sales': [100, 120, 110, 130, 140, 135, 145, 150, 155, 160] * 3
})
# Calculate 7-day rolling average
df['sales_7day_avg'] = df['sales'].rolling(window=7).mean()
# Alternative: Rolling with min_periods (handles first 6 days with <7 data points)
df['sales_7day_avg'] = df['sales'].rolling(window=7, min_periods=1).mean()
print(df.head(10))
Data Analysis Questions
Q10: Find top 10% of customers by revenue (whale customers).
Answer:
import pandas as pd
df = pd.DataFrame({
'customer_id': range(1, 101),
'revenue': [1000 + i*10 for i in range(100)]
})
# Calculate 90th percentile
p90 = df['revenue'].quantile(0.90)
# Filter top 10% customers
whale_customers = df[df['revenue'] >= p90]
print(f"90th percentile: ₹{p90}")
print(f"Whale customers: {len(whale_customers)} ({len(whale_customers)/len(df)*100:.0f}%)")
print(whale_customers.head())
Q11: Pivot table: Revenue by Product Category and Month.
Answer:
import pandas as pd
df = pd.DataFrame({
'order_date': pd.date_range('2024-01-01', periods=100, freq='D'),
'category': ['Electronics', 'Fashion', 'Home'] * 33 + ['Electronics'],
'revenue': [1000 + i*10 for i in range(100)]
})
# Extract month from date
df['month'] = df['order_date'].dt.to_period('M')
# Create pivot table
pivot = df.pivot_table(
values='revenue',
index='month',
columns='category',
aggfunc='sum',
fill_value=0
)
print(pivot)
Output:
category Electronics Fashion Home
month
2024-01 XXX XXX XXX
2024-02 XXX XXX XXX
2024-03 XXX XXX XXX
Statistics Interview Questions
Descriptive Statistics
Q12: Explain the difference between mean, median, and mode. When to use each?
Answer:
- Mean: Average (sum ÷ count). Use when data is normally distributed with no outliers.
  - Example: Average order value = ₹1,500 (most orders are ₹1,000-₹2,000)
- Median: Middle value when sorted. Use when data has outliers.
  - Example: Salary data (CEO earns ₹10 crore, 99 employees earn ₹5L → Mean ≈ ₹15L, Median = ₹5L). Median is more representative.
- Mode: Most frequent value. Use for categorical data.
  - Example: Most common payment method = UPI (50% of transactions)
Follow-up: "Revenue data: [100, 120, 110, 130, 5000]. Should you use mean or median?"
- Mean = (100+120+110+130+5000)/5 = ₹1,092 (distorted by the ₹5,000 outlier)
- Median = ₹120 (middle value, ignores the outlier) → Use median (better represents typical revenue)
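A two-line check with the standard library `statistics` module confirms the arithmetic:

```python
import statistics

revenue = [100, 120, 110, 130, 5000]

mean = statistics.mean(revenue)      # 5460 / 5 = 1092, dragged up by the outlier
median = statistics.median(revenue)  # middle of sorted [100, 110, 120, 130, 5000] = 120

print(mean, median)
```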
Q13: What is p-value? How do you interpret p-value = 0.03?
Answer:
P-value: Probability of observing data (or more extreme) IF null hypothesis is true.
Interpretation:
- p-value = 0.03 means: "If null hypothesis were true (no difference between groups), there's 3% chance we'd see this data by random luck."
- Since 3% < 5% (common threshold), we reject null hypothesis → Difference is statistically significant.
Example:
- Null hypothesis: "Free shipping doesn't affect conversion rate"
- A/B test: Control (no free shipping) = 2.1% conversion, Treatment (free shipping) = 2.5% conversion
- p-value = 0.03 → "Difference (2.5% vs 2.1%) is unlikely due to random chance (only 3% probability) → Free shipping DOES increase conversion (reject null hypothesis)"
Follow-up: "p-value = 0.06. Should you deploy Treatment?"
- p-value = 0.06 > 0.05 (threshold) → Not statistically significant (could be random luck)
- BUT: Consider business context. If Treatment has 20% lift in conversion (huge business impact) and p=0.06 (marginal significance), might still deploy (business judgment > statistical threshold)
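A quick simulation makes the definition concrete: generate many A/B datasets assuming the null is true (both arms share one conversion rate) and count how often a gap as large as the observed one appears by chance alone. The sample size per arm below is an assumption for illustration:

```python
import random

random.seed(42)

n = 5000                # users per arm (illustrative assumption)
null_rate = 0.023       # shared true conversion rate if the null holds
observed_gap = 0.004    # the 2.5% - 2.1% gap from the example

extreme = 0
trials = 1000
for _ in range(trials):
    # Simulate both arms from the SAME rate (i.e., under the null hypothesis)
    a = sum(random.random() < null_rate for _ in range(n)) / n
    b = sum(random.random() < null_rate for _ in range(n)) / n
    if abs(a - b) >= observed_gap:
        extreme += 1

# Share of null-worlds showing a gap at least this big ~ a two-sided p-value
print(extreme / trials)
```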
A/B Testing Questions
Q14: You run A/B test: Control = 2.0% conversion (n=10,000), Treatment = 2.2% conversion (n=10,000). Is this significant?
Answer (using two-proportion z-test):
from statsmodels.stats.proportion import proportions_ztest
# Data
n_control = 10000
n_treatment = 10000
conversions_control = 200 # 2.0% of 10,000
conversions_treatment = 220 # 2.2% of 10,000
# Two-proportion z-test (note: proportions_ztest lives in statsmodels, not scipy.stats)
z_stat, p_value = proportions_ztest(
    [conversions_treatment, conversions_control],
    [n_treatment, n_control]
)
print(f"Z-statistic: {z_stat:.2f}")
print(f"P-value: {p_value:.4f}")
if p_value < 0.05:
    print("Statistically significant (reject null hypothesis)")
else:
    print("Not statistically significant (fail to reject null hypothesis)")
Result: p-value ≈ 0.32 (two-sided, > 0.05) → Not statistically significant. A 10% relative lift looks promising, but this sample size can't rule out random noise.
Follow-up: "How many users needed to detect 10% lift with 95% confidence, 80% power?"
from statsmodels.stats.power import zt_ind_solve_power
from statsmodels.stats.proportion import proportion_effectsize
# Standardized effect size (Cohen's h) for 2.0% vs 2.2% -- a tiny absolute difference
effect_size = proportion_effectsize(0.022, 0.020)  # ~0.014
# Solve for sample size per group
sample_size = zt_ind_solve_power(
    effect_size=effect_size,
    alpha=0.05, # Significance level
    power=0.80  # Statistical power
)
print(f"Sample size per group: {sample_size:.0f}")
Result: roughly 80,000 users per group (~160,000 total) to detect a 10% relative lift (2.0% → 2.2%) with 80% power. Small absolute differences demand very large samples.
Case Study Interview Questions
Case study interviews test: (1) Business thinking, (2) Analytical approach, (3) Communication.
Case Study 1: Sales Drop Investigation
Question: "Our e-commerce sales dropped 20% last month. How would you investigate?"
Strong answer (structured approach):
1. Clarify the question (ask before diving in):
- Is 20% drop in revenue or number of orders?
- Is drop across all categories or specific category (Electronics, Fashion)?
- Is drop across all channels (website, app, offline) or specific channel?
- Is drop across all geographies or specific region (North India, South India)?
2. Propose hypotheses (potential root causes):
- Demand-side issues (customer behavior changed):
  - Competitor launched sale (users shifted to Amazon/Myntra)
  - Seasonality (festive season ended → organic drop)
  - External event (economic downturn, election uncertainty)
- Supply-side issues (our platform changed):
  - Website bug (checkout broken, payment gateway issues)
  - Pricing change (increased prices → reduced conversion)
  - Marketing spend reduced (fewer paid ads → less traffic)
3. Data analysis approach (how to test hypotheses):
Step 1: Segment analysis (where is drop concentrated?)
- Revenue by category: Is drop in all categories or just Electronics?
- Revenue by channel: Is drop in app or website?
- Revenue by region: Is drop in specific cities (Bangalore, Delhi)?
Step 2: Funnel analysis (where are users dropping off?)
- Homepage visits → Product views → Add to cart → Checkout → Payment → Order
- Compare last month vs previous month (where is the biggest drop?)
Step 3: Cohort analysis (is drop in new users or existing users?)
- New users: Reduced marketing → fewer acquisitions
- Existing users: Product quality issues → churn
Step 4: External factors (competitor analysis, Google Trends)
- Did competitor launch sale during same period?
- Google Trends: Search volume for "online shopping" (demand unchanged → issue is internal)
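Step 2's month-over-month funnel comparison can be sketched in pandas; the stage counts below are invented for illustration, not real data:

```python
import pandas as pd

# Hypothetical funnel counts for two months (illustrative numbers)
funnel = pd.DataFrame({
    'stage': ['Homepage', 'Product view', 'Add to cart', 'Checkout', 'Payment', 'Order'],
    'prev_month': [100000, 40000, 12000, 8000, 7000, 6500],
    'last_month': [98000, 39000, 11500, 7600, 4200, 3900],
})

# Step-to-step conversion rate into each stage, for each month
funnel['prev_rate'] = funnel['prev_month'] / funnel['prev_month'].shift(1)
funnel['last_rate'] = funnel['last_month'] / funnel['last_month'].shift(1)
funnel['rate_change'] = funnel['last_rate'] - funnel['prev_rate']

# The largest negative rate_change flags where users are newly dropping off
worst = funnel.loc[funnel['rate_change'].idxmin(), 'stage']
print(funnel[['stage', 'prev_rate', 'last_rate', 'rate_change']])
print("Biggest new drop-off at:", worst)
```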
4. Recommendation (based on findings):
- If drop due to website bug: Fix checkout flow (highest priority), recover lost orders with discount code
- If drop due to competitor sale: Launch counter-promotion (flash sale, free shipping)
- If drop due to seasonality: Accept drop (natural cycle), plan for next festive season
What interviewer looks for:
- ✅ Structured thinking (clarify → hypothesize → test → recommend)
- ✅ Business context (considers external factors, not just data)
- ✅ Actionable recommendations (specific next steps, not vague "investigate more")
Case Study 2: Product Recommendation
Question: "Design metrics to measure success of product recommendation engine."
Strong answer:
1. Primary metrics (direct impact):
- Click-through rate (CTR): % of users who click recommended products
- Target: 5-8% (baseline: 2% for non-personalized recommendations)
- Conversion rate: % of users who purchase recommended products
- Target: 1.5-2.5% (baseline: 0.8% for non-personalized)
- Revenue from recommendations: $ generated from recommended products
- Target: 30-40% of total revenue attributed to recommendations
2. Secondary metrics (user engagement):
- Average order value (AOV): Do recommendations increase basket size?
- Target: +15-20% AOV for users who click recommendations
- Session duration: Do recommendations keep users on site longer?
- Target: +10-15% session duration
3. Guardrail metrics (don't hurt user experience):
- Return rate: Are recommended products returned more often? (quality check)
- Target: Return rate ≤ overall return rate (recommendation shouldn't push low-quality products)
- Customer satisfaction (NPS): Do users like recommendations?
- Target: NPS ≥ 50 for users who interacted with recommendations
4. Long-term metrics (retention):
- Repeat purchase rate: Do recommendations drive retention?
- Target: Users who buy recommended products have 20-30% higher 30-day retention
5. Measurement approach:
- A/B test: Control (no recommendations) vs Treatment (personalized recommendations)
- Cohort analysis: Compare users who interacted with recommendations vs didn't
- Attribution: Track revenue source (organic search, recommendations, email)
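A sketch of computing the primary metrics from a raw widget event log; the table layout, column names, and numbers are assumptions for illustration, not a real schema:

```python
import pandas as pd

# Hypothetical recommendation-widget event log
events = pd.DataFrame({
    'user_id': [1, 1, 2, 3, 3, 4, 5],
    'event':   ['impression', 'click', 'impression', 'impression', 'click',
                'impression', 'impression'],
})
# Purchases of recommended products (also hypothetical)
purchases = pd.DataFrame({
    'user_id': [1, 3],
    'revenue': [1200, 800],
})

impressions = events.loc[events['event'] == 'impression', 'user_id'].nunique()
clicks = events.loc[events['event'] == 'click', 'user_id'].nunique()

ctr = clicks / impressions                 # users who clicked / users who saw recs
conversion = len(purchases) / impressions  # buyers of recommended items / users who saw recs
rec_revenue = purchases['revenue'].sum()

print(f"CTR: {ctr:.1%}, Conversion: {conversion:.1%}, Rec revenue: {rec_revenue}")
```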
What interviewer looks for:
- ✅ Balanced scorecard (not just revenue → also considers UX, retention)
- ✅ Trade-offs (maximizing CTR might hurt conversion if recommendations are irrelevant)
- ✅ Practical measurement (how to actually track metrics in production)
Behavioral Interview Questions
Behavioral questions assess: (1) Past experience, (2) Problem-solving, (3) Culture fit.
Use STAR method: Situation, Task, Action, Result
Q15: "Tell me about a time you used data to influence a business decision."
Weak answer: "I analyzed sales data and found insights. I shared them with the team and they implemented my recommendations." → Vague (no specifics: what insights? what recommendations? what impact?)
Strong answer (STAR method):
Situation: "At Flipkart, our Grocery vertical's cart abandonment rate was 45% (industry avg: 30%). Leadership wanted to reduce abandonment to increase revenue."
Task: "I was tasked with identifying top abandonment reasons and recommending solutions. Goal: Reduce abandonment to 35% within Q4 2024."
Action: "I performed funnel analysis using SQL, identifying that 40% of abandonment happened at the payment step (users clicked 'Place Order' but didn't complete payment). I hypothesized: payment gateway timeout (UPI sessions expire after 5 mins, users take 8-10 mins to complete transaction). I ran an A/B test: Control (5-min timeout) vs Treatment (auto-retry with 10-min extended window). Results: Treatment reduced abandonment from 45% → 38%."
Result: "I presented findings to VP Product with recommendation: Deploy Treatment + add 'Save for Later' button at payment step (users can resume transaction). Implementation reduced abandonment to 33% (better than 35% goal), adding ₹8 crore annual revenue. Promoted to Senior Analyst for driving measurable business impact."
Why strong:
- ✅ Quantified problem (45% abandonment, ₹X revenue at stake)
- ✅ Clear methodology (funnel analysis, hypothesis, A/B test)
- ✅ Actionable recommendation (deploy Treatment, add feature)
- ✅ Quantified impact (33% abandonment, ₹8 crore revenue)
- ✅ Shows ownership (led initiative, influenced VP decision, got promoted)
Q16: "Tell me about a time you disagreed with a stakeholder."
Strong answer:
Situation: "At Swiggy, Marketing team wanted to launch ₹50 discount coupon for all users to boost orders during slow month (Feb 2024)."
Task: "I disagreed because data showed: (1) 80% of users order without coupons (discount would cannibalize existing revenue), (2) Discount attracts low-value users (single order, then churn). I needed to convince Marketing VP to change strategy without damaging relationship."
Action: "I scheduled a 1:1 with the Marketing VP and presented the analysis: 'A blanket ₹50 discount would cost roughly ₹2 crore in coupon spend over the month, but only lift orders by ~10% (4K additional orders/day × ₹150 AOV × 30 days ≈ ₹1.8 crore revenue) → Net loss ≈ ₹0.2 crore. Instead, I proposed a targeted discount: (1) Inactive users (no order in the last 30 days) → ₹50 win-back coupon, (2) Active users → no discount (they already order without one). Estimated cost: ₹0.5 crore, estimated incremental revenue: ₹0.8 crore → Net gain ≈ ₹0.3 crore.'"
Result: "Marketing VP agreed to targeted approach. Feb campaign generated ₹0.3 crore net revenue (vs ₹0.2 crore loss with blanket discount). VP credited me in company-wide email for 'data-driven thinking.' Built trust with Marketing team, now consulted on all campaign decisions."
Why strong:
- ✅ Shows respect (didn't say "Marketing is wrong" → presented data respectfully)
- ✅ Quantified trade-offs (blanket discount loses ₹0.2 crore vs targeted gains ₹0.3 crore)
- ✅ Proposed alternative (not just criticism → offered a better solution)
- ✅ Positive outcome (stakeholder agreed, revenue increased, relationship strengthened)
Common Behavioral Questions (Quick Answers)
Q: Why do you want to work here?
- Research company (mission, products, culture)
- "I'm excited about Flipkart's mission to democratize commerce in India. I love analyzing e-commerce data (my portfolio project: Zomato analysis with 50K orders). I want to work on large-scale problems (450M users) and learn from strong analytics team."
Q: What's your greatest weakness?
- Honest but show growth: "Early in career, I focused too much on technical perfection (spending 3 days optimizing SQL query for 5% performance gain). I learned: Business impact > Technical perfection. Now I prioritize: Is this analysis driving decision (high impact)? If yes, invest time; if no, use quick-and-dirty approach (80/20 rule)."
Q: Where do you see yourself in 5 years?
- "In 2-3 years: Senior Data Analyst leading projects, mentoring junior analysts. In 4-5 years: Analytics Manager managing team of 5-8 analysts, defining analytics strategy for business unit (e.g., Grocery vertical). Long-term: Director of Analytics, shaping company-wide data culture."