Topic 98 of

Python Practice Problems — 20 Data Analysis Exercises

Python interviews test pandas fluency: groupby, merge, pivot, handling missing values. Practice these 20 problems to master data manipulation faster than reading documentation.

📚Intermediate
⏱️20 min
5 quizzes
🐍

Setup & Dataset

Installation:

$ terminalBash
pip install pandas numpy matplotlib seaborn

Sample Dataset:

code.pyPython
import pandas as pd
import numpy as np

# Create sample sales data
data = {
    'date': pd.date_range('2025-01-01', periods=100),
    'product': np.random.choice(['A', 'B', 'C'], 100),
    'category': np.random.choice(['Electronics', 'Fashion'], 100),
    'quantity': np.random.randint(1, 20, 100),
    'price': np.random.randint(100, 1000, 100),
    'city': np.random.choice(['Mumbai', 'Delhi', 'Bangalore'], 100)
}

df = pd.DataFrame(data)
df['revenue'] = df['quantity'] * df['price']
🟢

Beginner Problems (1-8)

Problem 1: Basic filtering

Question: Filter rows where revenue > 5000 and city is Mumbai.

Solution:

code.pyPython
result = df[(df['revenue'] > 5000) & (df['city'] == 'Mumbai')]
print(result)

Problem 2: Group by and aggregate

Question: Calculate total revenue and average quantity by product.

Solution:

code.pyPython
result = df.groupby('product').agg({
    'revenue': 'sum',
    'quantity': 'mean'
}).round(2)
print(result)

Problem 3: Handle missing values

Question: Replace missing values in 'quantity' with column median.

Solution:

code.pyPython
df['quantity'].fillna(df['quantity'].median(), inplace=True)

# Alternative: drop rows with ANY null
df_clean = df.dropna()

Problem 4: Create new column

Question: Create 'price_category' column: Low (<500), Medium (500-800), High (>800).

Solution:

code.pyPython
df['price_category'] = pd.cut(
    df['price'],
    bins=[0, 500, 800, float('inf')],
    labels=['Low', 'Medium', 'High']
)

# Alternative using np.where or apply
df['price_category'] = df['price'].apply(
    lambda x: 'Low' if x < 500 else ('Medium' if x < 800 else 'High')
)

⚠️ CheckpointQuiz error: Missing or invalid options array

🟡

Intermediate Problems (9-15)

Problem 9: Pivot table

Question: Create pivot table showing total revenue by product (rows) and city (columns).

Solution:

code.pyPython
pivot = df.pivot_table(
    values='revenue',
    index='product',
    columns='city',
    aggfunc='sum',
    fill_value=0
)
print(pivot)

Problem 10: Merge datasets

Question: Merge sales data with product info on product ID.

Solution:

code.pyPython
# Sample product info
product_info = pd.DataFrame({
    'product': ['A', 'B', 'C'],
    'product_name': ['Widget', 'Gadget', 'Tool'],
    'cost': [50, 75, 100]
})

merged = df.merge(
    product_info,
    on='product',
    how='left'  # Keep all sales records
)

Problem 11: Time series resampling

Question: Calculate monthly total revenue from daily data.

Solution:

code.pyPython
df_monthly = df.set_index('date').resample('M')['revenue'].sum()
print(df_monthly)

# With additional stats
monthly_stats = df.set_index('date').resample('M').agg({
    'revenue': 'sum',
    'quantity': 'mean',
    'product': 'count'
})

Problem 12: Running total

Question: Calculate cumulative revenue over time.

Solution:

code.pyPython
df_sorted = df.sort_values('date')
df_sorted['cumulative_revenue'] = df_sorted['revenue'].cumsum()

Problem 13: Remove duplicates

Question: Remove duplicate rows based on date + product combination.

Solution:

code.pyPython
df_unique = df.drop_duplicates(subset=['date', 'product'], keep='first')
print(f"Removed {len(df) - len(df_unique)} duplicates")
🔴

Advanced Problems (16-20)

Problem 16: Complex aggregation

Question: For each city, find top 3 products by revenue.

Solution:

code.pyPython
top_products = (
    df.groupby(['city', 'product'])['revenue']
    .sum()
    .reset_index()
    .sort_values(['city', 'revenue'], ascending=[True, False])
    .groupby('city')
    .head(3)
)
print(top_products)

Problem 17: Apply custom function

Question: Calculate profit margin (assuming 30% cost).

Solution:

code.pyPython
def calculate_profit_margin(row):
    cost = row['revenue'] * 0.30
    profit = row['revenue'] - cost
    return (profit / row['revenue']) * 100

df['profit_margin'] = df.apply(calculate_profit_margin, axis=1)

# Vectorized (faster) alternative
df['profit_margin'] = ((df['revenue'] - df['revenue'] * 0.30) / df['revenue']) * 100

Problem 18: Handle outliers

Question: Cap revenue at 99th percentile to remove outliers.

Solution:

code.pyPython
upper_limit = df['revenue'].quantile(0.99)
df['revenue_capped'] = df['revenue'].clip(upper=upper_limit)

# Or remove outliers using IQR
Q1 = df['revenue'].quantile(0.25)
Q3 = df['revenue'].quantile(0.75)
IQR = Q3 - Q1
df_no_outliers = df[
    (df['revenue'] >= Q1 - 1.5 * IQR) &
    (df['revenue'] <= Q3 + 1.5 * IQR)
]

Problem 19: Percentage change

Question: Calculate month-over-month revenue growth rate.

Solution:

code.pyPython
monthly_rev = df.set_index('date').resample('M')['revenue'].sum()
monthly_rev_growth = monthly_rev.pct_change() * 100

print(f"Month-over-month growth:\n{monthly_rev_growth}")

Problem 20: Correlation analysis

Question: Find correlation between price and quantity sold.

Solution:

code.pyPython
correlation = df['price'].corr(df['quantity'])
print(f"Correlation: {correlation:.3f}")

# Correlation matrix for multiple columns
corr_matrix = df[['price', 'quantity', 'revenue']].corr()
print(corr_matrix)

# Visualize
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.show()
💡

Python Practice Tips

Common pandas Operations:

| Task | Method | Example | |------|--------|---------| | Filter rows | Boolean indexing | df[df['revenue'] > 1000] | | Select columns | Double brackets | df[['product', 'revenue']] | | Group & aggregate | groupby + agg | df.groupby('city').agg({'revenue': 'sum'}) | | Sort | sort_values | df.sort_values('revenue', ascending=False) | | Handle nulls | fillna, dropna | df['quantity'].fillna(0) | | Merge tables | merge | df1.merge(df2, on='id', how='left') | | Pivot | pivot_table | df.pivot_table(values='revenue', index='product') | | Apply function | apply | df['new_col'] = df.apply(func, axis=1) |

Performance Tips:

Vectorize instead of loops:

code.pyPython
# Slow (loop)
for i, row in df.iterrows():
    df.at[i, 'total'] = row['price'] * row['quantity']

# Fast (vectorized)
df['total'] = df['price'] * df['quantity']

Use query() for readability:

code.pyPython
# Traditional
df[(df['revenue'] > 5000) & (df['city'] == 'Mumbai')]

# Query (cleaner for complex conditions)
df.query('revenue > 5000 and city == "Mumbai"')

Chain operations:

code.pyPython
result = (
    df
    .query('revenue > 1000')
    .groupby('product')['revenue']
    .sum()
    .sort_values(ascending=False)
    .head(5)
)

⚠️ FinalQuiz error: Missing or invalid questions array

⚠️ SummarySection error: Missing or invalid items array

Received: {"hasItems":false,"isArray":false}