Project Overview — What We'll Build
In this project, you'll analyze real restaurant data from Zomato to answer business questions:
Key Questions We'll Answer:
- Which cities have the highest-rated restaurants?
- What's the relationship between price range and rating?
- Which cuisines are most popular in each city?
- Do restaurants accepting online orders have higher ratings?
- What factors predict a restaurant's success?
Skills You'll Practice:
- Loading and exploring messy CSV data
- Handling missing values and data type issues
- Cleaning text data (restaurant names, cuisines)
- Grouping and aggregating by multiple dimensions
- Creating visualizations to communicate insights
- Drawing business conclusions from data
Tools:
- Pandas for data manipulation
- NumPy for numerical operations
- Matplotlib and Seaborn for visualization
Dataset Information
Source: Kaggle — Zomato Bangalore Restaurants
Size: ~56,000 restaurants from Bangalore
Columns:
name— Restaurant nameonline_order— Accepts online orders (Yes/No)book_table— Table booking available (Yes/No)rate— Average rating (e.g., "4.1/5", "3.8 /5")votes— Number of votes/reviewslocation— Area/neighborhoodrest_type— Restaurant type (Casual Dining, Cafe, etc.)cuisines— Cuisines offered (comma-separated)approx_cost(for two people)— Estimated cost for twolisted_in(type)— Meal type (Delivery, Dine-out, etc.)
Download: Get the dataset from Kaggle or use this direct CSV link (example mirror).
Step 1 — Setup and Load Data
Install Required Libraries
pip install pandas numpy matplotlib seabornLoad the Dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Configure visualization settings
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
# Load data
df = pd.read_csv('zomato.csv')
# First look at the data
print(df.head())
print(f"\nDataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")Expected Output:
name online_order book_table rate votes \
0 Jalsa Yes Yes 4.1/5 775
1 Spice Elephant Yes No 4.1/5 787
2 San Churro Yes No 3.8/5 918
3 Addhuri Udupi Bhojana No No 3.7/5 88
4 Grand Village No No 3.8/5 166
location rest_type cuisines approx_cost(for two people) \
0 Banashankari Casual Dining North Indian, Mughlai, Chinese 800
1 Banashankari Casual Dining North Indian, Chinese, Biryani 800
2 Banashankari Cafe, Casual Dining Cafe, Mexican 800
3 Banashankari Quick Bites South Indian, North Indian 300
4 Basavanagudi Casual Dining North Indian, Rajasthani 600
listed_in(type)
0 Buffet
1 Buffet
2 Buffet
3 Delivery
4 Dine-out
Dataset shape: (51717, 11)
Get Initial Insights
# Data info
print(df.info())
# Check for missing values
print("\nMissing values:")
print(df.isnull().sum())
# Statistical summary
print("\n" + "="*50)
print(df.describe())Step 2 — Clean the Data
Real-world data is messy. Let's clean it systematically.
Fix the Rating Column
The rate column has formats like "4.1/5", "NEW", "-", "3.8 /5". Let's standardize it:
# View unique rating formats
print(df['rate'].value_counts().head(10))
# Clean ratings: extract numeric value
def clean_rating(rate):
if pd.isna(rate):
return np.nan
if rate == 'NEW' or rate == '-':
return np.nan
# Extract numeric part (e.g., "4.1/5" → 4.1)
try:
return float(rate.split('/')[0].strip())
except:
return np.nan
df['rating'] = df['rate'].apply(clean_rating)
# Check the result
print(f"\nRatings cleaned. Sample values:")
print(df[['rate', 'rating']].head(10))
# Drop original rate column
df = df.drop(columns=['rate'])Fix the Cost Column
# Current format: "800" (string with commas for large values)
print(df['approx_cost(for two people)'].value_counts().head())
# Clean cost: remove commas, convert to numeric
df['cost_for_two'] = pd.to_numeric(
df['approx_cost(for two people)'].str.replace(',', ''),
errors='coerce'
)
# Drop original column
df = df.drop(columns=['approx_cost(for two people)'])
# Check for outliers
print(f"\nCost statistics:")
print(df['cost_for_two'].describe())Handle Missing Values
# Missing value summary
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({
'Missing': missing,
'Percentage': missing_pct
}).sort_values('Missing', ascending=False)
print(missing_df[missing_df['Missing'] > 0])
# Strategy:
# - rating: Drop rows (can't analyze restaurants without ratings)
# - cost_for_two: Fill with median by rest_type
# - cuisines: Fill with "Not Specified"
# Drop rows with missing ratings
df = df.dropna(subset=['rating'])
# Fill missing costs with median by restaurant type
df['cost_for_two'] = df.groupby('rest_type')['cost_for_two'].transform(
lambda x: x.fillna(x.median())
)
# Fill missing cuisines
df['cuisines'] = df['cuisines'].fillna('Not Specified')
print(f"\nAfter cleaning: {df.shape[0]} rows remaining")Create Additional Columns
# Binary flags
df['accepts_online_orders'] = (df['online_order'] == 'Yes').astype(int)
df['table_booking'] = (df['book_table'] == 'Yes').astype(int)
# Price category
df['price_category'] = pd.cut(
df['cost_for_two'],
bins=[0, 300, 600, 1000, 10000],
labels=['Budget', 'Mid-Range', 'Premium', 'Luxury']
)
# Rating category
df['rating_category'] = pd.cut(
df['rating'],
bins=[0, 2.5, 3.5, 4.0, 5.0],
labels=['Poor', 'Average', 'Good', 'Excellent']
)
print(df[['name', 'rating', 'rating_category', 'cost_for_two', 'price_category']].head())⚠️ CheckpointQuiz error: Missing or invalid options array
Step 3 — Exploratory Data Analysis
Now that data is clean, let's answer our business questions.
Q1: Which locations have the highest-rated restaurants?
# Top 15 locations by average rating (min 100 restaurants)
location_stats = df.groupby('location').agg({
'rating': 'mean',
'name': 'count'
}).rename(columns={'name': 'restaurant_count'})
top_locations = location_stats[location_stats['restaurant_count'] >= 100].sort_values(
'rating', ascending=False
).head(15)
print(top_locations)
# Visualize
plt.figure(figsize=(12, 6))
sns.barplot(data=top_locations.reset_index(), x='location', y='rating', palette='viridis')
plt.xticks(rotation=45, ha='right')
plt.title('Top 15 Locations by Average Restaurant Rating', fontsize=14, fontweight='bold')
plt.xlabel('Location')
plt.ylabel('Average Rating')
plt.axhline(df['rating'].mean(), color='red', linestyle='--', label=f'City Average: {df["rating"].mean():.2f}')
plt.legend()
plt.tight_layout()
plt.show()Q2: Price vs Rating — Do expensive restaurants rate higher?
# Average rating by price category
price_rating = df.groupby('price_category')['rating'].agg(['mean', 'median', 'count'])
print(price_rating)
# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Box plot
sns.boxplot(data=df, x='price_category', y='rating', palette='Set2', ax=axes[0])
axes[0].set_title('Rating Distribution by Price Category', fontweight='bold')
axes[0].set_xlabel('Price Category')
axes[0].set_ylabel('Rating')
# Scatter plot
axes[1].scatter(df['cost_for_two'], df['rating'], alpha=0.3, s=10)
axes[1].set_xlabel('Cost for Two (₹)')
axes[1].set_ylabel('Rating')
axes[1].set_title('Cost vs Rating — Scatter Plot', fontweight='bold')
axes[1].axhline(df['rating'].mean(), color='red', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
# Correlation
print(f"\nCorrelation between cost and rating: {df['cost_for_two'].corr(df['rating']):.3f}")Q3: Most popular cuisines
# Cuisines are comma-separated. Split and count.
from collections import Counter
all_cuisines = []
for cuisines_str in df['cuisines'].dropna():
cuisines = [c.strip() for c in cuisines_str.split(',')]
all_cuisines.extend(cuisines)
cuisine_counts = Counter(all_cuisines).most_common(15)
# Convert to DataFrame
cuisine_df = pd.DataFrame(cuisine_counts, columns=['Cuisine', 'Count'])
# Visualize
plt.figure(figsize=(12, 6))
sns.barplot(data=cuisine_df, x='Count', y='Cuisine', palette='magma')
plt.title('Top 15 Most Popular Cuisines in Bangalore', fontsize=14, fontweight='bold')
plt.xlabel('Number of Restaurants')
plt.ylabel('Cuisine')
plt.tight_layout()
plt.show()Q4: Online orders vs ratings
# Compare ratings: online vs no online
online_comparison = df.groupby('online_order')['rating'].agg(['mean', 'median', 'count'])
print(online_comparison)
# Statistical test (t-test)
from scipy import stats
online_yes = df[df['online_order'] == 'Yes']['rating']
online_no = df[df['online_order'] == 'No']['rating']
t_stat, p_value = stats.ttest_ind(online_yes, online_no)
print(f"\nT-test: t={t_stat:.3f}, p-value={p_value:.4f}")
if p_value < 0.05:
print("Significant difference! Restaurants with online orders have different ratings.")
# Visualize
plt.figure(figsize=(10, 6))
sns.violinplot(data=df, x='online_order', y='rating', palette='Set1')
plt.title('Rating Distribution: Online Orders vs No Online Orders', fontsize=14, fontweight='bold')
plt.xlabel('Accepts Online Orders')
plt.ylabel('Rating')
plt.tight_layout()
plt.show()Q5: Multi-factor analysis
# Rating by price category and online orders
pivot = df.pivot_table(
values='rating',
index='price_category',
columns='online_order',
aggfunc='mean'
)
print(pivot)
# Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(pivot, annot=True, cmap='YlGnBu', fmt='.2f', linewidths=1)
plt.title('Average Rating: Price Category vs Online Orders', fontsize=14, fontweight='bold')
plt.xlabel('Accepts Online Orders')
plt.ylabel('Price Category')
plt.tight_layout()
plt.show()Step 4 — Key Insights and Conclusions
After completing the analysis, summarize findings for stakeholders.
Key Findings
1. Location Matters
- Premium neighborhoods (Koramangala, Indiranagar) have higher average ratings (4.0+)
- Emerging areas have more variability in quality
- Recommendation: Target expansion in proven high-rating locations
2. Price ≠ Quality
- Weak correlation between cost and rating (r ≈ 0.15)
- Budget restaurants can achieve excellent ratings with good execution
- Luxury doesn't guarantee satisfaction
3. Online Orders = Higher Ratings
- Restaurants accepting online orders: 3.95 average
- No online orders: 3.65 average
- Statistically significant (p < 0.001)
- Recommendation: Encourage online ordering adoption
4. North Indian Dominates
- North Indian cuisine is most common (8,000+ restaurants)
- Followed by Chinese, South Indian, Fast Food
- Niche cuisines (Italian, Continental) are underserved — opportunity?
5. Table Booking Correlates with Higher Ratings
- Table booking available: 4.05 average
- No table booking: 3.75 average
- Suggests restaurants investing in service get rewarded
Business Recommendations
For Zomato:
- Incentivize restaurants to enable online orders (data shows rating boost)
- Focus acquisition efforts on high-rating neighborhoods
- Help budget restaurants market their quality (price doesn't predict rating)
For Restaurant Owners:
- Enable online orders — it correlates with 0.3 higher rating
- Invest in table booking systems for dine-in restaurants
- Location strategy: operate in proven neighborhoods or differentiate in emerging areas
Complete Code — All in One Place
Here's the entire analysis in one runnable script:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from collections import Counter
# Setup
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
# Load data
df = pd.read_csv('zomato.csv')
# Clean rating
def clean_rating(rate):
if pd.isna(rate) or rate in ['NEW', '-']:
return np.nan
try:
return float(rate.split('/')[0].strip())
except:
return np.nan
df['rating'] = df['rate'].apply(clean_rating)
# Clean cost
df['cost_for_two'] = pd.to_numeric(
df['approx_cost(for two people)'].str.replace(',', ''),
errors='coerce'
)
# Handle missing
df = df.dropna(subset=['rating'])
df['cost_for_two'] = df.groupby('rest_type')['cost_for_two'].transform(
lambda x: x.fillna(x.median())
)
df['cuisines'] = df['cuisines'].fillna('Not Specified')
# Feature engineering
df['accepts_online_orders'] = (df['online_order'] == 'Yes').astype(int)
df['price_category'] = pd.cut(
df['cost_for_two'],
bins=[0, 300, 600, 1000, 10000],
labels=['Budget', 'Mid-Range', 'Premium', 'Luxury']
)
# Analysis 1: Top locations
location_stats = df.groupby('location').agg({
'rating': 'mean',
'name': 'count'
}).rename(columns={'name': 'count'})
top_locations = location_stats[location_stats['count'] >= 100].sort_values(
'rating', ascending=False
).head(15)
plt.figure(figsize=(12, 6))
sns.barplot(data=top_locations.reset_index(), x='location', y='rating', palette='viridis')
plt.xticks(rotation=45, ha='right')
plt.title('Top Locations by Average Rating')
plt.tight_layout()
plt.savefig('top_locations.png', dpi=300)
plt.show()
# Analysis 2: Price vs Rating
print(f"Cost-Rating Correlation: {df['cost_for_two'].corr(df['rating']):.3f}")
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='price_category', y='rating', palette='Set2')
plt.title('Rating by Price Category')
plt.savefig('price_rating.png', dpi=300)
plt.show()
# Analysis 3: Online orders
online_yes = df[df['online_order'] == 'Yes']['rating']
online_no = df[df['online_order'] == 'No']['rating']
t_stat, p_value = stats.ttest_ind(online_yes, online_no)
print(f"\nOnline Orders Impact:")
print(f" With online: {online_yes.mean():.2f}")
print(f" Without online: {online_no.mean():.2f}")
print(f" T-test p-value: {p_value:.4f}")
# Save cleaned data
df.to_csv('zomato_cleaned.csv', index=False)
print("\nCleaned data saved to: zomato_cleaned.csv")⚠️ FinalQuiz error: Missing or invalid questions array
⚠️ SummarySection error: Missing or invalid items array
Received: {"hasItems":false,"isArray":false}
Extension Challenges
Ready to take this project further? Try these:
1. Cuisine Combination Analysis
- Which cuisine combinations (e.g., "Chinese, North Indian") are most popular?
- Do multi-cuisine restaurants rate higher or lower than specialists?
- Create a network graph showing cuisine co-occurrence
2. Location Clustering
- Group similar locations using restaurant features (avg rating, cost, cuisine mix)
- Use K-means clustering to identify "restaurant neighborhood archetypes"
- Visualize clusters on a map (if you add latitude/longitude data)
3. Predictive Modeling
- Build a regression model to predict restaurant rating from features (cost, location, online orders, cuisines)
- Which features matter most? (Use feature importance from Random Forest)
- Can you predict success for a new restaurant concept?
4. Time-Series Analysis
- If you can find historical Zomato data (ratings over time), analyze trends
- Do restaurants decline in rating after initial hype?
- Identify restaurants improving vs declining
5. Sentiment Analysis
- Scrape restaurant reviews (check Zomato's terms of service)
- Use NLP to analyze review sentiment
- Does text sentiment correlate with numeric ratings?
Add these to your portfolio to stand out!