Python Data Analysis Project — Zomato Restaurants | DataPath

🎯

Project Overview — What We'll Build

In this project, you'll analyze real restaurant data from Zomato to answer business questions:

Key Questions We'll Answer:

Which cities have the highest-rated restaurants?
What's the relationship between price range and rating?
Which cuisines are most popular in each city?
Do restaurants accepting online orders have higher ratings?
What factors predict a restaurant's success?

Skills You'll Practice:

Loading and exploring messy CSV data
Handling missing values and data type issues
Cleaning text data (restaurant names, cuisines)
Grouping and aggregating by multiple dimensions
Creating visualizations to communicate insights
Drawing business conclusions from data

Tools:

Pandas for data manipulation
NumPy for numerical operations
Matplotlib and Seaborn for visualization

Dataset Information

Source: Kaggle — Zomato Bangalore Restaurants

Size: ~56,000 restaurants from Bangalore

Columns:

name — Restaurant name
online_order — Accepts online orders (Yes/No)
book_table — Table booking available (Yes/No)
rate — Average rating (e.g., "4.1/5", "3.8 /5")
votes — Number of votes/reviews
location — Area/neighborhood
rest_type — Restaurant type (Casual Dining, Cafe, etc.)
cuisines — Cuisines offered (comma-separated)
approx_cost(for two people) — Estimated cost for two
listed_in(type) — Meal type (Delivery, Dine-out, etc.)

Download: Get the dataset from Kaggle or use this direct CSV link (example mirror).

📥

Step 1 — Setup and Load Data

Install Required Libraries

$ terminalBash

pip install pandas numpy matplotlib seaborn

Load the Dataset

code.pyPython

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Configure visualization settings
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Load data
df = pd.read_csv('zomato.csv')

# First look at the data
print(df.head())
print(f"\nDataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")

Expected Output:

                      name online_order book_table   rate  votes  \
0        Jalsa              Yes          Yes        4.1/5    775
1  Spice Elephant            Yes          No        4.1/5    787
2       San Churro            Yes          No        3.8/5    918
3    Addhuri Udupi Bhojana   No           No        3.7/5     88
4        Grand Village        No           No        3.8/5    166

        location       rest_type        cuisines  approx_cost(for two people)  \
0       Banashankari  Casual Dining  North Indian, Mughlai, Chinese              800
1       Banashankari  Casual Dining    North Indian, Chinese, Biryani          800
2       Banashankari    Cafe, Casual Dining            Cafe, Mexican          800
3       Banashankari    Quick Bites        South Indian, North Indian          300
4       Basavanagudi    Casual Dining        North Indian, Rajasthani          600

  listed_in(type)
0        Buffet
1        Buffet
2        Buffet
3        Delivery
4        Dine-out

Dataset shape: (51717, 11)

Get Initial Insights

code.pyPython

# Data info
print(df.info())

# Check for missing values
print("\nMissing values:")
print(df.isnull().sum())

# Statistical summary
print("\n" + "="*50)
print(df.describe())

🧹

Step 2 — Clean the Data

Real-world data is messy. Let's clean it systematically.

Fix the Rating Column

The rate column has formats like "4.1/5", "NEW", "-", "3.8 /5". Let's standardize it:

code.pyPython

# View unique rating formats
print(df['rate'].value_counts().head(10))

# Clean ratings: extract numeric value
def clean_rating(rate):
    if pd.isna(rate):
        return np.nan
    if rate == 'NEW' or rate == '-':
        return np.nan
    # Extract numeric part (e.g., "4.1/5" → 4.1)
    try:
        return float(rate.split('/')[0].strip())
    except:
        return np.nan

df['rating'] = df['rate'].apply(clean_rating)

# Check the result
print(f"\nRatings cleaned. Sample values:")
print(df[['rate', 'rating']].head(10))

# Drop original rate column
df = df.drop(columns=['rate'])

Fix the Cost Column

code.pyPython

# Current format: "800" (string with commas for large values)
print(df['approx_cost(for two people)'].value_counts().head())

# Clean cost: remove commas, convert to numeric
df['cost_for_two'] = pd.to_numeric(
    df['approx_cost(for two people)'].str.replace(',', ''),
    errors='coerce'
)

# Drop original column
df = df.drop(columns=['approx_cost(for two people)'])

# Check for outliers
print(f"\nCost statistics:")
print(df['cost_for_two'].describe())

Handle Missing Values

code.pyPython

# Missing value summary
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({
    'Missing': missing,
    'Percentage': missing_pct
}).sort_values('Missing', ascending=False)

print(missing_df[missing_df['Missing'] > 0])

# Strategy:
# - rating: Drop rows (can't analyze restaurants without ratings)
# - cost_for_two: Fill with median by rest_type
# - cuisines: Fill with "Not Specified"

# Drop rows with missing ratings
df = df.dropna(subset=['rating'])

# Fill missing costs with median by restaurant type
df['cost_for_two'] = df.groupby('rest_type')['cost_for_two'].transform(
    lambda x: x.fillna(x.median())
)

# Fill missing cuisines
df['cuisines'] = df['cuisines'].fillna('Not Specified')

print(f"\nAfter cleaning: {df.shape[0]} rows remaining")

Create Additional Columns

code.pyPython

# Binary flags
df['accepts_online_orders'] = (df['online_order'] == 'Yes').astype(int)
df['table_booking'] = (df['book_table'] == 'Yes').astype(int)

# Price category
df['price_category'] = pd.cut(
    df['cost_for_two'],
    bins=[0, 300, 600, 1000, 10000],
    labels=['Budget', 'Mid-Range', 'Premium', 'Luxury']
)

# Rating category
df['rating_category'] = pd.cut(
    df['rating'],
    bins=[0, 2.5, 3.5, 4.0, 5.0],
    labels=['Poor', 'Average', 'Good', 'Excellent']
)

print(df[['name', 'rating', 'rating_category', 'cost_for_two', 'price_category']].head())

⚠️ CheckpointQuiz error: Missing or invalid options array

🔍

Step 3 — Exploratory Data Analysis

Now that data is clean, let's answer our business questions.

Q1: Which locations have the highest-rated restaurants?

code.pyPython

# Top 15 locations by average rating (min 100 restaurants)
location_stats = df.groupby('location').agg({
    'rating': 'mean',
    'name': 'count'
}).rename(columns={'name': 'restaurant_count'})

top_locations = location_stats[location_stats['restaurant_count'] >= 100].sort_values(
    'rating', ascending=False
).head(15)

print(top_locations)

# Visualize
plt.figure(figsize=(12, 6))
sns.barplot(data=top_locations.reset_index(), x='location', y='rating', palette='viridis')
plt.xticks(rotation=45, ha='right')
plt.title('Top 15 Locations by Average Restaurant Rating', fontsize=14, fontweight='bold')
plt.xlabel('Location')
plt.ylabel('Average Rating')
plt.axhline(df['rating'].mean(), color='red', linestyle='--', label=f'City Average: {df["rating"].mean():.2f}')
plt.legend()
plt.tight_layout()
plt.show()

Q2: Price vs Rating — Do expensive restaurants rate higher?

code.pyPython

# Average rating by price category
price_rating = df.groupby('price_category')['rating'].agg(['mean', 'median', 'count'])
print(price_rating)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot
sns.boxplot(data=df, x='price_category', y='rating', palette='Set2', ax=axes[0])
axes[0].set_title('Rating Distribution by Price Category', fontweight='bold')
axes[0].set_xlabel('Price Category')
axes[0].set_ylabel('Rating')

# Scatter plot
axes[1].scatter(df['cost_for_two'], df['rating'], alpha=0.3, s=10)
axes[1].set_xlabel('Cost for Two (₹)')
axes[1].set_ylabel('Rating')
axes[1].set_title('Cost vs Rating — Scatter Plot', fontweight='bold')
axes[1].axhline(df['rating'].mean(), color='red', linestyle='--', alpha=0.7)

plt.tight_layout()
plt.show()

# Correlation
print(f"\nCorrelation between cost and rating: {df['cost_for_two'].corr(df['rating']):.3f}")

Q3: Most popular cuisines

code.pyPython

# Cuisines are comma-separated. Split and count.
from collections import Counter

all_cuisines = []
for cuisines_str in df['cuisines'].dropna():
    cuisines = [c.strip() for c in cuisines_str.split(',')]
    all_cuisines.extend(cuisines)

cuisine_counts = Counter(all_cuisines).most_common(15)

# Convert to DataFrame
cuisine_df = pd.DataFrame(cuisine_counts, columns=['Cuisine', 'Count'])

# Visualize
plt.figure(figsize=(12, 6))
sns.barplot(data=cuisine_df, x='Count', y='Cuisine', palette='magma')
plt.title('Top 15 Most Popular Cuisines in Bangalore', fontsize=14, fontweight='bold')
plt.xlabel('Number of Restaurants')
plt.ylabel('Cuisine')
plt.tight_layout()
plt.show()

Q4: Online orders vs ratings

code.pyPython

# Compare ratings: online vs no online
online_comparison = df.groupby('online_order')['rating'].agg(['mean', 'median', 'count'])
print(online_comparison)

# Statistical test (t-test)
from scipy import stats

online_yes = df[df['online_order'] == 'Yes']['rating']
online_no = df[df['online_order'] == 'No']['rating']

t_stat, p_value = stats.ttest_ind(online_yes, online_no)
print(f"\nT-test: t={t_stat:.3f}, p-value={p_value:.4f}")
if p_value < 0.05:
    print("Significant difference! Restaurants with online orders have different ratings.")

# Visualize
plt.figure(figsize=(10, 6))
sns.violinplot(data=df, x='online_order', y='rating', palette='Set1')
plt.title('Rating Distribution: Online Orders vs No Online Orders', fontsize=14, fontweight='bold')
plt.xlabel('Accepts Online Orders')
plt.ylabel('Rating')
plt.tight_layout()
plt.show()

Q5: Multi-factor analysis

code.pyPython

# Rating by price category and online orders
pivot = df.pivot_table(
    values='rating',
    index='price_category',
    columns='online_order',
    aggfunc='mean'
)

print(pivot)

# Heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(pivot, annot=True, cmap='YlGnBu', fmt='.2f', linewidths=1)
plt.title('Average Rating: Price Category vs Online Orders', fontsize=14, fontweight='bold')
plt.xlabel('Accepts Online Orders')
plt.ylabel('Price Category')
plt.tight_layout()
plt.show()

💡

Step 4 — Key Insights and Conclusions

After completing the analysis, summarize findings for stakeholders.

Key Findings

1. Location Matters

Premium neighborhoods (Koramangala, Indiranagar) have higher average ratings (4.0+)
Emerging areas have more variability in quality
Recommendation: Target expansion in proven high-rating locations

2. Price ≠ Quality

Weak correlation between cost and rating (r ≈ 0.15)
Budget restaurants can achieve excellent ratings with good execution
Luxury doesn't guarantee satisfaction

3. Online Orders = Higher Ratings

Restaurants accepting online orders: 3.95 average
No online orders: 3.65 average
Statistically significant (p < 0.001)
Recommendation: Encourage online ordering adoption

4. North Indian Dominates

North Indian cuisine is most common (8,000+ restaurants)
Followed by Chinese, South Indian, Fast Food
Niche cuisines (Italian, Continental) are underserved — opportunity?

5. Table Booking Correlates with Higher Ratings

Table booking available: 4.05 average
No table booking: 3.75 average
Suggests restaurants investing in service get rewarded

Business Recommendations

For Zomato:

Incentivize restaurants to enable online orders (data shows rating boost)
Focus acquisition efforts on high-rating neighborhoods
Help budget restaurants market their quality (price doesn't predict rating)

For Restaurant Owners:

Enable online orders — it correlates with 0.3 higher rating
Invest in table booking systems for dine-in restaurants
Location strategy: operate in proven neighborhoods or differentiate in emerging areas

📝

Complete Code — All in One Place

Here's the entire analysis in one runnable script:

code.pyPython

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from collections import Counter

# Setup
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

# Load data
df = pd.read_csv('zomato.csv')

# Clean rating
def clean_rating(rate):
    if pd.isna(rate) or rate in ['NEW', '-']:
        return np.nan
    try:
        return float(rate.split('/')[0].strip())
    except:
        return np.nan

df['rating'] = df['rate'].apply(clean_rating)

# Clean cost
df['cost_for_two'] = pd.to_numeric(
    df['approx_cost(for two people)'].str.replace(',', ''),
    errors='coerce'
)

# Handle missing
df = df.dropna(subset=['rating'])
df['cost_for_two'] = df.groupby('rest_type')['cost_for_two'].transform(
    lambda x: x.fillna(x.median())
)
df['cuisines'] = df['cuisines'].fillna('Not Specified')

# Feature engineering
df['accepts_online_orders'] = (df['online_order'] == 'Yes').astype(int)
df['price_category'] = pd.cut(
    df['cost_for_two'],
    bins=[0, 300, 600, 1000, 10000],
    labels=['Budget', 'Mid-Range', 'Premium', 'Luxury']
)

# Analysis 1: Top locations
location_stats = df.groupby('location').agg({
    'rating': 'mean',
    'name': 'count'
}).rename(columns={'name': 'count'})
top_locations = location_stats[location_stats['count'] >= 100].sort_values(
    'rating', ascending=False
).head(15)

plt.figure(figsize=(12, 6))
sns.barplot(data=top_locations.reset_index(), x='location', y='rating', palette='viridis')
plt.xticks(rotation=45, ha='right')
plt.title('Top Locations by Average Rating')
plt.tight_layout()
plt.savefig('top_locations.png', dpi=300)
plt.show()

# Analysis 2: Price vs Rating
print(f"Cost-Rating Correlation: {df['cost_for_two'].corr(df['rating']):.3f}")

plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='price_category', y='rating', palette='Set2')
plt.title('Rating by Price Category')
plt.savefig('price_rating.png', dpi=300)
plt.show()

# Analysis 3: Online orders
online_yes = df[df['online_order'] == 'Yes']['rating']
online_no = df[df['online_order'] == 'No']['rating']
t_stat, p_value = stats.ttest_ind(online_yes, online_no)

print(f"\nOnline Orders Impact:")
print(f"  With online: {online_yes.mean():.2f}")
print(f"  Without online: {online_no.mean():.2f}")
print(f"  T-test p-value: {p_value:.4f}")

# Save cleaned data
df.to_csv('zomato_cleaned.csv', index=False)
print("\nCleaned data saved to: zomato_cleaned.csv")

⚠️ FinalQuiz error: Missing or invalid questions array

⚠️ SummarySection error: Missing or invalid items array

Received: {"hasItems":false,"isArray":false}

Extension Challenges

Ready to take this project further? Try these:

1. Cuisine Combination Analysis

Which cuisine combinations (e.g., "Chinese, North Indian") are most popular?
Do multi-cuisine restaurants rate higher or lower than specialists?
Create a network graph showing cuisine co-occurrence

2. Location Clustering

Group similar locations using restaurant features (avg rating, cost, cuisine mix)
Use K-means clustering to identify "restaurant neighborhood archetypes"
Visualize clusters on a map (if you add latitude/longitude data)

3. Predictive Modeling

Build a regression model to predict restaurant rating from features (cost, location, online orders, cuisines)
Which features matter most? (Use feature importance from Random Forest)
Can you predict success for a new restaurant concept?

4. Time-Series Analysis

If you can find historical Zomato data (ratings over time), analyze trends
Do restaurants decline in rating after initial hype?
Identify restaurants improving vs declining

5. Sentiment Analysis

Scrape restaurant reviews (check Zomato's terms of service)
Use NLP to analyze review sentiment
Does text sentiment correlate with numeric ratings?

Add these to your portfolio to stand out!

Python Project — Zomato Restaurant Analysis

Project Overview — What We'll Build

Dataset Information

Step 1 — Setup and Load Data

Install Required Libraries

Load the Dataset

Get Initial Insights

Step 2 — Clean the Data

Fix the Rating Column

Fix the Cost Column

Handle Missing Values

Create Additional Columns

Step 3 — Exploratory Data Analysis

Q1: Which locations have the highest-rated restaurants?

Q2: Price vs Rating — Do expensive restaurants rate higher?

Q3: Most popular cuisines

Q4: Online orders vs ratings

Q5: Multi-factor analysis

Step 4 — Key Insights and Conclusions

Key Findings

Business Recommendations

Complete Code — All in One Place

Extension Challenges