Sales Data Analysis Project

Sales Data Analysis

Project Overview

In this project, you will analyze a company's sales data to find:

Which products sell the most?
Which months have highest sales?
Which cities generate most revenue?
What patterns exist in the data?

Skills you'll practice:

Data loading and cleaning
Exploratory data analysis
Data visualization
Drawing business insights

Step 1: Setup and Import Libraries

First, let's import everything we need:

code.py

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Make plots look nice
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

print("Libraries imported successfully!")

Step 2: Create Sample Data

For this project, we'll create realistic sales data:

code.py

# Create sample sales data
np.random.seed(42)

# Generate 1000 sales records
n_records = 1000

data = {
    'Order_ID': range(1001, 1001 + n_records),
    'Product': np.random.choice(
        ['Laptop', 'Phone', 'Tablet', 'Headphones', 'Charger', 'Cable'],
        n_records
    ),
    'Quantity': np.random.randint(1, 5, n_records),
    'Price': np.random.choice([999, 699, 499, 149, 29, 15], n_records),
    'City': np.random.choice(
        ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
        n_records
    ),
    'Month': np.random.choice(
        ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
         'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
        n_records
    )
}

df = pd.DataFrame(data)

# Calculate total sale amount
df['Total'] = df['Quantity'] * df['Price']

print("Data created successfully!")
print(f"Total records: {len(df)}")

Step 3: Explore the Data

Let's understand what we have:

code.py

# First look at the data
print("=== First 5 Rows ===")
print(df.head())

Order_ID	Product	Quantity	Price	City	Month	Total
1001	Laptop	2	999	Chicago	Mar	1998
1002	Phone	1	699	Houston	Jul	699
1003	Cable	3	15	New York	Jan	45
1004	Headphones	2	149	Phoenix	Nov	298
1005	Tablet	1	499	Los Angeles	May	499

code.py

# Data info
print("\n=== Data Info ===")
print(df.info())

Column	Type	Non-Null
Order_ID	int64	1000
Product	object	1000
Quantity	int64	1000
Price	int64	1000
City	object	1000
Month	object	1000
Total	int64	1000

code.py

# Basic statistics
print("\n=== Statistics ===")
print(df.describe())

Stat	Quantity	Price	Total
count	1000	1000	1000
mean	2.5	398	995
std	1.1	356	1124
min	1	15	15
max	4	999	3996

Step 4: Data Cleaning

Check for issues:

code.py

# Check for missing values
print("=== Missing Values ===")
print(df.isnull().sum())

Column	Missing
Order_ID	0
Product	0
Quantity	0
Price	0
City	0
Month	0
Total	0

code.py

# Check for duplicates
duplicates = df.duplicated().sum()
print(f"\nDuplicate rows: {duplicates}")

# Check unique values
print("\n=== Unique Values ===")
print(f"Products: {df['Product'].nunique()}")
print(f"Cities: {df['City'].nunique()}")
print(f"Months: {df['Month'].nunique()}")

Our data is clean! No missing values or duplicates.

Step 5: Sales by Product

Question: Which products sell the most?

code.py

# Total sales by product
product_sales = df.groupby('Product')['Total'].sum().sort_values(ascending=False)

print("=== Sales by Product ===")
print(product_sales)

Product	Total Sales ($)
Laptop	312,687
Phone	245,790
Tablet	178,143
Headphones	52,318
Charger	10,324
Cable	5,670

code.py

# Visualize
plt.figure(figsize=(10, 6))
product_sales.plot(kind='bar', color='steelblue', edgecolor='black')
plt.title('Total Sales by Product', fontsize=14, fontweight='bold')
plt.xlabel('Product')
plt.ylabel('Total Sales ($)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Sales by Product

Insight: Laptops generate the most revenue, followed by Phones and Tablets.

Step 6: Sales by City

Question: Which cities generate most revenue?

code.py

# Sales by city
city_sales = df.groupby('City')['Total'].sum().sort_values(ascending=False)

print("=== Sales by City ===")
print(city_sales)

City	Total Sales ($)
New York	198,450
Los Angeles	185,230
Chicago	167,890
Houston	152,340
Phoenix	100,022

code.py

# Pie chart for city distribution
plt.figure(figsize=(8, 8))
plt.pie(city_sales, labels=city_sales.index, autopct='%1.1f%%',
        colors=sns.color_palette('pastel'), startangle=90)
plt.title('Sales Distribution by City', fontsize=14, fontweight='bold')
plt.show()

Sales by City

Insight: New York and Los Angeles are the top markets.

Step 7: Monthly Sales Trend

Question: Which months have highest sales?

code.py

# Define month order
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Sales by month
monthly_sales = df.groupby('Month')['Total'].sum()
monthly_sales = monthly_sales.reindex(month_order)

print("=== Monthly Sales ===")
print(monthly_sales)

Month	Total Sales ($)
Jan	62,340
Feb	58,120
Mar	71,450
Apr	65,890
May	69,230
Jun	72,100
Jul	78,450
Aug	75,320
Sep	68,900
Oct	71,230
Nov	85,670
Dec	95,232

code.py

# Line chart for monthly trend
plt.figure(figsize=(12, 6))
plt.plot(monthly_sales.index, monthly_sales.values, marker='o',
         linewidth=2, markersize=8, color='coral')
plt.fill_between(monthly_sales.index, monthly_sales.values, alpha=0.3, color='coral')
plt.title('Monthly Sales Trend', fontsize=14, fontweight='bold')
plt.xlabel('Month')
plt.ylabel('Total Sales ($)')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Monthly Sales Trend

Insight: Sales peak in November and December (holiday season).

Step 8: Product Performance by City

Question: Do different cities prefer different products?

code.py

# Create pivot table
pivot = df.pivot_table(values='Total', index='City',
                       columns='Product', aggfunc='sum')

print("=== Sales: City vs Product ===")
print(pivot)

City	Cable	Charger	Headphones	Laptop	Phone	Tablet
Chicago	1,020	2,030	10,430	58,320	52,100	35,990
Houston	1,200	1,890	9,870	62,450	45,320	31,610
Los Angeles	1,350	2,450	11,230	68,900	55,800	45,500
New York	1,500	2,100	12,450	72,340	58,230	40,830
Phoenix	600	1,854	8,338	50,677	34,340	24,213

code.py

# Heatmap
plt.figure(figsize=(10, 6))
sns.heatmap(pivot, annot=True, fmt='.0f', cmap='YlOrRd')
plt.title('Sales Heatmap: City vs Product', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

Sales Heatmap

Insight: Laptops sell best across all cities. New York leads in most categories.

Step 9: Order Size Analysis

Question: What's the typical order size?

code.py

# Order quantity distribution
print("=== Quantity Distribution ===")
print(df['Quantity'].value_counts().sort_index())

Quantity	Count
1	248
2	256
3	251
4	245

code.py

# Histogram
plt.figure(figsize=(8, 5))
df['Quantity'].hist(bins=4, edgecolor='black', color='lightgreen')
plt.title('Order Quantity Distribution', fontsize=14, fontweight='bold')
plt.xlabel('Quantity')
plt.ylabel('Number of Orders')
plt.tight_layout()
plt.show()

Insight: Order quantities are evenly distributed (1-4 items per order).

Step 10: Key Metrics Summary

Let's create a summary dashboard:

code.py

# Calculate key metrics
total_revenue = df['Total'].sum()
total_orders = len(df)
avg_order_value = df['Total'].mean()
top_product = df.groupby('Product')['Total'].sum().idxmax()
top_city = df.groupby('City')['Total'].sum().idxmax()
best_month = df.groupby('Month')['Total'].sum().idxmax()

print("=" * 50)
print("        SALES DASHBOARD SUMMARY")
print("=" * 50)
print(f"Total Revenue:      ${total_revenue:,.2f}")
print(f"Total Orders:       {total_orders:,}")
print(f"Avg Order Value:    ${avg_order_value:,.2f}")
print(f"Top Product:        {top_product}")
print(f"Top City:           {top_city}")
print(f"Best Month:         {best_month}")
print("=" * 50)

Metric	Value
Total Revenue	$804,932
Total Orders	1,000
Avg Order Value	$804.93
Top Product	Laptop
Top City	New York
Best Month	December

Dashboard Summary

Step 11: Save Your Analysis

code.py

# Save cleaned data
df.to_csv('sales_analysis_results.csv', index=False)

# Save summary to text file
with open('sales_summary.txt', 'w') as f:
    f.write("SALES ANALYSIS SUMMARY\n")
    f.write("=" * 40 + "\n")
    f.write(f"Total Revenue: ${total_revenue:,.2f}\n")
    f.write(f"Total Orders: {total_orders}\n")
    f.write(f"Top Product: {top_product}\n")
    f.write(f"Top City: {top_city}\n")
    f.write(f"Best Month: {best_month}\n")

print("Files saved successfully!")

Business Recommendations

Based on our analysis:

Finding	Recommendation
Laptops = highest revenue	Focus marketing on laptop promotions
Dec & Nov = peak sales	Plan inventory for holiday season
New York = top market	Consider expanding operations there
Low cable/charger sales	Bundle accessories with main products
Even order quantity	No need to push larger orders

Complete Code

Here's the full code in one place:

code.py

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Setup
plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

# Create data
n_records = 1000
data = {
    'Order_ID': range(1001, 1001 + n_records),
    'Product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Headphones', 'Charger', 'Cable'], n_records),
    'Quantity': np.random.randint(1, 5, n_records),
    'Price': np.random.choice([999, 699, 499, 149, 29, 15], n_records),
    'City': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'], n_records),
    'Month': np.random.choice(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], n_records)
}
df = pd.DataFrame(data)
df['Total'] = df['Quantity'] * df['Price']

# Analysis
print("Top 5 Products by Revenue:")
print(df.groupby('Product')['Total'].sum().sort_values(ascending=False))

print("\nTop Cities by Revenue:")
print(df.groupby('City')['Total'].sum().sort_values(ascending=False))

print("\nMonthly Sales:")
print(df.groupby('Month')['Total'].sum())

# Visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Product sales
df.groupby('Product')['Total'].sum().sort_values().plot(kind='barh', ax=axes[0,0], color='steelblue')
axes[0,0].set_title('Sales by Product')

# City sales
df.groupby('City')['Total'].sum().plot(kind='pie', ax=axes[0,1], autopct='%1.1f%%')
axes[0,1].set_title('Sales by City')

# Monthly trend
month_order = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
monthly = df.groupby('Month')['Total'].sum().reindex(month_order)
axes[1,0].plot(monthly.index, monthly.values, marker='o', color='coral')
axes[1,0].set_title('Monthly Sales Trend')
axes[1,0].tick_params(axis='x', rotation=45)

# Heatmap
pivot = df.pivot_table(values='Total', index='City', columns='Product', aggfunc='sum')
sns.heatmap(pivot, ax=axes[1,1], cmap='YlOrRd', annot=True, fmt='.0f')
axes[1,1].set_title('City vs Product Sales')

plt.tight_layout()
plt.savefig('sales_dashboard.png', dpi=300)
plt.show()

print("\nProject Complete!")

What You Learned

Loading and exploring data with Pandas
Cleaning and validating data
Grouping and aggregating data
Creating various visualizations
Drawing business insights from data
Saving results and reports

Congratulations! You've completed your first data analysis project!

What's Next?

Try the Customer Churn Prediction project to learn machine learning.