Why Python for Data Analysis?
Python has become the dominant language for data analysis — and for good reason. While SQL is perfect for querying databases, Python excels at everything that comes after: cleaning messy data, performing complex transformations, statistical analysis, and creating visualizations.
At companies like Swiggy, PhonePe, and Flipkart, data analysts use SQL to extract data and Python to analyze it. Here's why:
Flexibility: Python handles any data format — CSV, JSON, Excel, APIs, databases — without breaking a sweat. Need to merge data from three different sources with different structures? Python makes it straightforward.
Libraries: The Python ecosystem offers specialized tools for every task:
- Pandas: DataFrames for tabular data manipulation (like Excel on steroids)
- NumPy: Fast numerical computations and array operations
- Matplotlib/Seaborn: Publication-quality visualizations
- Scikit-learn: Machine learning for predictions and clustering
Reproducibility: Unlike Excel, Python scripts are reproducible. Write your analysis once, run it on updated data anytime. No manual copy-pasting, no "I forgot which cells I edited" moments.
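As a sketch of that three-source merge, here's a toy example (the DataFrames below stand in for real CSV/JSON/Excel files — the structure is made up for illustration):

```python
import pandas as pd

# Hypothetical sources with different structures:
# an orders table, a customer lookup, and a per-city target sheet
orders = pd.DataFrame({'order_id': [1, 2], 'customer_id': [10, 11], 'amount': [2500, 3200]})
customers = pd.DataFrame({'customer_id': [10, 11], 'city': ['Mumbai', 'Delhi']})
targets = pd.DataFrame({'city': ['Mumbai', 'Delhi'], 'target': [40000, 35000]})

# Two chained merges combine all three sources into one table
merged = orders.merge(customers, on='customer_id').merge(targets, on='city')
print(merged)
```

Each `merge` joins on a shared key column, the same idea as a SQL JOIN.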
SQL is like a librarian who helps you find books. Python is what you do with those books — reading, highlighting, summarizing, and connecting ideas across multiple sources to write your own analysis.
Setting Up Your Python Environment
Before writing code, you need Python installed and a way to run your analysis. Here's the quickest path:
Install Python with Anaconda
Anaconda is a Python distribution that includes Python + all major data science libraries pre-installed. Download it from anaconda.com/download — it's free.
Once installed, you can run Python in three ways:
1. Jupyter Notebook (recommended for beginners):
jupyter notebook
Opens in your browser. Combines code, output, and notes in one document. Perfect for exploratory analysis.
2. VS Code (recommended for serious projects):
Install VS Code, add the Python extension, and run .py files directly. Better for scripts you'll reuse.
3. Command line (quick tests):
python
>>> 2 + 2
4
Essential Libraries to Install
If you're not using Anaconda, install these manually:
pip install pandas numpy matplotlib seaborn jupyter
Verify your installation:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
print(f"Pandas: {pd.__version__}")
print(f"NumPy: {np.__version__}")
# Should print version numbers without errors
Pro Tip: Create a dedicated folder for your data projects (e.g., C:\data-projects\ or ~/data-projects/). Store your datasets, notebooks, and scripts there. Keep it organized from day one.
Your First Data Analysis in Python
Let's analyze real data from scratch. We'll use a simple Flipkart orders dataset to answer: Which city generates the most revenue?
Step 1: Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Step 2: Load Data
# Load CSV file
orders = pd.read_csv('flipkart_orders.csv')
# View first 5 rows
print(orders.head())
# Check shape: (rows, columns)
print(f"Dataset has {orders.shape[0]} rows and {orders.shape[1]} columns")
Sample output:
order_id city amount order_date
0 1001 Mumbai 2500 2026-01-15
1 1002 Delhi 3200 2026-01-16
2 1003 Mumbai 1800 2026-01-16
3 1004 Bangalore 4100 2026-01-17
4 1005 Chennai 2900 2026-01-18
Dataset has 50000 rows and 4 columns
Step 3: Explore the Data
# Get summary statistics
print(orders.describe())
# Check for missing values
print(orders.isnull().sum())
# Check data types
print(orders.dtypes)
Step 4: Analyze Revenue by City
# Group by city and sum revenue
revenue_by_city = orders.groupby('city')['amount'].sum().sort_values(ascending=False)
print(revenue_by_city)
Output:
city
Mumbai 42350000
Delhi 38200000
Bangalore 35600000
Chennai 28900000
Name: amount, dtype: int64
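If you want more than one statistic per city, `groupby` can compute several at once with `.agg`. A minimal sketch on toy data (the column names mirror the orders example above):

```python
import pandas as pd

orders = pd.DataFrame({
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Chennai'],
    'amount': [2500, 3200, 1800, 2900],
})

# One pass over the data yields total, average, and order count per city
stats = orders.groupby('city')['amount'].agg(['sum', 'mean', 'count'])
print(stats)
```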
Step 5: Visualize Results
# Create bar chart
revenue_by_city.plot(kind='bar', color='steelblue', figsize=(10, 6))
plt.title('Revenue by City', fontsize=16)
plt.xlabel('City', fontsize=12)
plt.ylabel('Revenue (₹)', fontsize=12)
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
That's it! Five steps to go from raw data to insights with visualization. This workflow — load, explore, analyze, visualize — is what data analysts do daily.
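If you want the chart as an image file for a report, call `plt.savefig` before `plt.show`. A self-contained sketch (the small revenue Series and filename are made up for illustration; the `Agg` backend lets it run without a display):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, safe on servers
import matplotlib.pyplot as plt
import pandas as pd

revenue = pd.Series({'Mumbai': 42350000, 'Delhi': 38200000}, name='amount')
revenue.plot(kind='bar', color='steelblue')
plt.title('Revenue by City')
plt.tight_layout()
plt.savefig('revenue_by_city.png', dpi=150, bbox_inches='tight')  # writes a PNG to disk
```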
Key Python Concepts for Analysts
You don't need to master all of Python — just the parts relevant to data work. Here are the essentials:
1. Variables and Data Types
# Numbers
revenue = 125000
growth_rate = 0.23
# Strings
city = "Mumbai"
status = 'delivered'
# Booleans
is_active = True
has_discount = False
# Lists (ordered collection)
cities = ['Mumbai', 'Delhi', 'Bangalore']
revenues = [42000, 38000, 35000]
# Dictionaries (key-value pairs)
order = {
'order_id': 1001,
'city': 'Mumbai',
'amount': 2500
}
2. Indexing and Slicing
cities = ['Mumbai', 'Delhi', 'Bangalore', 'Chennai', 'Hyderabad']
# Access by index (starts at 0)
print(cities[0]) # Mumbai
print(cities[-1]) # Hyderabad (last element)
# Slicing: [start:end] (end not included)
print(cities[1:3]) # ['Delhi', 'Bangalore']
print(cities[:2]) # First 2: ['Mumbai', 'Delhi']
print(cities[2:])   # From index 2 onward: ['Bangalore', 'Chennai', 'Hyderabad']
3. Loops and Conditionals
# For loop
cities = ['Mumbai', 'Delhi', 'Bangalore']
for city in cities:
print(f"Analyzing {city} data...")
# If-elif-else
amount = 2500
if amount > 3000:
category = 'High'
elif amount > 1000:
category = 'Medium'
else:
category = 'Low'
4. Functions (Reusable Code Blocks)
def calculate_gst(amount, rate=0.18):
"""Calculate GST on an amount"""
gst = amount * rate
total = amount + gst
return total
# Use the function
order_amount = 2500
final_amount = calculate_gst(order_amount)
print(f"Total with GST: ₹{final_amount}")  # ₹2950.0
5. List Comprehensions (Concise Loops)
# Long way
amounts = [1000, 1500, 2000, 2500]
with_gst = []
for amount in amounts:
with_gst.append(amount * 1.18)
# Short way (list comprehension)
with_gst = [amount * 1.18 for amount in amounts]
print(with_gst) # [1180.0, 1770.0, 2360.0, 2950.0]
# With condition
high_value = [amount for amount in amounts if amount > 1500]
print(high_value)  # [2000, 2500]
For Analysts: You'll use loops less often than you think. Pandas has vectorized operations that apply functions to entire columns at once — much faster than looping row by row.
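To make the vectorization point concrete, here's the GST calculation from earlier applied to a whole column at once — no loop required (toy data for illustration):

```python
import pandas as pd

orders = pd.DataFrame({'amount': [1000, 1500, 2000, 2500]})

# Vectorized: one expression multiplies every row's amount at once,
# running in optimized C under the hood instead of a Python-level loop
orders['with_gst'] = orders['amount'] * 1.18

print(orders)
```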
Introduction to Pandas DataFrames
Pandas is the core library for data analysis in Python. Its central data structure is the DataFrame — a table with rows and columns, like a spreadsheet or SQL result set.
Creating DataFrames
import pandas as pd
# From a dictionary
data = {
'city': ['Mumbai', 'Delhi', 'Bangalore', 'Chennai'],
'orders': [1200, 1100, 950, 800],
'revenue': [42000, 38000, 35000, 28000]
}
df = pd.DataFrame(data)
print(df)
Output:
city orders revenue
0 Mumbai 1200 42000
1 Delhi 1100 38000
2 Bangalore 950 35000
3 Chennai 800 28000
Basic DataFrame Operations
# Select a column (returns a Series)
print(df['city'])
# Select multiple columns (returns a DataFrame)
print(df[['city', 'revenue']])
# Filter rows
high_revenue = df[df['revenue'] > 35000]
print(high_revenue)
# Add a new column
df['avg_order_value'] = df['revenue'] / df['orders']
print(df)
Reading from Files
# CSV files (most common)
df = pd.read_csv('orders.csv')
# Excel files
df = pd.read_excel('sales_data.xlsx', sheet_name='Q1_Sales')
# From SQL database
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql_query('SELECT * FROM orders', conn)
# JSON files (APIs often return JSON)
df = pd.read_json('api_response.json')
Essential DataFrame Methods
# View data
df.head(10) # First 10 rows
df.tail(5) # Last 5 rows
df.sample(10) # Random 10 rows
# Get info
df.shape # (rows, columns)
df.info() # Column types and memory usage
df.describe() # Statistical summary of numeric columns
# Column operations
df.columns # List of column names
df.dtypes # Data type of each column
df['amount'].sum() # Sum of amount column
df['amount'].mean() # Average
df['amount'].median() # Median
df['amount'].max() # Maximum value
If you're comfortable with SQL, think of a DataFrame as the result of a SELECT query that you can continue manipulating in memory. df.groupby('city') is like GROUP BY city, df[df['amount'] > 1000] is like WHERE amount > 1000.
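A side-by-side sketch of that SQL-to-pandas mapping, on toy data — each pandas line plays the role of the SQL statement in the comment above it:

```python
import pandas as pd

orders = pd.DataFrame({
    'city': ['Mumbai', 'Delhi', 'Mumbai'],
    'amount': [2500, 800, 1800],
})

# SELECT * FROM orders WHERE amount > 1000
big = orders[orders['amount'] > 1000]

# SELECT city, SUM(amount) FROM orders GROUP BY city
totals = orders.groupby('city')['amount'].sum()

# SELECT * FROM orders ORDER BY amount DESC LIMIT 2
top2 = orders.sort_values('amount', ascending=False).head(2)

print(big)
print(totals)
print(top2)
```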