What is Pandas?
Pandas is Python's most powerful library for data manipulation and analysis. Think of it as Excel, but:
✅ Dramatically faster than Excel on large datasets
✅ Handles millions of rows without crashing
✅ Fully scriptable (no manual clicking)
✅ More powerful operations than Excel formulas
✅ The industry standard, used by data teams worldwide
```python
import pandas as pd
import numpy as np
```

In practice, almost every data analyst who works in Python ends up using Pandas daily.
DataFrames — The Core Structure
A DataFrame is like an Excel spreadsheet in Python — rows and columns of data.
Creating a DataFrame
```python
# From a dictionary
data = {
    'name': ['Rahul', 'Priya', 'Amit', 'Sneha'],
    'age': [25, 28, 24, 26],
    'city': ['Mumbai', 'Delhi', 'Bangalore', 'Mumbai'],
    'salary': [50000, 60000, 45000, 55000]
}
df = pd.DataFrame(data)
print(df)
```

Output:

|   | name  | age | city      | salary |
|---|-------|-----|-----------|--------|
| 0 | Rahul | 25  | Mumbai    | 50000  |
| 1 | Priya | 28  | Delhi     | 60000  |
| 2 | Amit  | 24  | Bangalore | 45000  |
| 3 | Sneha | 26  | Mumbai    | 55000  |
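A dictionary of lists isn't the only way in. A list of row dictionaries also works: each dict becomes one row, and its keys become the column names (the data here is illustrative):

```python
import pandas as pd

# Each dict is one row; keys become column names
rows = [
    {'name': 'Rahul', 'age': 25, 'city': 'Mumbai'},
    {'name': 'Priya', 'age': 28, 'city': 'Delhi'},
]
df = pd.DataFrame(rows)
print(df.shape)  # (2, 3)
```

This form is handy when rows arrive one at a time, e.g. from an API response.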
Reading Data from Files
Read CSV
```python
# Basic read
df = pd.read_csv('sales_data.csv')

# With options
df = pd.read_csv('data.csv',
                 sep=',',                      # Delimiter
                 encoding='utf-8',             # Encoding
                 na_values=['NA', 'Missing'])  # Treat these as null
```

Read Excel
```python
# Single sheet
df = pd.read_excel('data.xlsx', sheet_name='Sales')

# Multiple sheets
sales_df = pd.read_excel('data.xlsx', sheet_name='Sales')
products_df = pd.read_excel('data.xlsx', sheet_name='Products')
```

Read from SQL Database
```python
import sqlite3

conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM orders WHERE date >= "2026-01-01"', conn)
```

Exploring Your Data
Quick Overview
```python
# First 5 rows
df.head()

# Last 5 rows
df.tail()

# Random 5 rows
df.sample(5)

# Shape (rows, columns)
print(df.shape)  # (1000, 8)

# Column names
print(df.columns)

# Data types
print(df.dtypes)

# Summary statistics
print(df.describe())

# Detailed info
df.info()
```

Example Output of df.info():
```text
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   name    1000 non-null   object
 1   age     998 non-null    float64
 2   city    995 non-null    object
 3   salary  1000 non-null   int64
```
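Another exploration tool worth knowing at this stage is `value_counts()`, which tallies how often each unique value appears in a column (illustrative data):

```python
import pandas as pd

# value_counts: frequency of each unique value in a column
df = pd.DataFrame({'city': ['Mumbai', 'Delhi', 'Mumbai', 'Pune']})
counts = df['city'].value_counts()
print(counts)  # Mumbai appears twice, the rest once each
```

It's often the fastest way to spot typos and inconsistent spellings in a categorical column.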
Selecting Data
Select Columns
```python
# Single column (returns Series)
ages = df['age']

# Multiple columns (returns DataFrame)
subset = df[['name', 'salary']]

# All columns except one (returns a new DataFrame; df is unchanged)
df_no_age = df.drop('age', axis=1)
```

Select Rows by Position
```python
# First row
df.iloc[0]

# First 10 rows
df.iloc[0:10]

# Specific rows
df.iloc[[0, 5, 10]]

# Rows 10-19, columns 2-3 (the end of a slice is exclusive)
df.iloc[10:20, 2:4]
```

Select Rows by Label
```python
# Set index first
df_indexed = df.set_index('name')

# Select by index label
df_indexed.loc['Rahul']
```

Filtering Data
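Every filter below builds on the same mechanism: comparing a column produces a boolean Series with one True/False per row, and indexing the DataFrame with that mask keeps only the True rows. A tiny illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({'name': ['Rahul', 'Priya', 'Amit'],
                   'age': [25, 28, 24]})

# The comparison alone returns a boolean Series
mask = df['age'] < 25
print(mask.tolist())              # [False, False, True]

# Indexing with the mask keeps the True rows
print(df[mask]['name'].tolist())  # ['Amit']
```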
Single Condition
```python
# High earners
high_earners = df[df['salary'] > 50000]

# From Mumbai
mumbai_people = df[df['city'] == 'Mumbai']

# Age below 25
young = df[df['age'] < 25]
```

Multiple Conditions (AND)
```python
# Young AND high earners
young_rich = df[(df['age'] < 30) & (df['salary'] > 50000)]
```

Multiple Conditions (OR)
```python
# Mumbai OR Delhi
target_cities = df[(df['city'] == 'Mumbai') | (df['city'] == 'Delhi')]
```

Using .isin()
```python
# Multiple cities
cities = ['Mumbai', 'Delhi', 'Bangalore']
metro_people = df[df['city'].isin(cities)]
```

String Contains
```python
# Names starting with 'R'
r_names = df[df['name'].str.startswith('R')]

# Cities containing 'ore'
ore_cities = df[df['city'].str.contains('ore')]
```

Data Cleaning
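The subsections below each target one kind of mess. A single tiny frame can exhibit several at once: a missing value, a duplicate row, and stray whitespace (data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'name': [' Rahul', 'Priya', 'Priya', None],   # whitespace, duplicate, missing
    'city': ['Mumbai', 'Delhi', 'Delhi', 'Pune'],
})

# Drop the null row, drop the duplicate, strip the whitespace
clean = (df.dropna()
           .drop_duplicates()
           .assign(name=lambda d: d['name'].str.strip()))
print(len(clean))  # 2 rows survive
```

Chaining like this is a common style; each step returns a new DataFrame, so the original stays untouched.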
1. Handling Missing Values
```python
# Check for missing values
print(df.isnull().sum())

# Percentage of missing values per column
print(df.isnull().sum() / len(df) * 100)

# Drop rows with ANY missing value
df_clean = df.dropna()

# Drop rows where a specific column is null
df_clean = df.dropna(subset=['email'])

# Drop columns with too many nulls
df_clean = df.dropna(axis=1, thresh=500)  # Keep columns with 500+ non-null values

# Fill missing values (assign back; inplace=True on a single column is unreliable)
df['age'] = df['age'].fillna(df['age'].mean())  # Fill with mean
df['city'] = df['city'].fillna('Unknown')       # Fill with a constant
df['salary'] = df['salary'].ffill()             # Forward fill
```

2. Removing Duplicates
```python
# Check for duplicates
print(f"Duplicates: {df.duplicated().sum()}")

# Remove all duplicate rows
df_unique = df.drop_duplicates()

# Remove duplicates based on specific columns
df_unique = df.drop_duplicates(subset=['email'], keep='first')
```

3. Cleaning Text Data
```python
# Convert to lowercase
df['name'] = df['name'].str.lower()

# Remove leading/trailing whitespace
df['name'] = df['name'].str.strip()

# Replace text
df['city'] = df['city'].str.replace('Banglore', 'Bangalore')

# Remove special characters
df['phone'] = df['phone'].str.replace(r'\D', '', regex=True)  # Keep only digits
```

4. Converting Data Types
```python
# Convert to numeric (invalid values become NaN)
df['salary'] = pd.to_numeric(df['salary'], errors='coerce')

# Convert to datetime
df['date'] = pd.to_datetime(df['date'])

# Convert to categorical (saves memory)
df['city'] = df['city'].astype('category')
```

Sorting Data
```python
# Sort by one column
df_sorted = df.sort_values('salary', ascending=False)

# Sort by multiple columns
df_sorted = df.sort_values(['city', 'salary'], ascending=[True, False])

# Reset index after sorting
df_sorted = df_sorted.reset_index(drop=True)
```

Creating New Columns
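The key idea behind every example in this section is that column expressions are vectorized: one expression applies to every row at once, with no loop (illustrative figures):

```python
import pandas as pd

df = pd.DataFrame({'salary': [50000, 60000]})

# One expression, applied to every row
df['annual'] = df['salary'] * 12
print(df['annual'].tolist())  # [600000, 720000]
```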
Simple Calculations
```python
# Add a new column
df['salary_with_gst'] = df['salary'] * 1.18

# Another derived column
df['annual_salary'] = df['salary'] * 12
```

Conditional Columns
```python
# Using np.where (if-else)
df['category'] = np.where(df['salary'] > 50000, 'High', 'Low')

# Using apply with a lambda
df['age_group'] = df['age'].apply(
    lambda x: 'Young' if x < 25 else ('Mid' if x < 35 else 'Senior')
)

# Using np.select (the first matching condition wins)
conditions = [
    df['salary'] > 70000,
    df['salary'] > 50000,
    df['salary'] > 30000
]
choices = ['Platinum', 'Gold', 'Silver']
df['tier'] = np.select(conditions, choices, default='Bronze')
```

GroupBy — Aggregations
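GroupBy follows a split-apply-combine pattern: split the rows into groups by a key, apply a function to each group, and combine the results into one output. A minimal sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Mumbai', 'Delhi', 'Mumbai'],
    'salary': [50000, 60000, 55000],
})

# Split by city, apply mean to each group's salaries, combine into a Series
avg = df.groupby('city')['salary'].mean()
print(avg['Mumbai'])  # 52500.0
```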
Basic GroupBy
```python
# Average salary by city
city_avg = df.groupby('city')['salary'].mean()

# Count by city
city_count = df.groupby('city').size()

# Multiple aggregations
city_stats = df.groupby('city')['salary'].agg(['mean', 'min', 'max', 'count'])
```

Advanced GroupBy
```python
# Group by multiple columns
multi_group = df.groupby(['city', 'age_group'])['salary'].mean()

# Different aggregations on different columns
summary = df.groupby('city').agg({
    'salary': ['mean', 'sum'],
    'age': 'mean',
    'name': 'count'
}).reset_index()

# Named aggregations, including a custom one
df.groupby('city')['salary'].agg(
    avg_salary='mean',
    total_salary='sum',
    salary_range=lambda x: x.max() - x.min()
)
```

Merging DataFrames
Sample Data
```python
# Customers
customers = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'name': ['Rahul', 'Priya', 'Amit']
})

# Orders
orders = pd.DataFrame({
    'order_id': [101, 102, 103],
    'customer_id': [1, 1, 2],
    'amount': [5000, 3000, 2000]
})
```

Inner Join
```python
# Keep only customers that have matching orders
merged = pd.merge(customers, orders, on='customer_id', how='inner')
```

Left Join
```python
# Keep every customer, with NaN where there is no matching order
merged = pd.merge(customers, orders, on='customer_id', how='left')
```

Multiple Keys
```python
# Match on a combination of columns
merged = pd.merge(df1, df2, on=['customer_id', 'date'], how='inner')
```

Pivot Tables
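A pivot table is a groupby reshaped into a grid: values of the index column become rows, values of the columns column become columns, and each cell holds the aggregate for that pair. A tiny illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['Mumbai', 'Mumbai', 'Delhi'],
    'age_group': ['Young', 'Mid', 'Young'],
    'salary': [45000, 60000, 50000],
})

# Rows = cities, columns = age groups, cells = mean salary
pivot = df.pivot_table(values='salary', index='city',
                       columns='age_group', aggfunc='mean')
print(pivot.loc['Mumbai', 'Young'])  # 45000.0
```

Pairs with no data (here, Delhi/Mid) come out as NaN.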
```python
# Create a pivot table
pivot = df.pivot_table(
    values='salary',
    index='city',
    columns='age_group',
    aggfunc='mean'
)

# Multiple aggregations
pivot_multi = df.pivot_table(
    values='salary',
    index='city',
    aggfunc=['mean', 'count', 'sum']
)
```

Real-World Example: Sales Analysis
```python
import pandas as pd

# Read sales data
sales = pd.read_csv('sales_2026.csv')

# 1. Data Cleaning
sales = sales.dropna(subset=['amount', 'customer_id'])
sales = sales.drop_duplicates()
sales['date'] = pd.to_datetime(sales['date'])
sales['city'] = sales['city'].str.strip().str.title()

# 2. Feature Engineering
sales['month'] = sales['date'].dt.month
sales['quarter'] = sales['date'].dt.quarter
sales['is_high_value'] = sales['amount'] > 10000

# 3. Filter for Q1 2026
q1_sales = sales[
    (sales['date'] >= '2026-01-01') &
    (sales['date'] <= '2026-03-31')
]

# 4. Aggregations
revenue_by_city = q1_sales.groupby('city').agg({
    'amount': 'sum',
    'order_id': 'count'
}).rename(columns={'amount': 'total_revenue', 'order_id': 'order_count'})

# 5. Sort and get top 10 cities
top_cities = revenue_by_city.sort_values('total_revenue', ascending=False).head(10)

# 6. Export results
top_cities.to_csv('top_cities_q1_2026.csv')
print("Analysis complete!")
```

Exporting Data
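Each writer below mirrors one of the readers from earlier. A quick round-trip sketch, using a temporary file so nothing is left on disk (data is illustrative):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'city': ['Mumbai', 'Delhi'], 'revenue': [5000, 3000]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'out.csv')
    df.to_csv(path, index=False)   # index=False skips the row numbers
    back = pd.read_csv(path)

print(back.equals(df))  # True
```

Without `index=False`, the row index would be written as an extra unnamed column and the round trip would not match.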
```python
# To CSV
df.to_csv('output.csv', index=False)

# To Excel
df.to_excel('output.xlsx', sheet_name='Sales', index=False)

# To SQL
df.to_sql('sales_table', conn, if_exists='replace', index=False)

# To JSON
df.to_json('output.json', orient='records')
```

Common Mistakes & Solutions
❌ Mistake 1: SettingWithCopyWarning
Problem:

```python
df[df['age'] > 25]['salary'] = 60000  # Warning! Modifies a copy, not df
```

Solution:
```python
df.loc[df['age'] > 25, 'salary'] = 60000  # Correct
```

❌ Mistake 2: Expecting Methods to Modify in Place
Problem:

```python
df.dropna()  # Returns a new DataFrame; df is unchanged
```

Solution:
```python
df = df.dropna()  # Reassign (preferred)
# OR
df.dropna(inplace=True)  # Modify in place
```

❌ Mistake 3: Wrong Operator for Conditions
Problem:

```python
df[df['age'] > 25 and df['salary'] > 50000]  # Error! Use & and |, not and / or
```

Solution:
```python
df[(df['age'] > 25) & (df['salary'] > 50000)]  # Correct
```

Summary
✅ DataFrames are like Excel tables in Python
✅ Read data from CSV, Excel, SQL with pd.read_*
✅ Filter rows: df[df['col'] > value]
✅ Clean data: dropna(), drop_duplicates(), fillna()
✅ Create columns with calculations and conditions
✅ GroupBy for aggregations: df.groupby('col').agg()
✅ Merge DataFrames with pd.merge()
✅ Export results to CSV, Excel, SQL
Next Topic: Data Visualization Principles
Ready to turn your data into beautiful charts? Let's go! 📊