Pandas DataFrames Tutorial — Complete Guide for Analysts

If SQL is the language of databases, Pandas is the language of data analysis in Python. Master DataFrames and you'll never struggle with messy data again.

📚 Beginner · ⏱️ 14 min

What is a Pandas DataFrame?

A Pandas DataFrame is a two-dimensional labeled data structure with columns that can be of different types — like a spreadsheet or SQL table in Python. It's the most important data structure for data analysis.

Anatomy of a DataFrame

import pandas as pd

# Create a sample DataFrame
data = {
    'order_id': [1001, 1002, 1003, 1004, 1005],
    'customer': ['Priya', 'Rahul', 'Anjali', 'Vikas', 'Neha'],
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Chennai'],
    'amount': [2500, 3200, 1800, 4100, 2900],
    'status': ['delivered', 'delivered', 'pending', 'delivered', 'cancelled']
}

df = pd.DataFrame(data)
print(df)

Output:

   order_id customer       city  amount     status
0      1001    Priya     Mumbai    2500  delivered
1      1002    Rahul      Delhi    3200  delivered
2      1003   Anjali     Mumbai    1800    pending
3      1004    Vikas  Bangalore    4100  delivered
4      1005     Neha    Chennai    2900  cancelled

Key components:

  • Index: The row labels (0, 1, 2, 3, 4 by default)
  • Columns: The column names ('order_id', 'customer', 'city', 'amount', 'status')
  • Values: The actual data in each cell
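
Each of these components is directly accessible as an attribute, which is handy when you need column names or raw values programmatically. A quick check using the sample df above:

# Inspect the three components of the DataFrame
print(df.index)    # RangeIndex(start=0, stop=5, step=1)
print(df.columns)  # Index(['order_id', 'customer', ...], dtype='object')
print(df.values)   # the underlying data as a NumPy array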

Creating DataFrames from Different Sources

# From a list of dictionaries (each dict is a row)
orders = [
    {'order_id': 1001, 'city': 'Mumbai', 'amount': 2500},
    {'order_id': 1002, 'city': 'Delhi', 'amount': 3200},
    {'order_id': 1003, 'city': 'Mumbai', 'amount': 1800}
]
df = pd.DataFrame(orders)

# From a CSV file (most common in real work)
df = pd.read_csv('swiggy_orders.csv')

# From Excel
df = pd.read_excel('sales_data.xlsx', sheet_name='March_2026')

# From a SQL query
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql_query("SELECT * FROM orders WHERE city = 'Mumbai'", conn)

# From JSON (API responses)
df = pd.read_json('api_data.json')
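
Most pd.read_* functions have a matching df.to_* writer for the reverse direction. A minimal sketch (the file names here are just placeholders):

# Write a DataFrame back out to disk
df.to_csv('cleaned_orders.csv', index=False)   # index=False skips the row labels
df.to_excel('report.xlsx', sheet_name='orders', index=False)
df.to_json('orders.json', orient='records')    # one JSON object per row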

Think of it this way...

A DataFrame is like a smart spreadsheet. It has all the structure of Excel (rows, columns, formulas) but with the power of programming — you can process millions of rows instantly, apply complex logic, and never lose track of what you did.


Selecting Columns and Rows

The most common operation in data analysis: getting exactly the data you need. Pandas offers multiple ways to select data.

Selecting Columns

# Single column (returns a Series)
cities = df['city']
print(type(cities))  # <class 'pandas.core.series.Series'>

# Multiple columns (returns a DataFrame)
subset = df[['customer', 'city', 'amount']]
print(subset)

# All columns except some
df_without_status = df.drop(columns=['status'])

Selecting Rows by Position with .iloc

.iloc uses integer position (0-indexed, like Python lists):

# First row
first_row = df.iloc[0]

# First 3 rows
first_three = df.iloc[0:3]  # or df.iloc[:3]

# Last row
last_row = df.iloc[-1]

# Specific rows (1st, 3rd, 5th)
specific = df.iloc[[0, 2, 4]]

# Rows and columns by position
# Rows 0-2, columns 0-1
subset = df.iloc[0:3, 0:2]

Selecting Rows by Label with .loc

.loc uses labels (index and column names):

# Select rows by index label (if index is set to a custom value)
df_indexed = df.set_index('order_id')
order_1001 = df_indexed.loc[1001]

# Select rows and columns by label
# Rows 0-2, columns 'customer' and 'amount'
subset = df.loc[0:2, ['customer', 'amount']]

# Select all rows, specific columns
cities_amounts = df.loc[:, ['city', 'amount']]
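
One subtle difference worth memorizing: .loc slices include both endpoints, while .iloc follows Python's usual half-open convention. A quick check with the sample df (default integer index assumed):

print(len(df.iloc[0:3]))  # 3 rows: positions 0, 1, 2 (endpoint excluded)
print(len(df.loc[0:3]))   # 4 rows: labels 0, 1, 2 and 3 (endpoint included)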

Filtering Rows with Boolean Indexing

This is how you'll select data most often — using conditions:

# Orders over ₹3000
high_value = df[df['amount'] > 3000]

# Orders from Mumbai
mumbai = df[df['city'] == 'Mumbai']

# Multiple conditions with & (and) or | (or)
mumbai_delivered = df[(df['city'] == 'Mumbai') & (df['status'] == 'delivered')]

# OR condition: Mumbai or Delhi
metro = df[(df['city'] == 'Mumbai') | (df['city'] == 'Delhi')]

# Using .isin() for multiple values (cleaner than multiple OR)
metro = df[df['city'].isin(['Mumbai', 'Delhi', 'Bangalore'])]

# NOT condition with ~
not_cancelled = df[~(df['status'] == 'cancelled')]
# Or simpler:
not_cancelled = df[df['status'] != 'cancelled']

Info

Important: When using multiple conditions, always wrap each condition in parentheses and use & for AND, | for OR. Python's and and or keywords don't work with Pandas boolean indexing; they try to reduce a whole Series to a single True/False and raise a "truth value is ambiguous" error.
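
The parentheses matter because & and | bind more tightly than comparison operators, so without them the expression is grouped wrong before Pandas ever sees it. If the symbols feel noisy, df.query() accepts plain and/or inside a string:

# Same filters, written with .query()
mumbai_delivered = df.query("city == 'Mumbai' and status == 'delivered'")
high_value = df.query("amount > 3000")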


Adding and Modifying Data

DataFrames are mutable — you can add columns, modify values, and insert rows.

Adding New Columns

# Add a constant column
df['year'] = 2026

# Calculate from existing columns
df['gst'] = df['amount'] * 0.18
df['total_with_gst'] = df['amount'] + df['gst']

# Conditional column with .apply()
def categorize_order(amount):
    if amount > 3000:
        return 'High'
    elif amount > 1500:
        return 'Medium'
    else:
        return 'Low'

df['category'] = df['amount'].apply(categorize_order)

# Conditional column with np.where (faster for simple conditions)
import numpy as np
df['is_high_value'] = np.where(df['amount'] > 3000, 'Yes', 'No')

# Multiple conditions with np.select
conditions = [
    df['amount'] > 3000,
    df['amount'] > 1500,
    df['amount'] <= 1500
]
choices = ['High', 'Medium', 'Low']
df['category'] = np.select(conditions, choices, default='Unknown')
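
For ordered numeric bins like these, pd.cut is another idiomatic option. A sketch using the same thresholds (by default each bin includes its right edge, which matches the logic above):

# pd.cut assigns each amount to a bin: (0, 1500] -> Low, (1500, 3000] -> Medium, above 3000 -> High
df['category'] = pd.cut(
    df['amount'],
    bins=[0, 1500, 3000, float('inf')],
    labels=['Low', 'Medium', 'High']
)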

Modifying Existing Values

# Change all values in a column
df['city'] = df['city'].str.upper()  # MUMBAI, DELHI, etc.

# Change specific values
df.loc[df['city'] == 'MUMBAI', 'city'] = 'Mumbai'

# Replace specific values
df['status'] = df['status'].replace({'pending': 'in_progress', 'cancelled': 'failed'})

# Derive a new column from a mapping (cities missing from the dict become NaN)
city_mapping = {'Mumbai': 'MH', 'Delhi': 'DL', 'Bangalore': 'KA', 'Chennai': 'TN'}
df['state_code'] = df['city'].map(city_mapping)

Adding Rows

# Create a new row as a dictionary
new_order = {
    'order_id': 1006,
    'customer': 'Amit',
    'city': 'Pune',
    'amount': 3500,
    'status': 'delivered'
}

# Append using pd.concat (preferred in modern Pandas)
new_row_df = pd.DataFrame([new_order])
df = pd.concat([df, new_row_df], ignore_index=True)

# Note: df.append() was deprecated in Pandas 1.4.0 and removed in Pandas 2.0

Dropping Columns and Rows

# Drop columns
df = df.drop(columns=['gst', 'total_with_gst'])

# Drop rows by index
df = df.drop([0, 2])  # Drop rows at index 0 and 2

# Drop rows by condition (keep only what you want)
df = df[df['status'] != 'cancelled']  # Remove cancelled orders

Inspecting and Summarizing DataFrames

Before analyzing data, you need to understand what you're working with. These methods give you quick insights.

Basic Information

# First and last rows
df.head()       # First 5 rows
df.head(10)     # First 10 rows
df.tail()       # Last 5 rows

# Random sample
df.sample(5)    # 5 random rows (useful for large datasets)

# Shape: (rows, columns)
print(df.shape)  # (1000, 5) means 1000 rows, 5 columns

# Column names
print(df.columns)  # Index(['order_id', 'customer', 'city', 'amount', 'status'], dtype='object')

# Data types of each column
print(df.dtypes)
# order_id      int64
# customer     object
# city         object
# amount        int64
# status       object

# Memory usage
print(df.memory_usage(deep=True))

# Concise summary
df.info()

Example output of df.info():

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   order_id  1000 non-null   int64
 1   customer  1000 non-null   object
 2   city      998 non-null    object
 3   amount    1000 non-null   int64
 4   status    1000 non-null   object
dtypes: int64(2), object(3)
memory usage: 39.2+ KB

Statistical Summary

# Summary statistics for numeric columns
df.describe()

Output:

          order_id       amount
count  1000.000000  1000.000000
mean   1500.500000  2847.300000
std     288.819436   976.234111
min    1001.000000   500.000000
25%    1250.750000  2100.000000
50%    1500.500000  2800.000000
75%    1750.250000  3600.000000
max    2000.000000  5000.000000
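
By default, describe() summarizes only numeric columns. Passing include lets you profile text columns too:

# Summary for text (object) columns: count, unique values, most frequent value
df.describe(include='object')

# Or summarize every column at once
df.describe(include='all')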

Value Counts and Unique Values

# Count occurrences of each unique value
print(df['city'].value_counts())
# Mumbai       350
# Delhi        280
# Bangalore    220
# Chennai      150

# Unique values
print(df['city'].unique())
# array(['Mumbai', 'Delhi', 'Bangalore', 'Chennai'], dtype=object)

# Number of unique values
print(df['city'].nunique())  # 4

# Check for missing values
print(df.isnull().sum())
# order_id     0
# customer     0
# city         2
# amount       0
# status       0

# Percentage of missing values
print((df.isnull().sum() / len(df)) * 100)
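
Once you've found missing values, the two standard responses are dropping or filling. A minimal sketch:

# Drop rows that are missing any value
df_clean = df.dropna()

# Drop rows only if a specific column is missing
df_clean = df.dropna(subset=['city'])

# Or fill missing values with a placeholder
df['city'] = df['city'].fillna('Unknown')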

Think of it this way...

When Swiggy analysts receive a new dataset of restaurant orders, they run df.info(), df.describe(), and df.isnull().sum() first. This 30-second check reveals data types, missing values, and outliers before spending hours on flawed analysis.
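
That 30-second routine is easy to wrap in a reusable helper. A sketch (the function name is just illustrative):

def first_look(df):
    """Quick profile of a new DataFrame: shape, types, nulls, stats."""
    print(f"Shape: {df.shape[0]} rows x {df.shape[1]} columns")
    df.info()
    print(df.isnull().sum())
    print(df.describe())

first_look(df)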


Sorting and Ranking Data

Sorting reveals patterns: top customers, worst-performing products, chronological order for time-series analysis.

Sorting by Values

# Sort by one column (ascending)
df_sorted = df.sort_values('amount')

# Sort descending
df_sorted = df.sort_values('amount', ascending=False)

# Sort by multiple columns
df_sorted = df.sort_values(['city', 'amount'], ascending=[True, False])
# Sorts by city A-Z, then within each city by amount highest-first

# Sort by index
df_sorted = df.sort_index()

Ranking Values

# Add a rank column (1 = highest amount)
df['amount_rank'] = df['amount'].rank(ascending=False)

# Rank within groups (rank by amount within each city)
df['city_rank'] = df.groupby('city')['amount'].rank(ascending=False)

# Display top 3 orders per city
top_3_per_city = df[df['city_rank'] <= 3].sort_values(['city', 'city_rank'])
print(top_3_per_city[['order_id', 'city', 'amount', 'city_rank']])
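
One gotcha: .rank() averages ties by default, so two equal amounts can both receive rank 1.5, which makes a <= 3 filter behave oddly. The method parameter controls tie-breaking:

# 'first' breaks ties by order of appearance, guaranteeing whole-number ranks
df['city_rank'] = df.groupby('city')['amount'].rank(ascending=False, method='first')

# 'dense' gives ties the same rank without skipping the next one
df['dense_rank'] = df['amount'].rank(ascending=False, method='dense')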

Finding Top N and Bottom N

# Top 10 orders by amount
top_10 = df.nlargest(10, 'amount')

# Bottom 10 orders
bottom_10 = df.nsmallest(10, 'amount')

# Top 5 per city (requires groupby)
top_5_per_city = df.groupby('city').apply(lambda x: x.nlargest(5, 'amount')).reset_index(drop=True)
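
The groupby().apply() version works, but recent Pandas versions emit a deprecation warning about applying functions on the grouping columns. A sort-then-head idiom avoids that and reads cleanly:

# Alternative: sort once by amount, then keep the first 5 rows of each city group
top_5_per_city = (
    df.sort_values('amount', ascending=False)
      .groupby('city')
      .head(5)
)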
