Pandas DataFrames Tutorial — Complete Guide for Analysts

If SQL is the language of databases, Pandas is the language of data analysis in Python. Master DataFrames and you'll never struggle with messy data again.

📚 Beginner · ⏱️ 14 min

What is a Pandas DataFrame?

A Pandas DataFrame is a two-dimensional labeled data structure with columns that can be of different types — like a spreadsheet or SQL table in Python. It's the most important data structure for data analysis.

Anatomy of a DataFrame

import pandas as pd

# Create a sample DataFrame
data = {
    'order_id': [1001, 1002, 1003, 1004, 1005],
    'customer': ['Priya', 'Rahul', 'Anjali', 'Vikas', 'Neha'],
    'city': ['Mumbai', 'Delhi', 'Mumbai', 'Bangalore', 'Chennai'],
    'amount': [2500, 3200, 1800, 4100, 2900],
    'status': ['delivered', 'delivered', 'pending', 'delivered', 'cancelled']
}

df = pd.DataFrame(data)
print(df)

Output:

   order_id customer       city  amount     status
0      1001    Priya     Mumbai    2500  delivered
1      1002    Rahul      Delhi    3200  delivered
2      1003   Anjali     Mumbai    1800    pending
3      1004    Vikas  Bangalore    4100  delivered
4      1005     Neha    Chennai    2900  cancelled

Key components:

  • Index: The row labels (0, 1, 2, 3, 4 by default)
  • Columns: The column names ('order_id', 'customer', 'city', 'amount', 'status')
  • Values: The actual data in each cell
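
Each of these components is directly accessible as an attribute, which is handy when you need column names or raw values programmatically. A quick check using the sample df above:

# Inspect the three components of the DataFrame
print(df.index)    # RangeIndex(start=0, stop=5, step=1)
print(df.columns)  # Index(['order_id', 'customer', ...], dtype='object')
print(df.values)   # the underlying data as a NumPy array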

Creating DataFrames from Different Sources

# From a list of dictionaries (each dict is a row)
orders = [
    {'order_id': 1001, 'city': 'Mumbai', 'amount': 2500},
    {'order_id': 1002, 'city': 'Delhi', 'amount': 3200},
    {'order_id': 1003, 'city': 'Mumbai', 'amount': 1800}
]
df = pd.DataFrame(orders)

# From a CSV file (most common in real work)
df = pd.read_csv('swiggy_orders.csv')

# From Excel
df = pd.read_excel('sales_data.xlsx', sheet_name='March_2026')

# From a SQL query
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql_query("SELECT * FROM orders WHERE city = 'Mumbai'", conn)

# From JSON (API responses)
df = pd.read_json('api_data.json')
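
Most pd.read_* functions have a matching df.to_* writer for the reverse direction. A minimal sketch (the file names here are just placeholders):

# Write a DataFrame back out to disk
df.to_csv('cleaned_orders.csv', index=False)   # index=False skips the row labels
df.to_excel('report.xlsx', sheet_name='orders', index=False)
df.to_json('orders.json', orient='records')    # one JSON object per row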

Think of it this way...

A DataFrame is like a smart spreadsheet. It has all the structure of Excel (rows, columns, formulas) but with the power of programming — you can process millions of rows instantly, apply complex logic, and never lose track of what you did.


Selecting Columns and Rows

The most common operation in data analysis: getting exactly the data you need. Pandas offers multiple ways to select data.

Selecting Columns

# Single column (returns a Series)
cities = df['city']
print(type(cities))  # <class 'pandas.core.series.Series'>

# Multiple columns (returns a DataFrame)
subset = df[['customer', 'city', 'amount']]
print(subset)

# All columns except some
df_without_status = df.drop(columns=['status'])

Selecting Rows by Position with .iloc

.iloc uses integer position (0-indexed, like Python lists):

# First row
first_row = df.iloc[0]

# First 3 rows
first_three = df.iloc[0:3]  # or df.iloc[:3]

# Last row
last_row = df.iloc[-1]

# Specific rows (1st, 3rd, 5th)
specific = df.iloc[[0, 2, 4]]

# Rows and columns by position
# Rows 0-2, columns 0-1
subset = df.iloc[0:3, 0:2]

Selecting Rows by Label with .loc

.loc uses labels (index and column names):

# Select rows by index label (if index is set to a custom value)
df_indexed = df.set_index('order_id')
order_1001 = df_indexed.loc[1001]

# Select rows and columns by label
# Rows 0-2, columns 'customer' and 'amount'
subset = df.loc[0:2, ['customer', 'amount']]

# Select all rows, specific columns
cities_amounts = df.loc[:, ['city', 'amount']]
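
One subtle difference worth memorizing: .loc slices include both endpoints, while .iloc follows Python's usual half-open convention. A quick check with the sample df (default integer index assumed):

print(len(df.iloc[0:3]))  # 3 rows: positions 0, 1, 2 (endpoint excluded)
print(len(df.loc[0:3]))   # 4 rows: labels 0, 1, 2 and 3 (endpoint included)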

Filtering Rows with Boolean Indexing

This is how you'll select data most often — using conditions:

# Orders over ₹3000
high_value = df[df['amount'] > 3000]

# Orders from Mumbai
mumbai = df[df['city'] == 'Mumbai']

# Multiple conditions with & (and) or | (or)
mumbai_delivered = df[(df['city'] == 'Mumbai') & (df['status'] == 'delivered')]

# OR condition: Mumbai or Delhi
metro = df[(df['city'] == 'Mumbai') | (df['city'] == 'Delhi')]

# Using .isin() for multiple values (cleaner than multiple OR)
metro = df[df['city'].isin(['Mumbai', 'Delhi', 'Bangalore'])]

# NOT condition with ~
not_cancelled = df[~(df['status'] == 'cancelled')]
# Or simpler:
not_cancelled = df[df['status'] != 'cancelled']

Info

Important: When using multiple conditions, always wrap each condition in parentheses and use & for AND, | for OR. Python's and and or keywords don't work with Pandas boolean indexing; they try to reduce a whole Series to a single True/False and raise a "truth value is ambiguous" error.
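
The parentheses matter because & and | bind more tightly than comparison operators, so without them the expression is grouped wrong before Pandas ever sees it. If the symbols feel noisy, df.query() accepts plain and/or inside a string:

# Same filters, written with .query()
mumbai_delivered = df.query("city == 'Mumbai' and status == 'delivered'")
high_value = df.query("amount > 3000")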


Adding and Modifying Data

DataFrames are mutable — you can add columns, modify values, and insert rows.

Adding New Columns

# Add a constant column
df['year'] = 2026

# Calculate from existing columns
df['gst'] = df['amount'] * 0.18
df['total_with_gst'] = df['amount'] + df['gst']

# Conditional column with .apply()
def categorize_order(amount):
    if amount > 3000:
        return 'High'
    elif amount > 1500:
        return 'Medium'
    else:
        return 'Low'

df['category'] = df['amount'].apply(categorize_order)

# Conditional column with np.where (faster for simple conditions)
import numpy as np
df['is_high_value'] = np.where(df['amount'] > 3000, 'Yes', 'No')

# Multiple conditions with np.select
conditions = [
    df['amount'] > 3000,
    df['amount'] > 1500,
    df['amount'] <= 1500
]
choices = ['High', 'Medium', 'Low']
df['category'] = np.select(conditions, choices, default='Unknown')
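
For ordered numeric bins like these, pd.cut is another idiomatic option. A sketch using the same thresholds (by default each bin includes its right edge, which matches the logic above):

# pd.cut assigns each amount to a bin: (0, 1500] -> Low, (1500, 3000] -> Medium, above 3000 -> High
df['category'] = pd.cut(
    df['amount'],
    bins=[0, 1500, 3000, float('inf')],
    labels=['Low', 'Medium', 'High']
)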

Modifying Existing Values

# Change all values in a column
df['city'] = df['city'].str.upper()  # MUMBAI, DELHI, etc.

# Change specific values
df.loc[df['city'] == 'MUMBAI', 'city'] = 'Mumbai'

# Replace specific values
df['status'] = df['status'].replace({'pending': 'in_progress', 'cancelled': 'failed'})

# Derive a new column from a mapping (cities missing from the dict become NaN)
city_mapping = {'Mumbai': 'MH', 'Delhi': 'DL', 'Bangalore': 'KA', 'Chennai': 'TN'}
df['state_code'] = df['city'].map(city_mapping)

Adding Rows

# Create a new row as a dictionary
new_order = {
    'order_id': 1006,
    'customer': 'Amit',
    'city': 'Pune',
    'amount': 3500,
    'status': 'delivered'
}

# Append using pd.concat (preferred in modern Pandas)
new_row_df = pd.DataFrame([new_order])
df = pd.concat([df, new_row_df], ignore_index=True)

# Note: df.append() was deprecated in Pandas 1.4.0 and removed in Pandas 2.0

Dropping Columns and Rows

# Drop columns
df = df.drop(columns=['gst', 'total_with_gst'])

# Drop rows by index
df = df.drop([0, 2])  # Drop rows at index 0 and 2

# Drop rows by condition (keep only what you want)
df = df[df['status'] != 'cancelled']  # Remove cancelled orders

Inspecting and Summarizing DataFrames

Before analyzing data, you need to understand what you're working with. These methods give you quick insights.

Basic Information

# First and last rows
df.head()       # First 5 rows
df.head(10)     # First 10 rows
df.tail()       # Last 5 rows

# Random sample
df.sample(5)    # 5 random rows (useful for large datasets)

# Shape: (rows, columns)
print(df.shape)  # (1000, 5) means 1000 rows, 5 columns

# Column names
print(df.columns)  # Index(['order_id', 'customer', 'city', 'amount', 'status'], dtype='object')

# Data types of each column
print(df.dtypes)
# order_id      int64
# customer     object
# city         object
# amount        int64
# status       object

# Memory usage
print(df.memory_usage(deep=True))

# Concise summary
df.info()

Example output of df.info():

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   order_id  1000 non-null   int64
 1   customer  1000 non-null   object
 2   city      998 non-null    object
 3   amount    1000 non-null   int64
 4   status    1000 non-null   object
dtypes: int64(2), object(3)
memory usage: 39.2+ KB

Statistical Summary

# Summary statistics for numeric columns
df.describe()

Output:

          order_id       amount
count  1000.000000  1000.000000
mean   1500.500000  2847.300000
std     288.819436   976.234111
min    1001.000000   500.000000
25%    1250.750000  2100.000000
50%    1500.500000  2800.000000
75%    1750.250000  3600.000000
max    2000.000000  5000.000000
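
By default, describe() summarizes only numeric columns. Passing include lets you profile text columns too:

# Summary for text (object) columns: count, unique values, most frequent value
df.describe(include='object')

# Or summarize every column at once
df.describe(include='all')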

Value Counts and Unique Values

# Count occurrences of each unique value
print(df['city'].value_counts())
# Mumbai       350
# Delhi        280
# Bangalore    220
# Chennai      150

# Unique values
print(df['city'].unique())
# array(['Mumbai', 'Delhi', 'Bangalore', 'Chennai'], dtype=object)

# Number of unique values
print(df['city'].nunique())  # 4

# Check for missing values
print(df.isnull().sum())
# order_id     0
# customer     0
# city         2
# amount       0
# status       0

# Percentage of missing values
print((df.isnull().sum() / len(df)) * 100)
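
Once you've found missing values, the two standard responses are dropping or filling. A minimal sketch:

# Drop rows that are missing any value
df_clean = df.dropna()

# Drop rows only if a specific column is missing
df_clean = df.dropna(subset=['city'])

# Or fill missing values with a placeholder
df['city'] = df['city'].fillna('Unknown')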

Think of it this way...

When Swiggy analysts receive a new dataset of restaurant orders, they run df.info(), df.describe(), and df.isnull().sum() first. This 30-second check reveals data types, missing values, and outliers before spending hours on flawed analysis.
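
That 30-second routine is easy to wrap in a reusable helper. A sketch (the function name is just illustrative):

def first_look(df):
    """Quick profile of a new DataFrame: shape, types, nulls, stats."""
    print(f"Shape: {df.shape[0]} rows x {df.shape[1]} columns")
    df.info()
    print(df.isnull().sum())
    print(df.describe())

first_look(df)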


Sorting and Ranking Data

Sorting reveals patterns: top customers, worst-performing products, chronological order for time-series analysis.

Sorting by Values

# Sort by one column (ascending)
df_sorted = df.sort_values('amount')

# Sort descending
df_sorted = df.sort_values('amount', ascending=False)

# Sort by multiple columns
df_sorted = df.sort_values(['city', 'amount'], ascending=[True, False])
# Sorts by city A-Z, then within each city by amount highest-first

# Sort by index
df_sorted = df.sort_index()

Ranking Values

# Add a rank column (1 = highest amount)
df['amount_rank'] = df['amount'].rank(ascending=False)

# Rank within groups (rank by amount within each city)
df['city_rank'] = df.groupby('city')['amount'].rank(ascending=False)

# Display top 3 orders per city
top_3_per_city = df[df['city_rank'] <= 3].sort_values(['city', 'city_rank'])
print(top_3_per_city[['order_id', 'city', 'amount', 'city_rank']])
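
One gotcha: .rank() averages ties by default, so two equal amounts can both receive rank 1.5, which makes a <= 3 filter behave oddly. The method parameter controls tie-breaking:

# 'first' breaks ties by order of appearance, guaranteeing whole-number ranks
df['city_rank'] = df.groupby('city')['amount'].rank(ascending=False, method='first')

# 'dense' gives ties the same rank without skipping the next one
df['dense_rank'] = df['amount'].rank(ascending=False, method='dense')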

Finding Top N and Bottom N

# Top 10 orders by amount
top_10 = df.nlargest(10, 'amount')

# Bottom 10 orders
bottom_10 = df.nsmallest(10, 'amount')

# Top 5 per city (requires groupby)
top_5_per_city = df.groupby('city').apply(lambda x: x.nlargest(5, 'amount')).reset_index(drop=True)
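
The groupby().apply() version works, but recent Pandas versions emit a deprecation warning about applying functions on the grouping columns. A sort-then-head idiom avoids that and reads cleanly:

# Alternative: sort once by amount, then keep the first 5 rows of each city group
top_5_per_city = (
    df.sort_values('amount', ascending=False)
      .groupby('city')
      .head(5)
)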
