#1 Data Analytics Program in India
₹2,499₹1,499Enroll Now
6 min read min read

GroupBy Basics

Learn to group data and calculate summaries for each group

GroupBy Basics

What is GroupBy?

GroupBy is like sorting papers into piles and then counting each pile.

Imagine you have sales data. GroupBy lets you answer:

  • How much did each store sell?
  • What's the average sale per product?
  • How many orders per customer?

Basic GroupBy Example

code.py
import pandas as pd

# Sales data
sales = pd.DataFrame({
    'Store': ['NYC', 'NYC', 'LA', 'LA', 'LA'],
    'Product': ['Apple', 'Banana', 'Apple', 'Banana', 'Apple'],
    'Sales': [100, 150, 120, 180, 90]
})

print(sales)

Output:

Store Product Sales 0 NYC Apple 100 1 NYC Banana 150 2 LA Apple 120 3 LA Banana 180 4 LA Apple 90

Now let's group by Store:

code.py
# Group by store and get total sales
store_sales = sales.groupby('Store')['Sales'].sum()
print(store_sales)

Output:

Store LA 390 NYC 250 Name: Sales, dtype: int64

LA sold 390, NYC sold 250.

How GroupBy Works

GroupBy has 3 steps:

  1. Split - Divide data into groups
  2. Apply - Do something to each group (sum, mean, count)
  3. Combine - Put results back together
Original Data → Split by Store → Calculate Sum → Result NYC: 100, 150 → Sum = 250 LA: 120, 180, 90 → Sum = 390

Common GroupBy Operations

Sum - Total for Each Group

code.py
total_sales = sales.groupby('Store')['Sales'].sum()
print(total_sales)

Mean - Average for Each Group

code.py
avg_sales = sales.groupby('Store')['Sales'].mean()
print(avg_sales)

Output:

Store LA 130.0 NYC 125.0 Name: Sales, dtype: float64

Count - How Many in Each Group

code.py
count_sales = sales.groupby('Store')['Sales'].count()
print(count_sales)

Output:

Store LA 3 NYC 2 Name: Sales, dtype: int64

Min and Max

code.py
min_sales = sales.groupby('Store')['Sales'].min()
max_sales = sales.groupby('Store')['Sales'].max()

print("Minimum:")
print(min_sales)
print("\nMaximum:")
print(max_sales)

GroupBy on Multiple Columns

code.py
# Group by both Store and Product
grouped = sales.groupby(['Store', 'Product'])['Sales'].sum()
print(grouped)

Output:

Store Product LA Apple 210 Banana 180 NYC Apple 100 Banana 150 Name: Sales, dtype: int64

LA sold 210 worth of Apples.

Get Result as DataFrame

By default, groupby returns a Series. To get a DataFrame:

code.py
# Method 1: reset_index()
result = sales.groupby('Store')['Sales'].sum().reset_index()
print(result)

Output:

Store Sales 0 LA 390 1 NYC 250
code.py
# Method 2: as_index=False
result = sales.groupby('Store', as_index=False)['Sales'].sum()
print(result)

Same output.

Practice Example

code.py
import pandas as pd

# Employee data
employees = pd.DataFrame({
    'Department': ['Sales', 'Sales', 'IT', 'IT', 'HR'],
    'Name': ['John', 'Sarah', 'Mike', 'Emma', 'Tom'],
    'Salary': [50000, 55000, 70000, 65000, 45000]
})

print("Employee Data:")
print(employees)

# Total salary by department
print("\nTotal salary by department:")
print(employees.groupby('Department')['Salary'].sum())

# Average salary by department
print("\nAverage salary by department:")
print(employees.groupby('Department')['Salary'].mean())

# Number of employees per department
print("\nEmployees per department:")
print(employees.groupby('Department')['Name'].count())

Key Points

  • df.groupby('column') groups data by that column
  • Add ['column'].sum() or .mean() to calculate
  • Common operations: sum(), mean(), count(), min(), max()
  • Use reset_index() to get DataFrame result
  • Group by multiple columns: groupby(['col1', 'col2'])

Common Mistakes

Mistake 1: Forgetting to select a column for calculation

code.py
# This groups but doesn't calculate anything useful
sales.groupby('Store')

# Add what you want to calculate
sales.groupby('Store')['Sales'].sum()

Mistake 2: Using wrong column for calculation

code.py
# Can't sum text columns
sales.groupby('Store')['Product'].sum()  # Wrong!

# Sum numeric columns
sales.groupby('Store')['Sales'].sum()    # Correct

What's Next?

You learned basic groupby. Next, you'll learn about multiple aggregations - getting sum, mean, and count all at once.

SkillsetMaster - AI, Web Development & Data Analytics Courses