6 min read min read
GroupBy Basics
Learn to group data and calculate summaries for each group
GroupBy Basics
What is GroupBy?
GroupBy is like sorting papers into piles and then counting each pile.
Imagine you have sales data. GroupBy lets you answer:
- How much did each store sell?
- What's the average sale per product?
- How many orders per customer?
Basic GroupBy Example
code.py
import pandas as pd
# Sales data
sales = pd.DataFrame({
'Store': ['NYC', 'NYC', 'LA', 'LA', 'LA'],
'Product': ['Apple', 'Banana', 'Apple', 'Banana', 'Apple'],
'Sales': [100, 150, 120, 180, 90]
})
print(sales)Output:
Store Product Sales
0 NYC Apple 100
1 NYC Banana 150
2 LA Apple 120
3 LA Banana 180
4 LA Apple 90
Now let's group by Store:
code.py
# Group by store and get total sales
store_sales = sales.groupby('Store')['Sales'].sum()
print(store_sales)Output:
Store
LA 390
NYC 250
Name: Sales, dtype: int64
LA sold 390, NYC sold 250.
How GroupBy Works
GroupBy has 3 steps:
- Split - Divide data into groups
- Apply - Do something to each group (sum, mean, count)
- Combine - Put results back together
Original Data → Split by Store → Calculate Sum → Result
NYC: 100, 150 → Sum = 250
LA: 120, 180, 90 → Sum = 390
Common GroupBy Operations
Sum - Total for Each Group
code.py
total_sales = sales.groupby('Store')['Sales'].sum()
print(total_sales)Mean - Average for Each Group
code.py
avg_sales = sales.groupby('Store')['Sales'].mean()
print(avg_sales)Output:
Store
LA 130.0
NYC 125.0
Name: Sales, dtype: float64
Count - How Many in Each Group
code.py
count_sales = sales.groupby('Store')['Sales'].count()
print(count_sales)Output:
Store
LA 3
NYC 2
Name: Sales, dtype: int64
Min and Max
code.py
min_sales = sales.groupby('Store')['Sales'].min()
max_sales = sales.groupby('Store')['Sales'].max()
print("Minimum:")
print(min_sales)
print("\nMaximum:")
print(max_sales)GroupBy on Multiple Columns
code.py
# Group by both Store and Product
grouped = sales.groupby(['Store', 'Product'])['Sales'].sum()
print(grouped)Output:
Store Product
LA Apple 210
Banana 180
NYC Apple 100
Banana 150
Name: Sales, dtype: int64
LA sold 210 worth of Apples.
Get Result as DataFrame
By default, groupby returns a Series. To get a DataFrame:
code.py
# Method 1: reset_index()
result = sales.groupby('Store')['Sales'].sum().reset_index()
print(result)Output:
Store Sales
0 LA 390
1 NYC 250
code.py
# Method 2: as_index=False
result = sales.groupby('Store', as_index=False)['Sales'].sum()
print(result)Same output.
Practice Example
code.py
import pandas as pd
# Employee data
employees = pd.DataFrame({
'Department': ['Sales', 'Sales', 'IT', 'IT', 'HR'],
'Name': ['John', 'Sarah', 'Mike', 'Emma', 'Tom'],
'Salary': [50000, 55000, 70000, 65000, 45000]
})
print("Employee Data:")
print(employees)
# Total salary by department
print("\nTotal salary by department:")
print(employees.groupby('Department')['Salary'].sum())
# Average salary by department
print("\nAverage salary by department:")
print(employees.groupby('Department')['Salary'].mean())
# Number of employees per department
print("\nEmployees per department:")
print(employees.groupby('Department')['Name'].count())Key Points
df.groupby('column')groups data by that column- Add
['column'].sum()or.mean()to calculate - Common operations: sum(), mean(), count(), min(), max()
- Use
reset_index()to get DataFrame result - Group by multiple columns:
groupby(['col1', 'col2'])
Common Mistakes
Mistake 1: Forgetting to select a column for calculation
code.py
# This groups but doesn't calculate anything useful
sales.groupby('Store')
# Add what you want to calculate
sales.groupby('Store')['Sales'].sum()Mistake 2: Using wrong column for calculation
code.py
# Can't sum text columns
sales.groupby('Store')['Product'].sum() # Wrong!
# Sum numeric columns
sales.groupby('Store')['Sales'].sum() # CorrectWhat's Next?
You learned basic groupby. Next, you'll learn about multiple aggregations - getting sum, mean, and count all at once.