Sorting Data
Learn to sort DataFrames by values, multiple columns, and custom order
Sorting Data
Sort by Single Column
import pandas as pd
df = pd.DataFrame({
'Name': ['John', 'Sarah', 'Mike', 'Emma'],
'Age': [25, 30, 28, 32],
'Salary': [50000, 75000, 55000, 80000]
})
sorted_df = df.sort_values('Age')
print(sorted_df)Output:
Name Age Salary
0 John 25 50000
2 Mike 28 55000
1 Sarah 30 75000
3 Emma 32 80000
Sorted by Age, lowest to highest.
Descending Order
sorted_df = df.sort_values('Salary', ascending=False)
print(sorted_df)Highest salary first.
Sort by Multiple Columns
df = pd.DataFrame({
'Department': ['Sales', 'IT', 'Sales', 'IT', 'HR'],
'Name': ['John', 'Sarah', 'Mike', 'Emma', 'David'],
'Salary': [50000, 75000, 55000, 75000, 60000]
})
sorted_df = df.sort_values(['Department', 'Salary'])
print(sorted_df)First by Department, then by Salary within each department.
Mixed Sort Order
sorted_df = df.sort_values(
['Department', 'Salary'],
ascending=[True, False]
)
print(sorted_df)Department ascending, Salary descending.
Sort in Place
df.sort_values('Age', inplace=True)
print(df)Modifies original DataFrame.
Sort by Index
df_shuffled = df.sample(frac=1)
print("Shuffled:")
print(df_shuffled)
df_sorted = df_shuffled.sort_index()
print("Sorted by index:")
print(df_sorted)Restores original index order.
Sort Index Descending
sorted_df = df.sort_index(ascending=False)
print(sorted_df)Sort Columns
Sort column names alphabetically.
sorted_df = df.sort_index(axis=1)
print(sorted_df)axis=1 means sort column names.
Practice Example
The scenario: Analyze and sort sales data.
import pandas as pd
sales = pd.DataFrame({
'Region': ['North', 'South', 'North', 'West', 'South', 'East'],
'Product': ['Laptop', 'Phone', 'Tablet', 'Laptop', 'Laptop', 'Phone'],
'Quantity': [5, 10, 7, 3, 8, 12],
'Revenue': [4995, 5990, 2793, 2997, 7992, 7188]
})
print("Original data:")
print(sales)
print()
print("1. Sort by Revenue (highest first):")
by_revenue = sales.sort_values('Revenue', ascending=False)
print(by_revenue)
print()
print("2. Sort by Region, then Revenue:")
by_region_revenue = sales.sort_values(['Region', 'Revenue'], ascending=[True, False])
print(by_region_revenue)
print()
print("3. Sort by Quantity (lowest first):")
by_quantity = sales.sort_values('Quantity')
print(by_quantity)
print()
print("4. Top 3 sales by revenue:")
top_3 = sales.sort_values('Revenue', ascending=False).head(3)
print(top_3)
print()
print("5. Sort by Product, then Quantity descending:")
by_product = sales.sort_values(['Product', 'Quantity'], ascending=[True, False])
print(by_product)Handle Missing Values
df = pd.DataFrame({
'Name': ['John', 'Sarah', 'Mike'],
'Score': [85, None, 92]
})
sorted_first = df.sort_values('Score', na_position='first')
print("NaN first:")
print(sorted_first)
sorted_last = df.sort_values('Score', na_position='last')
print("NaN last:")
print(sorted_last)Default is na_position='last'.
Sort with Custom Key
Sort by string length.
df = pd.DataFrame({
'Name': ['John', 'Alexander', 'Em', 'Sarah']
})
sorted_df = df.sort_values('Name', key=lambda x: x.str.len())
print(sorted_df)Sorts by name length, not alphabetically.
Case-Insensitive Sort
df = pd.DataFrame({
'Name': ['john', 'Sarah', 'MIKE', 'emma']
})
sorted_df = df.sort_values('Name', key=lambda x: x.str.lower())
print(sorted_df)Sort and Reset Index
sorted_df = df.sort_values('Age').reset_index(drop=True)
print(sorted_df)Creates new 0, 1, 2... index.
Get Top N Values
nlargest()
top_3 = df.nlargest(3, 'Salary')
print(top_3)3 highest salaries.
nsmallest()
bottom_3 = df.nsmallest(3, 'Salary')
print(bottom_3)3 lowest salaries.
Faster than sort_values() + head()!
Sort with Ties
df = pd.DataFrame({
'Name': ['John', 'Sarah', 'Mike', 'Emma'],
'Score': [85, 90, 85, 90]
})
sorted_df = df.sort_values(['Score', 'Name'], ascending=[False, True])
print(sorted_df)Breaks ties by Name.
Sort by Computed Column
df['Score_Adjusted'] = df['Score'] * 1.1
sorted_df = df.sort_values('Score_Adjusted', ascending=False)
print(sorted_df)Sorting Performance
For small data: All methods work fine.
For large data:
- nlargest/nsmallest faster than sort + head
- sort_values with single column faster than multiple
- Sorting large DataFrames is memory intensive
Preserve Original Order
df['original_index'] = range(len(df))
df_sorted = df.sort_values('Age')
# Later restore
df_restored = df_sorted.sort_values('original_index')Sort by Absolute Value
df = pd.DataFrame({
'Value': [10, -50, 30, -20]
})
sorted_df = df.sort_values('Value', key=abs)
print(sorted_df)Sorts by absolute value, not actual value.
Key Points to Remember
sort_values('column') sorts by column values, ascending by default.
ascending=False for descending order (highest to lowest).
Sort multiple columns with list: sort_values(['Col1', 'Col2']).
inplace=True modifies original DataFrame. Default creates copy.
nlargest() and nsmallest() faster than sort + head for top/bottom N.
sort_index() sorts by index values, not column values.
Common Mistakes
Mistake 1: Forgetting assignment
df.sort_values('Age') # Doesn't change df!
df = df.sort_values('Age') # Correct
# OR
df.sort_values('Age', inplace=True) # Also correctMistake 2: Wrong column name
df.sort_values('age') # Error if column is 'Age'
print(df.columns.tolist()) # Check names firstMistake 3: Mixed ascending with single column
df.sort_values('Age', ascending=[True, False]) # Error!
df.sort_values('Age', ascending=False) # CorrectMistake 4: Not handling NaN
df.sort_values('Score') # NaN goes to end
# Specify if needed:
df.sort_values('Score', na_position='first')Mistake 5: Inefficient sorting
top_10 = df.sort_values('Score', ascending=False).head(10) # Slow
top_10 = df.nlargest(10, 'Score') # FastWhat's Next?
You now know how to sort DataFrames. Next, you'll learn about adding and removing columns - modifying DataFrame structure dynamically.