Filtering Data
Learn to filter DataFrames based on conditions and values
Filtering Data
Basic Filtering
Filter rows based on conditions.
import pandas as pd
df = pd.DataFrame({
'Name': ['John', 'Sarah', 'Mike', 'Emma', 'David'],
'Age': [25, 30, 28, 32, 27],
'Salary': [50000, 75000, 55000, 80000, 60000],
'City': ['NYC', 'LA', 'Chicago', 'NYC', 'LA']
})
print("All data:")
print(df)
print()
high_earners = df[df['Salary'] > 60000]
print("High earners:")
print(high_earners)How it works:
- df['Salary'] > 60000 creates True/False for each row
- df[...] selects rows where True
Comparison Operators
equals = df[df['City'] == 'NYC']
not_equals = df[df['City'] != 'NYC']
greater = df[df['Age'] > 28]
greater_equal = df[df['Age'] >= 30]
less = df[df['Salary'] < 60000]
less_equal = df[df['Salary'] <= 55000]Multiple Conditions
AND (&)
Both conditions must be true.
result = df[(df['Age'] > 25) & (df['Salary'] > 55000)]
print(result)Important: Use & not 'and'. Use parentheses!
OR (|)
Either condition can be true.
result = df[(df['City'] == 'NYC') | (df['City'] == 'LA')]
print(result)NOT (~)
Opposite of condition.
not_nyc = df[~(df['City'] == 'NYC')]
print(not_nyc)Complex Conditions
result = df[
((df['Age'] >= 28) & (df['Salary'] > 50000)) |
(df['City'] == 'NYC')
]
print(result)What this means: (Age 28+ AND Salary 50000+) OR City is NYC
isin() Method
Check if value in list.
cities = ['NYC', 'LA']
result = df[df['City'].isin(cities)]
print(result)Easier than multiple OR conditions!
Not in list:
result = df[~df['City'].isin(cities)]between() Method
mid_salary = df[df['Salary'].between(50000, 70000)]
print(mid_salary)Includes both endpoints.
Exclude endpoints:
result = df[df['Salary'].between(50000, 70000, inclusive='neither')]String Filtering
Contains
df_emails = pd.DataFrame({
'Email': ['john@gmail.com', 'sarah@yahoo.com', 'mike@gmail.com']
})
gmail_users = df_emails[df_emails['Email'].str.contains('gmail')]
print(gmail_users)Starts with
df_names = pd.DataFrame({
'Name': ['John', 'Jane', 'Mike', 'Mary']
})
j_names = df_names[df_names['Name'].str.startswith('J')]
print(j_names)Ends with
com_emails = df_emails[df_emails['Email'].str.endswith('.com')]Case-insensitive
result = df[df['City'].str.lower() == 'nyc']Null Value Filtering
Filter null rows
df_with_nulls = pd.DataFrame({
'A': [1, 2, None, 4],
'B': [5, None, 7, 8]
})
has_null_a = df_with_nulls[df_with_nulls['A'].isnull()]
print("Rows with null in A:")
print(has_null_a)Filter non-null rows
no_null_a = df_with_nulls[df_with_nulls['A'].notnull()]
print("Rows without null in A:")
print(no_null_a)Drop any nulls
clean = df_with_nulls.dropna()
print(clean)Practice Example
The scenario: Analyze employee database with filters.
import pandas as pd
employees = pd.DataFrame({
'Name': ['John', 'Sarah', 'Mike', 'Emma', 'David', 'Lisa', 'Tom', 'Anna'],
'Department': ['Sales', 'IT', 'Sales', 'HR', 'IT', 'Sales', 'HR', 'IT'],
'Salary': [50000, 85000, 55000, 65000, 90000, 52000, 60000, 80000],
'Years': [3, 8, 4, 6, 10, 2, 5, 7],
'Remote': [False, True, False, True, True, False, True, True]
})
print("All employees:")
print(employees)
print()
print("1. IT Department:")
it_dept = employees[employees['Department'] == 'IT']
print(it_dept)
print()
print("2. High earners (70000+):")
high_earners = employees[employees['Salary'] >= 70000]
print(high_earners)
print()
print("3. Sales OR HR:")
sales_hr = employees[employees['Department'].isin(['Sales', 'HR'])]
print(sales_hr)
print()
print("4. Remote workers with 5+ years:")
experienced_remote = employees[
(employees['Remote'] == True) & (employees['Years'] >= 5)
]
print(experienced_remote)
print()
print("5. Mid-range salary (55k-75k):")
mid_range = employees[employees['Salary'].between(55000, 75000)]
print(mid_range)
print()
print("6. Not in Sales, salary > 60k:")
result = employees[
(employees['Department'] != 'Sales') & (employees['Salary'] > 60000)
]
print(result)
print()
print("7. Names starting with 'S' or 'E':")
names_s_or_e = employees[
employees['Name'].str.startswith('S') | employees['Name'].str.startswith('E')
]
print(names_s_or_e)
print()
print("Summary:")
print("Total IT staff:", len(it_dept))
print("Average salary (high earners):", high_earners['Salary'].mean())
print("Remote workers:", employees['Remote'].sum())Query Method
Alternative syntax.
result = df.query('Age > 28 and Salary > 50000')
print(result)Same as:
result = df[(df['Age'] > 28) & (df['Salary'] > 50000)]With variables:
min_age = 28
result = df.query('Age > @min_age')Filter with Index
df_indexed = df.set_index('Name')
result = df_indexed.loc[['John', 'Sarah']]
print(result)Filter and Select Columns
names = df[df['Salary'] > 60000]['Name']
print(names)Multiple columns:
subset = df[df['Salary'] > 60000][['Name', 'Salary']]
print(subset)Count Filtered Rows
count = len(df[df['Age'] > 28])
print("Rows where age > 28:", count)
count = (df['Salary'] > 60000).sum()
print("High earners:", count)Percentage of Filtered Rows
pct = (df['Salary'] > 60000).mean() * 100
print("Percent high earners:", pct)Filter with Functions
def is_senior(row):
return row['Age'] > 30 and row['Years'] > 5
seniors = df[df.apply(is_senior, axis=1)]
print(seniors)Key Points to Remember
Use comparison operators with brackets: df[df['Age'] > 25]
Multiple conditions need & (AND), | (OR), ~ (NOT) with parentheses.
isin() checks if value in list. Better than multiple OR conditions.
between() checks range. Includes both endpoints by default.
String methods need .str: df['Name'].str.contains('text')
Common Mistakes
Mistake 1: Using 'and' instead of &
df[(df['Age'] > 25) and (df['Salary'] > 50000)] # Error!
df[(df['Age'] > 25) & (df['Salary'] > 50000)] # CorrectMistake 2: Missing parentheses
df[df['Age'] > 25 & df['Salary'] > 50000] # Error!
df[(df['Age'] > 25) & (df['Salary'] > 50000)] # CorrectMistake 3: Single = instead of ==
df[df['City'] = 'NYC'] # Error!
df[df['City'] == 'NYC'] # CorrectMistake 4: Forgetting .str for strings
df[df['Name'].contains('John')] # Error!
df[df['Name'].str.contains('John')] # CorrectMistake 5: Not handling NaN in string operations
df[df['Name'].str.contains('John')] # Error if NaN exists
df[df['Name'].str.contains('John', na=False)] # CorrectWhat's Next?
You now know how to filter data with conditions. Next, you'll learn about conditional filtering - advanced filtering with multiple conditions and custom logic.