Filtering Data

Basic Filtering

Filter rows based on conditions.

code.py

import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Sarah', 'Mike', 'Emma', 'David'],
    'Age': [25, 30, 28, 32, 27],
    'Salary': [50000, 75000, 55000, 80000, 60000],
    'City': ['NYC', 'LA', 'Chicago', 'NYC', 'LA']
})

print("All data:")
print(df)
print()

high_earners = df[df['Salary'] > 60000]
print("High earners:")
print(high_earners)

How it works:

df['Salary'] > 60000 creates True/False for each row
df[...] selects rows where True

Comparison Operators

code.py

equals = df[df['City'] == 'NYC']
not_equals = df[df['City'] != 'NYC']
greater = df[df['Age'] > 28]
greater_equal = df[df['Age'] >= 30]
less = df[df['Salary'] < 60000]
less_equal = df[df['Salary'] <= 55000]

Multiple Conditions

AND (&)

Both conditions must be true.

code.py

result = df[(df['Age'] > 25) & (df['Salary'] > 55000)]
print(result)

Important: Use & not 'and'. Use parentheses!

OR (|)

Either condition can be true.

code.py

result = df[(df['City'] == 'NYC') | (df['City'] == 'LA')]
print(result)

NOT (~)

Opposite of condition.

code.py

not_nyc = df[~(df['City'] == 'NYC')]
print(not_nyc)

Complex Conditions

code.py

result = df[
    ((df['Age'] >= 28) & (df['Salary'] > 50000)) |
    (df['City'] == 'NYC')
]
print(result)

What this means: (Age 28+ AND Salary 50000+) OR City is NYC

isin() Method

Check if value in list.

code.py

cities = ['NYC', 'LA']
result = df[df['City'].isin(cities)]
print(result)

Easier than multiple OR conditions!

Not in list:

code.py

result = df[~df['City'].isin(cities)]

between() Method

code.py

mid_salary = df[df['Salary'].between(50000, 70000)]
print(mid_salary)

Includes both endpoints.

Exclude endpoints:

code.py

result = df[df['Salary'].between(50000, 70000, inclusive='neither')]

String Filtering

Contains

code.py

df_emails = pd.DataFrame({
    'Email': ['john@gmail.com', 'sarah@yahoo.com', 'mike@gmail.com']
})

gmail_users = df_emails[df_emails['Email'].str.contains('gmail')]
print(gmail_users)

Starts with

code.py

df_names = pd.DataFrame({
    'Name': ['John', 'Jane', 'Mike', 'Mary']
})

j_names = df_names[df_names['Name'].str.startswith('J')]
print(j_names)

Ends with

code.py

com_emails = df_emails[df_emails['Email'].str.endswith('.com')]

Case-insensitive

code.py

result = df[df['City'].str.lower() == 'nyc']

Null Value Filtering

Filter null rows

code.py

df_with_nulls = pd.DataFrame({
    'A': [1, 2, None, 4],
    'B': [5, None, 7, 8]
})

has_null_a = df_with_nulls[df_with_nulls['A'].isnull()]
print("Rows with null in A:")
print(has_null_a)

Filter non-null rows

code.py

no_null_a = df_with_nulls[df_with_nulls['A'].notnull()]
print("Rows without null in A:")
print(no_null_a)

Drop any nulls

code.py

clean = df_with_nulls.dropna()
print(clean)

Practice Example

The scenario: Analyze employee database with filters.

code.py

import pandas as pd

employees = pd.DataFrame({
    'Name': ['John', 'Sarah', 'Mike', 'Emma', 'David', 'Lisa', 'Tom', 'Anna'],
    'Department': ['Sales', 'IT', 'Sales', 'HR', 'IT', 'Sales', 'HR', 'IT'],
    'Salary': [50000, 85000, 55000, 65000, 90000, 52000, 60000, 80000],
    'Years': [3, 8, 4, 6, 10, 2, 5, 7],
    'Remote': [False, True, False, True, True, False, True, True]
})

print("All employees:")
print(employees)
print()

print("1. IT Department:")
it_dept = employees[employees['Department'] == 'IT']
print(it_dept)
print()

print("2. High earners (70000+):")
high_earners = employees[employees['Salary'] >= 70000]
print(high_earners)
print()

print("3. Sales OR HR:")
sales_hr = employees[employees['Department'].isin(['Sales', 'HR'])]
print(sales_hr)
print()

print("4. Remote workers with 5+ years:")
experienced_remote = employees[
    (employees['Remote'] == True) & (employees['Years'] >= 5)
]
print(experienced_remote)
print()

print("5. Mid-range salary (55k-75k):")
mid_range = employees[employees['Salary'].between(55000, 75000)]
print(mid_range)
print()

print("6. Not in Sales, salary > 60k:")
result = employees[
    (employees['Department'] != 'Sales') & (employees['Salary'] > 60000)
]
print(result)
print()

print("7. Names starting with 'S' or 'E':")
names_s_or_e = employees[
    employees['Name'].str.startswith('S') | employees['Name'].str.startswith('E')
]
print(names_s_or_e)
print()

print("Summary:")
print("Total IT staff:", len(it_dept))
print("Average salary (high earners):", high_earners['Salary'].mean())
print("Remote workers:", employees['Remote'].sum())

Query Method

Alternative syntax.

code.py

result = df.query('Age > 28 and Salary > 50000')
print(result)

Same as:

code.py

result = df[(df['Age'] > 28) & (df['Salary'] > 50000)]

With variables:

code.py

min_age = 28
result = df.query('Age > @min_age')

Filter with Index

code.py

df_indexed = df.set_index('Name')
result = df_indexed.loc[['John', 'Sarah']]
print(result)

Filter and Select Columns

code.py

names = df[df['Salary'] > 60000]['Name']
print(names)

Multiple columns:

code.py

subset = df[df['Salary'] > 60000][['Name', 'Salary']]
print(subset)

Count Filtered Rows

code.py

count = len(df[df['Age'] > 28])
print("Rows where age > 28:", count)

count = (df['Salary'] > 60000).sum()
print("High earners:", count)

Percentage of Filtered Rows

code.py

pct = (df['Salary'] > 60000).mean() * 100
print("Percent high earners:", pct)

Filter with Functions

code.py

def is_senior(row):
    return row['Age'] > 30 and row['Years'] > 5

seniors = df[df.apply(is_senior, axis=1)]
print(seniors)

Key Points to Remember

Use comparison operators with brackets: df[df['Age'] > 25]

Multiple conditions need & (AND), | (OR), ~ (NOT) with parentheses.

isin() checks if value in list. Better than multiple OR conditions.

between() checks range. Includes both endpoints by default.

String methods need .str: df['Name'].str.contains('text')

Common Mistakes

Mistake 1: Using 'and' instead of &

code.py

df[(df['Age'] > 25) and (df['Salary'] > 50000)]  # Error!
df[(df['Age'] > 25) & (df['Salary'] > 50000)]  # Correct

Mistake 2: Missing parentheses

code.py

df[df['Age'] > 25 & df['Salary'] > 50000]  # Error!
df[(df['Age'] > 25) & (df['Salary'] > 50000)]  # Correct

Mistake 3: Single = instead of ==

code.py

df[df['City'] = 'NYC']  # Error!
df[df['City'] == 'NYC']  # Correct

Mistake 4: Forgetting .str for strings

code.py

df[df['Name'].contains('John')]  # Error!
df[df['Name'].str.contains('John')]  # Correct

Mistake 5: Not handling NaN in string operations

code.py

df[df['Name'].str.contains('John')]  # Error if NaN exists
df[df['Name'].str.contains('John', na=False)]  # Correct

What's Next?

You now know how to filter data with conditions. Next, you'll learn about conditional filtering - advanced filtering with multiple conditions and custom logic.