5 min read
Regular Expressions
Learn to find and extract complex text patterns
What are Regular Expressions?
Regular expressions (regex) are patterns that describe the text you want to find. Think of them as a super-powered search.
Instead of finding exact text like "cat", you can find:
- Any 3-letter word
- All phone numbers
- Any email address
Basic Pattern Matching
code.py
import pandas as pd
df = pd.DataFrame({
'Text': ['cat', 'car', 'cut', 'dog']
})
# Find words starting with 'c'
matches = df[df['Text'].str.contains('^c', regex=True)]
print(matches)
Output:
  Text
0  cat
1  car
2  cut
Common Patterns
| Pattern | Meaning | Example |
|---|---|---|
| ^ | Start of text | ^hello |
| $ | End of text | world$ |
| . | Any single character | c.t matches cat, cut |
| \d | Any digit (0-9) | \d\d\d matches 123 |
| \w | Any letter, digit, or underscore | \w+ matches hello |
| \s | Any whitespace character (space, tab, newline) | hello\sworld |
| + | One or more | \d+ matches 1, 12, 123 |
| * | Zero or more | ab* matches a, ab, abb |
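To see a few of these patterns in action, here is a small sketch using Python's built-in re module, which pandas string methods rely on for regex matching. The sample strings are made up for illustration.

```python
import re

# ^ anchors the match at the start of the text, $ at the end.
starts = bool(re.search(r'^hello', 'hello world'))
ends = bool(re.search(r'world$', 'hello world'))

# . stands for exactly one character, so 'ct' is not matched.
three = re.findall(r'c.t', 'cat cut cot ct')

# + means one or more of the preceding item.
digits = re.findall(r'\d+', 'a1 b12 c123')

# * means zero or more of the preceding item.
ab = re.findall(r'ab*', 'a ab abb')

print(starts, ends)  # True True
print(three)         # ['cat', 'cut', 'cot']
print(digits)        # ['1', '12', '123']
print(ab)            # ['a', 'ab', 'abb']
```

Writing patterns as raw strings (the r prefix) keeps Python from treating backslashes such as \d as string escapes.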
Find Numbers
code.py
df = pd.DataFrame({
'Text': ['Order 123', 'Item 456', 'No number here']
})
# Extract numbers
df['Numbers'] = df['Text'].str.extract(r'(\d+)')
print(df)
Output:
             Text Numbers
0       Order 123     123
1        Item 456     456
2  No number here     NaN
Find Email Addresses
code.py
df = pd.DataFrame({
'Text': ['Contact: john@gmail.com', 'Email sarah@yahoo.com', 'No email']
})
# Extract emails
df['Email'] = df['Text'].str.extract(r'(\w+@\w+\.\w+)')
print(df)
Output:
                      Text            Email
0  Contact: john@gmail.com   john@gmail.com
1    Email sarah@yahoo.com  sarah@yahoo.com
2                 No email              NaN
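The email pattern above is deliberately simple: \w does not match dots or hyphens, so an address such as first.last@mail.example.com is only partly captured. The sketch below compares it with a slightly broader pattern. That broader pattern is an illustrative assumption, not a complete email validator, and the sample address is made up.

```python
import pandas as pd

df = pd.DataFrame({
    'Text': ['Write to first.last@mail.example.com', 'No email']
})

# The simple pattern cannot cross a dot before the @
# and allows only one dot after it.
df['Simple'] = df['Text'].str.extract(r'(\w+@\w+\.\w+)', expand=False)

# Illustrative broader pattern (an assumption, not a full validator):
# allows dots, plus signs, and hyphens before the @, and several
# dot-separated parts after it.
df['Broader'] = df['Text'].str.extract(
    r'([\w.+-]+@[\w-]+(?:\.[\w-]+)+)', expand=False
)

print(df[['Simple', 'Broader']])
```

Here Simple holds last@mail.example for the first row, while Broader holds the full address. Rows without a match get NaN in both columns.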
Find Phone Numbers
code.py
df = pd.DataFrame({
'Text': ['Call 123-456-7890', 'Phone: 987-654-3210']
})
# Extract phone numbers
df['Phone'] = df['Text'].str.extract(r'(\d{3}-\d{3}-\d{4})')
print(df)
\d{3} means exactly 3 digits.
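The curly-brace quantifier also accepts a range. As a small sketch with Python's built-in re module (sample strings made up for illustration), {3} means exactly three, {2,4} means between two and four, and {2,} means two or more:

```python
import re

# fullmatch succeeds only if the whole string fits the pattern.
exactly_three = [bool(re.fullmatch(r'\d{3}', s)) for s in ['12', '123', '1234']]
two_to_four = [bool(re.fullmatch(r'\d{2,4}', s)) for s in ['1', '12', '1234', '12345']]
two_or_more = [bool(re.fullmatch(r'\d{2,}', s)) for s in ['1', '12', '12345']]

print(exactly_three)  # [False, True, False]
print(two_to_four)    # [False, True, True, False]
print(two_or_more)    # [False, True, True]
```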
Replace Using Patterns
code.py
df = pd.DataFrame({
'Phone': ['123-456-7890', '987-654-3210']
})
# Remove all non-digits
df['Clean'] = df['Phone'].str.replace(r'\D', '', regex=True)
print(df)
Output:
          Phone       Clean
0  123-456-7890  1234567890
1  987-654-3210  9876543210
\D means "not a digit".
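Replacement can also reuse parts of the match. In this sketch, each pair of parentheses captures one part of the phone number, and \1, \2, \3 in the replacement text refer back to those parts. The target format "(123) 456-7890" is just an example choice.

```python
import pandas as pd

df = pd.DataFrame({
    'Phone': ['123-456-7890', '987-654-3210']
})

# Capture the three digit groups, then rebuild the number
# in a different layout using backreferences.
df['Formatted'] = df['Phone'].str.replace(
    r'(\d{3})-(\d{3})-(\d{4})', r'(\1) \2-\3', regex=True
)

print(df['Formatted'].tolist())
# ['(123) 456-7890', '(987) 654-3210']
```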
Check Pattern Exists
code.py
df = pd.DataFrame({
'Text': ['abc123', 'hello', '456def']
})
# Check if contains any digit
df['Has_Number'] = df['Text'].str.contains(r'\d', regex=True)
print(df)
Useful Patterns Cheat Sheet
| What to Find | Pattern |
|---|---|
| Any number | \d+ |
| Any word | \w+ |
| Email | \w+@\w+\.\w+ |
| Phone (US) | \d{3}-\d{3}-\d{4} |
| Starts with a letter | ^[A-Za-z] |
| Ends with a digit | \d$ |
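Cheat-sheet patterns can be combined as filters. A minimal sketch, assuming a small made-up Series, that keeps only values starting with a letter and ending with a digit:

```python
import pandas as pd

s = pd.Series(['abc123', 'hello', '456def'])

# Each str.contains call returns a True/False value per row.
starts_with_letter = s.str.contains(r'^[A-Za-z]', regex=True)
ends_with_digit = s.str.contains(r'\d$', regex=True)

# Combine the two checks with & to keep rows that pass both.
result = s[starts_with_letter & ends_with_digit]
print(result.tolist())  # ['abc123']
```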
Key Points
- Regex finds patterns, not exact text
- Use str.extract() to pull out matches
- Use str.contains() to check if pattern exists
- Use str.replace() with regex to clean data
- Patterns look complex but follow simple rules
What's Next?
Learn to work with dates and times in your data.