Regular Expressions

Learn to find and extract complex text patterns

What are Regular Expressions?

Regular expressions (regex) are patterns that describe the text you want to find. They work like a more powerful version of ordinary search.

Instead of finding exact text like "cat", you can find:

  • Any 3-letter word
  • All phone numbers
  • Any email address
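
As a rough illustration of the first idea in the list, here is a minimal sketch using Python's built-in re module. The sample sentence is made up for this example, and the pattern treats a "word" as plain ASCII letters only, which is an assumption rather than a general definition.

```python
import re

# Hypothetical sample sentence, used only for illustration.
text = "The cat sat on a mat with a dog"

# An exact search only answers "is this literal string present?"
print("cat" in text)  # True

# \b is a word boundary and [A-Za-z]{3} is exactly three letters,
# so this pattern matches every standalone three-letter word.
three_letter_words = re.findall(r"\b[A-Za-z]{3}\b", text)
print(three_letter_words)  # ['The', 'cat', 'sat', 'mat', 'dog']
```

The exact search finds one fixed string, while the single pattern finds every word of the right shape.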

Basic Pattern Matching

code.py
import pandas as pd

df = pd.DataFrame({
    'Text': ['cat', 'car', 'cut', 'dog']
})

# Find words starting with 'c'
matches = df[df['Text'].str.contains('^c', regex=True)]
print(matches)

Output:

  Text
0  cat
1  car
2  cut

Common Patterns

Pattern   Meaning                                    Example
^         Start of text                              ^hello
$         End of text                                world$
.         Any single character (except a newline)    c.t matches cat, cut
\d        Any digit (0-9)                            \d\d\d matches 123
\w        Any letter, digit, or underscore           \w+ matches hello
\s        Any whitespace (space, tab, newline)       hello\sworld
+         One or more of the preceding item          \d+ matches 1, 12, 123
*         Zero or more of the preceding item         ab* matches a, ab, abb
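
The patterns above can be tried one at a time with Python's built-in re module. This is a small sketch: first_match is a hypothetical helper written for this example (it is not part of re or pandas), and the sample strings are made up.

```python
import re

def first_match(pattern, text):
    """Return the first substring that matches pattern, or None if there is no match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

print(first_match(r"^hello", "hello world"))        # hello
print(first_match(r"^hello", "say hello"))          # None (not at the start)
print(first_match(r"world$", "hello world"))        # world
print(first_match(r"c.t", "a cut above"))           # cut
print(first_match(r"\d\d\d", "Order 123"))          # 123
print(first_match(r"\w+", "hello world"))           # hello
print(first_match(r"hello\sworld", "hello world"))  # hello world
print(first_match(r"\d+", "Item 12 of 123"))        # 12
print(first_match(r"ab*", "xabbby"))                # abbb
```

Note the r prefix on every pattern: a raw string stops Python from treating the backslash in \d or \s as its own escape character.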

Find Numbers

code.py
import pandas as pd

df = pd.DataFrame({
    'Text': ['Order 123', 'Item 456', 'No number here']
})

# Extract the first run of digits from each row (NaN when there is none)
df['Numbers'] = df['Text'].str.extract(r'(\d+)')
print(df)

Output:

             Text Numbers
0       Order 123     123
1        Item 456     456
2  No number here     NaN

Find Email Addresses

code.py
import pandas as pd

df = pd.DataFrame({
    'Text': ['Contact: john@gmail.com', 'Email sarah@yahoo.com', 'No email']
})

# Extract the first email-like substring (the dot is escaped so it matches a literal '.')
df['Email'] = df['Text'].str.extract(r'(\w+@\w+\.\w+)')
print(df)

Output:

                      Text            Email
0  Contact: john@gmail.com   john@gmail.com
1    Email sarah@yahoo.com  sarah@yahoo.com
2                 No email              NaN

Find Phone Numbers

code.py
import pandas as pd

df = pd.DataFrame({
    'Text': ['Call 123-456-7890', 'Phone: 987-654-3210']
})

# Extract phone numbers in the form 3 digits, 3 digits, 4 digits
df['Phone'] = df['Text'].str.extract(r'(\d{3}-\d{3}-\d{4})')
print(df)

\d{3} means exactly 3 digits.
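
Because each {n} fixes an exact count, the same pattern can also split a phone number into its parts. The sketch below goes one step beyond the lesson by using named groups, (?P<name>...), which str.extract turns into column names. The group names area, prefix, and line and the extra 'Ext 12-34' row are assumptions made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Text': ['Call 123-456-7890', 'Phone: 987-654-3210', 'Ext 12-34']
})

# Each named group becomes one column in the result.
# \d{3} needs exactly 3 digits and \d{4} exactly 4, so 'Ext 12-34' does not match.
parts = df['Text'].str.extract(r'(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})')
print(parts)
```

Rows that do not fit the full pattern come back as NaN in every column, which makes badly formatted numbers easy to spot.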

Replace Using Patterns

code.py
import pandas as pd

df = pd.DataFrame({
    'Phone': ['123-456-7890', '987-654-3210']
})

# Remove every character that is not a digit
df['Clean'] = df['Phone'].str.replace(r'\D', '', regex=True)
print(df)

Output:

          Phone       Clean
0  123-456-7890  1234567890
1  987-654-3210  9876543210

\D means "not a digit", the opposite of \d.

Check Pattern Exists

code.py
import pandas as pd

df = pd.DataFrame({
    'Text': ['abc123', 'hello', '456def']
})

# Check whether each row contains at least one digit (True or False)
df['Has_Number'] = df['Text'].str.contains(r'\d', regex=True)
print(df)
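
The True/False column from str.contains() is most useful as a filter. The following sketch assumes the column may also hold a missing value (the None row is added here for illustration) and uses the na=False argument of str.contains() so that missing text counts as "no match".

```python
import pandas as pd

df = pd.DataFrame({
    'Text': ['abc123', 'hello', '456def', None]
})

# na=False makes the missing row evaluate to False instead of a missing value,
# so the result can be used directly as a boolean filter.
has_number = df['Text'].str.contains(r'\d', regex=True, na=False)
print(has_number.tolist())  # [True, False, True, False]

# Keep only the rows that contain a digit
with_digits = df[has_number]
print(with_digits['Text'].tolist())  # ['abc123', '456def']
```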

Useful Patterns Cheat Sheet

What to Find         Pattern
Any number           \d+
Any word             \w+
Email                \w+@\w+\.\w+
Phone (US)           \d{3}-\d{3}-\d{4}
Starts with letter   ^[A-Za-z]
Ends with number     \d$
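
These patterns are easy to try out with Python's built-in re module before using them on a DataFrame. The sketch below uses a made-up sample string, and it also shows one known limit of the simple email pattern from this lesson: \w does not match a dot, so a dotted name is cut short.

```python
import re

# Patterns taken from the examples in this lesson.
EMAIL = r"\w+@\w+\.\w+"
US_PHONE = r"\d{3}-\d{3}-\d{4}"

# Hypothetical sample text, used only for illustration.
sample = "Reach sarah@yahoo.com or 987-654-3210 ext 7"

emails = re.findall(EMAIL, sample)
phones = re.findall(US_PHONE, sample)
numbers = re.findall(r"\d+", sample)
starts_with_letter = bool(re.search(r"^[A-Za-z]", sample))
ends_with_number = bool(re.search(r"\d$", sample))

print(emails)              # ['sarah@yahoo.com']
print(phones)              # ['987-654-3210']
print(numbers)             # ['987', '654', '3210', '7']
print(starts_with_letter)  # True
print(ends_with_number)    # True

# Limitation: the part before '@' stops at the dot.
partial = re.findall(EMAIL, "john.doe@gmail.com")
print(partial)             # ['doe@gmail.com']
```

For real data, treat these as starting points and test them against a few rows of your own before trusting the results.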

Key Points

  • Regex finds patterns, not exact text
  • Use str.extract() to pull out matches
  • Use str.contains() to check if pattern exists
  • Use str.replace() with regex to clean data
  • Patterns look complex but follow simple rules
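
To tie the three methods together, here is a short sketch that applies all of them to one column. The Note column and its values are invented for this example, and expand=False is passed to str.extract() so that it returns a single Series rather than a one-column DataFrame.

```python
import pandas as pd

# Hypothetical data, used only for illustration.
df = pd.DataFrame({
    'Note': ['Order 123 call 123-456-7890', 'No details', 'Item 456']
})

# str.extract(): pull out the first run of digits (NaN when there is none)
df['Order'] = df['Note'].str.extract(r'(\d+)', expand=False)

# str.contains(): check whether a US-style phone number is present
df['Has_Phone'] = df['Note'].str.contains(r'\d{3}-\d{3}-\d{4}', regex=True)

# str.replace(): strip every non-digit character
df['Digits_Only'] = df['Note'].str.replace(r'\D', '', regex=True)

print(df)
```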

What's Next?

Learn to work with dates and times in your data.
