Introduction to Pandas
Learn what Pandas is and why it is essential for data analysis
What is Pandas?
Pandas is Python's most popular library for working with data. Think of it like Excel, but more powerful and automated.
The name comes from "panel data", an econometrics term for multi-dimensional datasets that track the same subjects over time.
Why Pandas is essential:
- Works with tables of data (like spreadsheets)
- Clean and analyze data easily
- Read and write CSV files, Excel workbooks, and SQL databases
- Standard tool in data science
What Makes Pandas Special?
Built on NumPy:
- Fast and efficient
- Handles millions of rows, as long as the data fits in memory
- Usually needs less memory than plain Python lists for numeric data
Easy to use:
- Simple, readable syntax
- Works like spreadsheet formulas
- Intuitive for beginners
Powerful features:
- Filter and sort data
- Group and summarize
- Handle missing values
- Merge datasets
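As a preview of the features listed above, here is a minimal sketch of filtering, sorting, and grouping. The `orders` table and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical order data, invented for this illustration
orders = pd.DataFrame({
    'Category': ['Computers', 'Phones', 'Computers', 'Phones'],
    'Product': ['Laptop', 'Phone', 'Monitor', 'Charger'],
    'Price': [999, 599, 249, 29],
})

# Filter: keep only rows where Price is above 200
over_200 = orders[orders['Price'] > 200]

# Sort: most expensive first
by_price = orders.sort_values('Price', ascending=False)

# Group and summarize: total price per category
totals = orders.groupby('Category')['Price'].sum()

print(over_200)
print(by_price)
print(totals)
```

Each operation returns a new object and leaves `orders` unchanged, so the steps can be tried in any order.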
Installing Pandas
pip install pandas
Check installation:
import pandas as pd
print(pd.__version__)
Pandas Data Structures
Pandas has two main structures:
Series (1D)
A Series is like a single column of data.
import pandas as pd
prices = pd.Series([100, 200, 300])
print(prices)
Output:
0    100
1    200
2    300
dtype: int64
Think of it as one column from a spreadsheet.
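Like a spreadsheet column, a Series can be used in calculations as a whole. The sketch below reuses the same three prices; the variable names `doubled`, `total`, and `average` are only illustrative.

```python
import pandas as pd

prices = pd.Series([100, 200, 300])

# Arithmetic applies to every element at once, with no loop
doubled = prices * 2

# Built-in summaries work on the whole column
total = prices.sum()
average = prices.mean()

# Individual values are looked up by their index label
first = prices[0]

print(doubled)
print(total, average, first)
```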
DataFrame (2D)
A DataFrame is like a full spreadsheet with rows and columns.
import pandas as pd
data = {
'Product': ['Laptop', 'Phone', 'Tablet'],
'Price': [999, 599, 399]
}
df = pd.DataFrame(data)
print(df)
Output:
  Product  Price
0  Laptop    999
1   Phone    599
2  Tablet    399
This is what you'll use most of the time.
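To connect the two structures, the short sketch below inspects the same DataFrame and shows that selecting one column gives back a Series. The name `price_column` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['Laptop', 'Phone', 'Tablet'],
    'Price': [999, 599, 399],
})

# shape is (number of rows, number of columns)
print(df.shape)

# Column names
print(list(df.columns))

# Selecting a single column returns a Series
price_column = df['Price']
print(type(price_column).__name__)
```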
Why Use Pandas Instead of Lists?
Python lists:
names = ['John', 'Sarah', 'Mike']
ages = [25, 30, 28]
cities = ['NYC', 'LA', 'Chicago']
# Hard to work with related data
# Need multiple lists
# No built-in analysis tools
Pandas DataFrame:
import pandas as pd
df = pd.DataFrame({
'Name': ['John', 'Sarah', 'Mike'],
'Age': [25, 30, 28],
'City': ['NYC', 'LA', 'Chicago']
})
# All data organized together
# Easy to filter, sort, analyze
# Powerful built-in functions
Real-World Example
The scenario: Analyze sales data.
import pandas as pd
sales = pd.DataFrame({
'Date': ['2024-01-01', '2024-01-02', '2024-01-03'],
'Product': ['Laptop', 'Phone', 'Tablet'],
'Quantity': [5, 10, 7],
'Price': [999, 599, 399]
})
print(sales)
print()
sales['Total'] = sales['Quantity'] * sales['Price']
print("With totals:")
print(sales)
print()
print("Total revenue:", sales['Total'].sum())
print("Average price:", sales['Price'].mean())
print("Best selling:", sales.loc[sales['Quantity'].idxmax(), 'Product'])
What this does:
- Creates sales data table
- Calculates total for each row
- Shows total revenue
- Calculates average price
- Finds best-selling product
All in just a few lines!
Common Pandas Operations
Reading Data
import pandas as pd
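df = pd.read_csv('data.csv')
df = pd.read_excel('data.xlsx')  # needs an Excel engine such as openpyxl installed
# query is a SQL string and connection is an open database connection
df = pd.read_sql(query, connection)
Quick Look
print(df.head())
df.info()  # prints its summary directly and returns None
print(df.describe())
Filtering
expensive = df[df['Price'] > 500]
Sorting
sorted_df = df.sort_values('Price')
Grouping
by_category = df.groupby('Category').sum(numeric_only=True)
Pandas vs Excel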
| Feature | Excel | Pandas |
|---|---|---|
| Data size | About 1 million rows per sheet | Millions of rows, limited by available memory |
| Speed | Slows down with big data | Fast, vectorized operations |
| Automation | Manual clicks | Write once, rerun any time |
| Reproducibility | Hard to track changes | Code documents every step |
| Advanced analysis | Limited | Extensive, plus the wider Python ecosystem |
Use Excel for:
- Quick manual tasks
- Sharing with non-programmers
- Simple data entry
Use Pandas for:
- Large datasets
- Repetitive tasks
- Complex analysis
- Automated reports
When to Use Pandas
Perfect for:
- Analyzing CSV/Excel files
- Cleaning messy data
- Combining multiple datasets
- Statistical analysis
- Preparing data for machine learning
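Two of the tasks above, cleaning messy data and combining datasets, can be sketched in a few lines. The `responses` and `cities` tables are hypothetical, and filling a gap with the column average is just one possible cleaning choice, not a recommendation for every dataset.

```python
import pandas as pd

# Hypothetical survey responses with one missing age
responses = pd.DataFrame({
    'Name': ['John', 'Sarah', 'Mike'],
    'Age': [25, None, 28],
})

# Hypothetical second dataset that shares the Name column
cities = pd.DataFrame({
    'Name': ['John', 'Sarah', 'Mike'],
    'City': ['NYC', 'LA', 'Chicago'],
})

# Count missing values in each column
print(responses.isna().sum())

# Fill the missing age with the average of the known ages
cleaned = responses.fillna({'Age': responses['Age'].mean()})

# Combine the two tables on the shared Name column
combined = cleaned.merge(cities, on='Name')

print(combined)
```

Depending on the analysis, dropping incomplete rows with `dropna()` may be a better fit than filling them.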
Examples:
- Sales analysis
- Survey results
- Financial data
- Scientific experiments
- Web scraping results
Import Convention
Always import Pandas as "pd":
import pandas as pd
Why:
- Shorter to type
- Standard convention, used in the official documentation
- Matches almost every tutorial and example you will read
Key Points to Remember
Pandas is Python's main library for data analysis. Built on NumPy for speed.
DataFrame is the primary structure - like a spreadsheet with rows and columns.
Series is a single column. DataFrames are collections of Series.
Pandas can read CSV, Excel, SQL, and many other formats easily.
Much more powerful than Excel for large datasets and automation.
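To make the point about file formats concrete, here is a minimal round-trip sketch that needs no files on disk. It assumes an in-memory CSV string and an in-memory SQLite database; the table name `products` is hypothetical.

```python
import io
import sqlite3

import pandas as pd

# CSV text held in memory so the example needs no file on disk
csv_text = "Product,Price\nLaptop,999\nPhone,599\nTablet,399\n"
df = pd.read_csv(io.StringIO(csv_text))

# Write the table to an in-memory SQLite database, then query it back
connection = sqlite3.connect(':memory:')
df.to_sql('products', connection, index=False)
expensive = pd.read_sql('SELECT * FROM products WHERE Price > 500', connection)
connection.close()

# Convert the DataFrame back to CSV text
csv_out = df.to_csv(index=False)

print(expensive)
print(csv_out)
```

With real data, the `io.StringIO` object would be replaced by a file path and the SQLite connection by a connection to your own database.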
Common Mistakes
Mistake 1: Not importing
df = DataFrame()  # NameError: DataFrame is not defined without an import
Fix:
import pandas as pd
df = pd.DataFrame()
Mistake 2: Wrong import name
import pandas
df = pd.DataFrame()  # NameError: pd is not defined. Use pandas.DataFrame() or import pandas as pd
Mistake 3: Using lists when DataFrame is better
names = []
ages = []
# Hard to manage related data
Better:
df = pd.DataFrame({'Name': names, 'Age': ages})
Mistake 4: Not checking data first
df = pd.read_csv('data.csv')
# Process without looking
Always check:
print(df.head())
df.info()  # prints its summary directly and returns None
What's Next?
You now understand what Pandas is and why it's important. Next, you'll learn about creating DataFrames - different ways to build DataFrames from various data sources.