EDA Workflow I
Univariate analysis: Understanding single variables through statistics and visualization
What You'll Learn
- What is EDA?
- Univariate analysis (one variable)
- Measures of central tendency (mean, median, mode)
- Measures of dispersion (range, variance, std dev)
- Visualizing distributions
What is EDA?
Exploratory Data Analysis (EDA) is the process of analyzing datasets to summarize their main characteristics, often using statistical graphics and other data visualization methods.
Goals of EDA:
- Understand the data structure
- Identify missing values and outliers
- Discover patterns and relationships
- Generate hypotheses and check assumptions before formal testing
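The first two goals can be sketched in a few lines of pandas. This is a minimal illustration, not a full workflow: the `toy` DataFrame and its column names are invented here so the example runs without downloading anything, and the same calls should apply to any DataFrame you load.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0, 54.0],
    "fare": [7.25, 71.28, 8.05, 53.10, 512.33],
    "cls": ["Third", "First", "Third", "First", None],
})

# Structure: number of rows and columns, and the dtype of each column
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")
print(toy.dtypes)

# Missing values: count of missing entries per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Share of rows that have at least one missing value
rows_with_missing = toy.isna().any(axis=1).mean()
print(f"{rows_with_missing:.0%} of rows have a missing value")
```

Checking structure and missing values first tells you which columns are numerical or categorical, and how much data each later statistic is actually based on.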
Univariate Analysis
Univariate analysis examines one variable at a time, without reference to any other variable.
Numerical Variables:
code.py
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Load dataset
df = sns.load_dataset('titanic')
# Summary statistics
print(df['age'].describe())
# Histogram (Distribution)
sns.histplot(df['age'], kde=True)
plt.title('Age Distribution')
plt.show()
# Box Plot (Outliers)
sns.boxplot(x=df['age'])
plt.title('Age Box Plot')
plt.show()
Categorical Variables:
code.py
# Reuses df, sns, and plt from the numerical example above
# Value counts
print(df['class'].value_counts())
# Percentage
print(df['class'].value_counts(normalize=True) * 100)
# Bar Chart
sns.countplot(x='class', data=df)
plt.title('Passenger Class Count')
plt.show()
Key Statistics
- Central Tendency: Where is the center?
  - Mean: Average (sensitive to outliers)
  - Median: Middle value (robust to outliers)
  - Mode: Most frequent value
- Dispersion: How spread out is the data?
  - Range: Max - Min
  - Variance: Average squared deviation from the mean
  - Standard Deviation: Square root of variance (same units as data)
  - IQR (Interquartile Range): 75th percentile - 25th percentile
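To make these definitions concrete, here is a small sketch that computes each statistic with Python's standard `statistics` module. The `bills` sample is invented for illustration and deliberately includes one large value, so you can see the mean move while the median stays put. The last few lines apply the common 1.5 × IQR rule for flagging outliers, which is an assumption added here rather than something defined above.

```python
import statistics

# Hypothetical sample, invented for illustration (95 acts as an outlier)
bills = [10, 12, 12, 14, 15, 18, 20, 95]

# Central tendency
mean = statistics.mean(bills)      # pulled upward by the value 95
median = statistics.median(bills)  # middle of the sorted values
mode = statistics.mode(bills)      # most frequent value

# Dispersion
value_range = max(bills) - min(bills)
variance = statistics.pvariance(bills)  # average squared deviation from the mean
std_dev = statistics.pstdev(bills)      # square root of the variance

# IQR: 75th percentile minus 25th percentile
q1, _, q3 = statistics.quantiles(bills, n=4, method="inclusive")
iqr = q3 - q1

# Assumed rule of thumb: flag values more than 1.5 * IQR outside the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [b for b in bills if b < lower_fence or b > upper_fence]

print("mean:", mean, "median:", median, "mode:", mode)
print("range:", value_range, "variance:", variance, "std dev:", round(std_dev, 2))
print("IQR:", iqr, "outliers:", outliers)
```

The mean (24.5) sits well above the median (14.5) because of the single large value, which is the sensitivity to outliers noted in the list. Note that `pvariance` and `pstdev` divide by n, matching the definition above, whereas pandas' `var()` and `std()` divide by n - 1 by default and will give slightly larger numbers on the same data.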
Practice Exercise
code.py
import seaborn as sns
import pandas as pd
# Load tips dataset
tips = sns.load_dataset('tips')
# 1. Analyze 'total_bill' (Numerical)
print("Mean bill:", tips['total_bill'].mean())
print("Median bill:", tips['total_bill'].median())
# 2. Analyze 'day' (Categorical)
print("\nBusiest day:\n", tips['day'].mode()[0])
print("\nDay counts:\n", tips['day'].value_counts())
Next Steps
Now let's look at relationships between multiple variables!