EDA Workflow II
Bivariate and Multivariate analysis: Finding relationships and correlations
What You'll Learn
- Bivariate analysis (two variables)
- Correlation vs Causation
- Scatter plots and Line plots
- Multivariate analysis (3+ variables)
- Heatmaps
Bivariate Analysis
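Bivariate analysis examines the relationship between two variables. Which plot or statistic fits best depends on whether each variable is numerical or categorical.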
Numerical vs Numerical:
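The `.corr()` call used in this subsection returns Pearson's correlation coefficient by default. As a rough sketch of what that number measures, the snippet below computes it by hand on a tiny made-up sample (the values and the helper name `pearson_r` are illustrative assumptions, not taken from the tips dataset) and compares the result with pandas.

```python
import math
import pandas as pd

# Hypothetical sample: five bills and the tips left on them (made-up values).
bills = [10.0, 20.0, 30.0, 40.0, 50.0]
tips = [1.5, 3.0, 4.0, 7.0, 7.5]

def pearson_r(xs, ys):
    """Sum of paired deviations from the means, scaled by both spreads."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sq_dev_x = sum((x - mean_x) ** 2 for x in xs)
    sq_dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / math.sqrt(sq_dev_x * sq_dev_y)

r_manual = pearson_r(bills, tips)
r_pandas = pd.Series(bills).corr(pd.Series(tips))
print(f"manual: {r_manual:.4f}, pandas: {r_pandas:.4f}")
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 indicate little linear relationship. The coefficient describes association only; it does not say that one variable causes the other.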
code.py
import seaborn as sns
import matplotlib.pyplot as plt
df = sns.load_dataset('tips')
# Scatter Plot
sns.scatterplot(x='total_bill', y='tip', data=df)
plt.title('Bill vs Tip')
plt.show()
# Correlation
correlation = df['total_bill'].corr(df['tip'])
print(f"Correlation: {correlation:.2f}")
Numerical vs Categorical:
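The box plot and bar plot in this subsection have a numeric counterpart: summary statistics per category. The sketch below uses a small made-up frame that borrows the `day` and `total_bill` column names from the tips dataset (the values are invented for illustration) and summarizes the bill within each day using `groupby`.

```python
import pandas as pd

# Made-up rows that mimic the 'day' and 'total_bill' columns of the tips dataset.
sample = pd.DataFrame({
    'day': ['Thur', 'Thur', 'Fri', 'Fri', 'Sat', 'Sat', 'Sat'],
    'total_bill': [12.0, 18.0, 15.0, 25.0, 20.0, 30.0, 40.0],
})

# Count, mean, and median of the bill within each day.
summary = sample.groupby('day')['total_bill'].agg(['count', 'mean', 'median'])
print(summary)
```

Comparing the mean and median per group gives a quick hint about skew before looking at the plots.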
code.py
# Box Plot (Distribution by category)
sns.boxplot(x='day', y='total_bill', data=df)
plt.title('Bill Distribution by Day')
plt.show()
# Bar Plot (Mean by category)
sns.barplot(x='sex', y='total_bill', data=df) # Shows mean with confidence interval
plt.show()
Categorical vs Categorical:
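The cross tabulation in this subsection reports raw counts, which can be hard to compare when the categories have different sizes. One option, sketched here on a small made-up frame (column names borrowed from the tips dataset, values invented), is to pass `normalize='index'` to `pd.crosstab` so that each row shows proportions instead of counts.

```python
import pandas as pd

# Made-up rows that mimic the 'day' and 'sex' columns of the tips dataset.
sample = pd.DataFrame({
    'day': ['Thur', 'Thur', 'Thur', 'Thur', 'Sat', 'Sat', 'Sat', 'Sat', 'Sat'],
    'sex': ['Male', 'Female', 'Female', 'Female', 'Male', 'Male', 'Male', 'Female', 'Male'],
})

# normalize='index' divides each row by its row total, so every row sums to 1.
row_share = pd.crosstab(sample['day'], sample['sex'], normalize='index')
print(row_share.round(2))
```

Each row now answers "within this day, what share of the rows fall in each `sex` category?", which stays comparable across days with different totals.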
code.py
import pandas as pd
# Cross Tabulation (counts for each day/sex combination)
ct = pd.crosstab(df['day'], df['sex'])
print(ct)
# Heatmap of counts
sns.heatmap(ct, annot=True, fmt='d', cmap='Blues')
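plt.show()
Multivariate Analysis
Multivariate analysis adds a third (or fourth) variable to the picture, typically by mapping it to color, marker size, or a grid of subplots.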
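A numeric counterpart to coloring a scatter plot by a third variable is to compute the relationship separately within each group. A minimal sketch, assuming a small made-up frame with tips-style column names (the values are invented for illustration):

```python
import pandas as pd

# Made-up rows that mimic the 'sex', 'total_bill', and 'tip' columns of the tips dataset.
sample = pd.DataFrame({
    'sex': ['Male'] * 4 + ['Female'] * 4,
    'total_bill': [10.0, 20.0, 30.0, 40.0, 10.0, 20.0, 30.0, 40.0],
    'tip': [1.0, 2.0, 3.0, 4.0, 3.0, 2.5, 3.5, 3.0],
})

# Correlation between bill and tip, computed within each group.
per_group = {
    name: group['total_bill'].corr(group['tip'])
    for name, group in sample.groupby('sex')
}
for name, r in per_group.items():
    print(f"{name}: r = {r:.2f}")
```

If the per-group values differ a lot, as they do in this invented sample, the third variable is changing the relationship, and a single overall correlation would hide that.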
code.py
# Scatter plot with Color (Hue)
sns.scatterplot(x='total_bill', y='tip', hue='sex', data=df)
plt.title('Bill vs Tip by Sex')
plt.show()
# Scatter plot with Size
sns.scatterplot(x='total_bill', y='tip', size='size', data=df)
plt.show()
# Pair Plot (All numerical relationships)
sns.pairplot(df, hue='sex')
plt.show()
Correlation Heatmap
Visualizing correlations between all numerical variables.
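A large value in a correlation matrix does not by itself show that one variable causes another. The sketch below uses synthetic data (the variable names, coefficients, and the `residuals` helper are all invented for illustration) in which a hidden variable drives two others. They correlate strongly with each other, yet the correlation largely disappears once the hidden variable's linear effect is removed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Hypothetical setup: temperature drives both quantities; neither affects the other.
temperature = rng.normal(size=n)
ice_cream = 2.0 * temperature + rng.normal(size=n)
sunburns = 3.0 * temperature + rng.normal(size=n)

data = pd.DataFrame({
    'temperature': temperature,
    'ice_cream': ice_cream,
    'sunburns': sunburns,
})
raw_r = data['ice_cream'].corr(data['sunburns'])

def residuals(y, x):
    """What is left of y after subtracting a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlate what remains of each variable once temperature is accounted for.
adjusted_r = np.corrcoef(
    residuals(ice_cream, temperature),
    residuals(sunburns, temperature),
)[0, 1]

print(f"raw correlation: {raw_r:.2f}")
print(f"after removing temperature: {adjusted_r:.2f}")
```

In real data the hidden variable is often not recorded, so a strong cell in the heatmap is best read as a lead to investigate rather than as evidence of cause and effect.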
code.py
# Calculate correlation matrix
corr_matrix = df.corr(numeric_only=True)
# Plot heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation Matrix')
plt.show()
Practice Exercise
code.py
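import seaborn as sns
import matplotlib.pyplot as plt
# Load diamonds dataset
diamonds = sns.load_dataset('diamonds')
# 1. Correlation between price and carat
print("Correlation:", diamonds['price'].corr(diamonds['carat']))
# 2. Price distribution by cut (Boxplot), one possible solution
sns.boxplot(x='cut', y='price', data=diamonds)
plt.title('Price Distribution by Cut')
plt.show()
# 3. Price vs Carat colored by Clarity, one possible solution
sns.scatterplot(x='carat', y='price', hue='clarity', data=diamonds)
plt.title('Price vs Carat by Clarity')
plt.show()
Next Steps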
Now that we understand our data, let's start modeling!