Python Statistical Analysis
Statistical analysis with Python libraries
What You'll Learn
- Essential Python libraries for statistics
- NumPy and Pandas basics
- Statistical analysis with SciPy
- Visualization with Matplotlib and Seaborn
- Complete analysis workflows
Essential Libraries
Install:
pip install numpy pandas scipy matplotlib seaborn statsmodels scikit-learn
Core libraries:
NumPy: Numerical computing, arrays
Pandas: Data manipulation, DataFrames
SciPy: Statistical functions
Statsmodels: Statistical models, tests
Matplotlib/Seaborn: Visualization
Scikit-learn: Machine learning (includes some stats)
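These are conventionally imported with short aliases; a typical analysis script begins:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm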
NumPy Basics
Import:
import numpy as np
Create arrays:
data = np.array([1, 2, 3, 4, 5])
data_2d = np.array([[1, 2], [3, 4], [5, 6]])
Basic statistics:
np.mean(data) # Mean
np.median(data) # Median
np.std(data) # Standard deviation (population)
np.std(data, ddof=1) # Sample std dev
np.var(data) # Variance
np.min(data) # Minimum
np.max(data) # Maximum
np.percentile(data, 75) # 75th percentile
Useful functions:
np.sum(data) # Sum
np.cumsum(data) # Cumulative sum
np.diff(data) # Differences
np.corrcoef(x, y) # Correlation coefficient
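Note that np.corrcoef returns a full correlation matrix rather than a single number; take an off-diagonal entry to get the scalar:
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
r = np.corrcoef(x, y)[0, 1]  # Correlation of x with y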
Pandas Basics
Import:
import pandas as pd
Create DataFrame:
df = pd.DataFrame({
    'sales': [100, 110, 105, 115, 120],
    'advertising': [10, 12, 11, 13, 14],
    'region': ['North', 'South', 'North', 'West', 'South']
})
Read data:
df = pd.read_csv('data.csv')
df = pd.read_excel('data.xlsx')
Basic info:
df.head() # First 5 rows
df.info() # Data types, null counts
df.describe() # Summary statistics
df.shape # (rows, columns)
df.columns # Column names
Select data:
df['sales'] # One column (Series)
df[['sales', 'advertising']] # Multiple columns
df.iloc[0] # First row by position
df.loc[0] # First row by label
df[df['sales'] > 100] # Filter rows
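Combine conditions with & (and) and | (or), wrapping each condition in its own parentheses:
df[(df['sales'] > 100) & (df['region'] == 'North')]
df[(df['sales'] > 115) | (df['region'] == 'South')]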
Descriptive Statistics with Pandas
Summary statistics:
df.describe() # All numeric columns
# Individual metrics
df['sales'].mean()
df['sales'].median()
df['sales'].std()
df['sales'].var()
df['sales'].min()
df['sales'].max()
df['sales'].quantile(0.25) # Q1
df['sales'].quantile(0.75) # Q3
Group statistics:
df.groupby('region')['sales'].mean()
df.groupby('region').agg({
    'sales': ['mean', 'std', 'count']
})
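Named aggregation (pandas >= 0.25) gives the result columns readable names:
df.groupby('region').agg(
    mean_sales=('sales', 'mean'),
    std_sales=('sales', 'std'),
    n=('sales', 'count')
)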
Correlation matrix:
df.corr(numeric_only=True)  # Numeric columns only (required in pandas >= 2.0 when text columns like 'region' are present)
df[['sales', 'advertising']].corr()
Statistical Tests with SciPy
Import:
from scipy import stats
t-Tests:
# One-sample t-test
stats.ttest_1samp(data, popmean=100)
# Two-sample t-test (independent)
stats.ttest_ind(group1, group2)
# Paired t-test
stats.ttest_rel(before, after)
# Returns: (t-statistic, p-value)
Example:
group_a = [23, 25, 27, 29, 31]
group_b = [20, 22, 24, 26, 28]
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Significant difference!")
else:
    print("No significant difference")
Other tests:
# Chi-square test
stats.chi2_contingency(contingency_table)
# Pearson correlation
stats.pearsonr(x, y)
# Spearman correlation
stats.spearmanr(x, y)
# Normality test
stats.shapiro(data)
# ANOVA
stats.f_oneway(group1, group2, group3)
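Each test returns a statistic and a p-value. For example, interpreting Shapiro-Wilk (the null hypothesis is that the data are normally distributed):
stat, p = stats.shapiro(data)
if p < 0.05:
    print("Evidence against normality - consider a non-parametric test")
else:
    print("No evidence against normality")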
Linear Regression
Simple linear regression:
from scipy.stats import linregress
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
slope, intercept, r_value, p_value, std_err = linregress(x, y)
print(f"Slope: {slope:.4f}")
print(f"Intercept: {intercept:.4f}")
print(f"R-squared: {r_value**2:.4f}")
print(f"p-value: {p_value:.4f}")
# Make predictions
y_pred = slope * np.array(x) + intercept
Multiple regression with statsmodels:
import statsmodels.api as sm
# Prepare data (assumes df also has a 'price' column)
X = df[['advertising', 'price']]
y = df['sales']
# Add constant (intercept)
X = sm.add_constant(X)
# Fit model
model = sm.OLS(y, X).fit()
# Summary
print(model.summary())
# Predictions
predictions = model.predict(X)
# Coefficients
print(model.params)
# R-squared
print(model.rsquared)
Regression with scikit-learn:
from sklearn.linear_model import LinearRegression
X = df[['advertising', 'price']]
y = df['sales']
model = LinearRegression()
model.fit(X, y)
print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")
print(f"R-squared: {model.score(X, y)}")
# Predict
predictions = model.predict(X)
Visualization with Matplotlib
Import:
import matplotlib.pyplot as plt
Basic plots:
# Line plot
plt.plot(x, y)
plt.xlabel('X label')
plt.ylabel('Y label')
plt.title('Title')
plt.show()
# Scatter plot
plt.scatter(x, y)
plt.show()
# Histogram
plt.hist(data, bins=10)
plt.show()
# Box plot
plt.boxplot(data)
plt.show()
# Bar chart
plt.bar(categories, values)
plt.show()
Multiple subplots:
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
axes[0, 0].plot(x, y)
axes[0, 0].set_title('Line Plot')
axes[0, 1].scatter(x, y)
axes[0, 1].set_title('Scatter Plot')
axes[1, 0].hist(data, bins=10)
axes[1, 0].set_title('Histogram')
axes[1, 1].boxplot(data)
axes[1, 1].set_title('Box Plot')
plt.tight_layout()
plt.show()
Visualization with Seaborn
Import:
import seaborn as sns
sns.set_theme() # Better default styling
Distribution plots:
# Histogram with KDE
sns.histplot(data, kde=True)
# Box plot
sns.boxplot(data=df, x='region', y='sales')
# Violin plot
sns.violinplot(data=df, x='region', y='sales')
Relationship plots:
# Scatter with regression line
sns.regplot(x='advertising', y='sales', data=df)
# Scatter plot matrix
sns.pairplot(df)
# Correlation heatmap
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
Categorical plots:
# Bar plot with error bars
sns.barplot(data=df, x='region', y='sales')
# Count plot
sns.countplot(data=df, x='region')
Time Series Analysis
Date handling:
df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)
Resampling:
# Monthly average ('M' becomes 'ME' in pandas >= 2.2)
df_monthly = df.resample('M').mean()
# Quarterly sum ('Q' becomes 'QE' in pandas >= 2.2)
df_quarterly = df.resample('Q').sum()
Moving average:
df['MA_3'] = df['sales'].rolling(window=3).mean()
Exponential smoothing:
df['EXP'] = df['sales'].ewm(alpha=0.3, adjust=False).mean()  # alpha closer to 1 weights recent values more
Decomposition:
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(df['sales'], model='additive', period=12)
fig, axes = plt.subplots(4, 1, figsize=(10, 8))
decomposition.observed.plot(ax=axes[0])
decomposition.trend.plot(ax=axes[1])
decomposition.seasonal.plot(ax=axes[2])
decomposition.resid.plot(ax=axes[3])
plt.show()
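The result object also has a built-in plot method, so the four subplots above can be drawn in one call:
decomposition.plot()
plt.show()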
Complete Analysis Workflow
Sales analysis example:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm
# 1. Load data
df = pd.read_csv('sales_data.csv')
# 2. Initial exploration
print(df.head())
print(df.info())
print(df.describe())
# 3. Data cleaning
df.dropna(inplace=True) # Remove missing
df = df[df['sales'] > 0] # Remove invalid
# 4. Descriptive statistics
print("\nSales Statistics:")
print(f"Mean: {df['sales'].mean():.2f}")
print(f"Median: {df['sales'].median():.2f}")
print(f"Std Dev: {df['sales'].std():.2f}")
# 5. Group analysis
regional_stats = df.groupby('region')['sales'].agg(['mean', 'std', 'count'])
print("\nRegional Statistics:")
print(regional_stats)
# 6. Visualizations
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
# Distribution
axes[0, 0].hist(df['sales'], bins=20, edgecolor='black')
axes[0, 0].set_title('Sales Distribution')
axes[0, 0].set_xlabel('Sales')
# Box plot by region
df.boxplot(column='sales', by='region', ax=axes[0, 1])
axes[0, 1].set_title('Sales by Region')
# Scatter: Sales vs Advertising
axes[1, 0].scatter(df['advertising'], df['sales'])
axes[1, 0].set_xlabel('Advertising')
axes[1, 0].set_ylabel('Sales')
axes[1, 0].set_title('Sales vs Advertising')
# Correlation heatmap
corr_data = df[['sales', 'advertising', 'price']].corr()
sns.heatmap(corr_data, annot=True, cmap='coolwarm', ax=axes[1, 1])
plt.tight_layout()
plt.show()
# 7. Statistical tests
# Compare regions
north = df[df['region'] == 'North']['sales']
south = df[df['region'] == 'South']['sales']
t_stat, p_value = stats.ttest_ind(north, south)
print(f"\nt-test North vs South: p-value = {p_value:.4f}")
# 8. Regression analysis
X = df[['advertising', 'price']]
y = df['sales']
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
print("\nRegression Results:")
print(model.summary())
# 9. Predictions
new_data = pd.DataFrame({
    'const': [1],
    'advertising': [15],
    'price': [25]
})
prediction = model.predict(new_data)
print(f"\nPredicted sales: {prediction.values[0]:.2f}")
A/B Test Analysis
Complete A/B test:
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
# Data
df = pd.read_csv('ab_test.csv')
# Separate groups
control = df[df['group'] == 'A']['conversion']
treatment = df[df['group'] == 'B']['conversion']
# Summary statistics
print(f"Control: n={len(control)}, mean={control.mean():.4f}")
print(f"Treatment: n={len(treatment)}, mean={treatment.mean():.4f}")
print(f"Lift: {(treatment.mean() - control.mean()) / control.mean() * 100:.2f}%")
# Statistical test
t_stat, p_value = stats.ttest_ind(control, treatment)
print(f"\nt-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")
if p_value < 0.05:
    print("Result: SIGNIFICANT")
else:
    print("Result: NOT SIGNIFICANT")
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Distributions
axes[0].hist(control, alpha=0.5, label='Control', bins=20)
axes[0].hist(treatment, alpha=0.5, label='Treatment', bins=20)
axes[0].legend()
axes[0].set_xlabel('Conversion Rate')
axes[0].set_title('Distribution')
# Box plot
axes[1].boxplot([control, treatment], labels=['Control', 'Treatment'])  # 'labels' is renamed 'tick_labels' in Matplotlib >= 3.9
axes[1].set_ylabel('Conversion Rate')
axes[1].set_title('Comparison')
plt.tight_layout()
plt.show()
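When conversion is a 0/1 outcome, a two-proportion z-test is the more conventional test than a t-test (with large samples the two give very similar p-values). A minimal sketch using statsmodels, reusing the control and treatment Series from above:
from statsmodels.stats.proportion import proportions_ztest
successes = [control.sum(), treatment.sum()]
n_obs = [len(control), len(treatment)]
z_stat, p_value = proportions_ztest(successes, n_obs)
print(f"z-statistic: {z_stat:.4f}, p-value: {p_value:.4f}")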
Tips and Best Practices
1. Jupyter notebooks: Best for exploratory analysis
pip install jupyter
jupyter notebook
2. Virtual environments:
python -m venv myenv
source myenv/bin/activate # Mac/Linux
myenv\Scripts\activate # Windows
3. Save/load models:
import pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)  # Only unpickle files you trust
4. Suppress warnings (sparingly - warnings often flag real problems):
import warnings
warnings.filterwarnings('ignore')
5. Set random seed:
np.random.seed(42) # Reproducibility
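Newer NumPy code prefers the Generator API over the global seed:
rng = np.random.default_rng(42)
sample = rng.normal(loc=0, scale=1, size=100)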
Common Mistakes
1. Not checking data types:
df.dtypes # Always check!
2. Forgetting to handle missing values:
df.isnull().sum() # Count nulls
df.dropna() # Remove
df.fillna(0) # Fill with 0
3. Using the wrong ddof: NumPy's default is ddof=0 (population); Pandas' default is ddof=1 (sample).
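A quick demonstration of the difference:
vals = [2, 4, 4, 4, 5, 5, 7, 9]
np.std(vals) # 2.0 (population, ddof=0)
pd.Series(vals).std() # ~2.14 (sample, ddof=1)
np.std(vals, ddof=1) # Matches the Pandas result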
4. Not resetting index after filtering:
df.reset_index(drop=True, inplace=True)
5. Mixing libraries unnecessarily: Pick one approach and stick with it
Resources
Documentation:
- NumPy: numpy.org
- Pandas: pandas.pydata.org
- SciPy: scipy.org
- Matplotlib: matplotlib.org
- Seaborn: seaborn.pydata.org
Learning:
- Python for Data Analysis (book by Wes McKinney)
- Kaggle Learn courses
- Real Python tutorials
Practice Exercise
Dataset: Student scores
Create a DataFrame with: student_id, hours_studied, previous_score, final_score
Tasks:
- Load and explore data
- Calculate descriptive statistics
- Check correlation between variables
- Build regression model: final_score ~ hours_studied + previous_score
- Visualize relationships
- Test if hours_studied significantly affects scores
Solution template provided in course materials
Congratulations!
You've completed the Statistics course!
You now know:
✓ Descriptive statistics
✓ Probability and distributions
✓ Hypothesis testing
✓ Regression analysis
✓ A/B testing
✓ Time series forecasting
✓ Excel and Python implementation
Next steps:
- Practice on real datasets
- Kaggle competitions
- Apply to your work
- Keep learning advanced topics (Bayesian stats, ML, etc.)
Tip: Python is powerful but Excel is ubiquitous - know both well!