Module 11

Python Statistical Analysis

Statistical analysis with Python libraries

What You'll Learn

  • Essential Python libraries for statistics
  • NumPy and Pandas basics
  • Statistical analysis with SciPy
  • Visualization with Matplotlib and Seaborn
  • Complete analysis workflows

Essential Libraries

Install:

pip install numpy pandas scipy matplotlib seaborn statsmodels scikit-learn

Core libraries:

NumPy: Numerical computing, arrays

Pandas: Data manipulation, DataFrames

SciPy: Statistical functions

Statsmodels: Statistical models, tests

Matplotlib/Seaborn: Visualization

Scikit-learn: Machine learning (includes some stats)
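
To confirm the installation, a quick version check (a minimal sketch that prints each library's version):

import numpy, pandas, scipy, matplotlib, seaborn, statsmodels, sklearn

for lib in (numpy, pandas, scipy, matplotlib, seaborn, statsmodels, sklearn):
    print(f"{lib.__name__}: {lib.__version__}")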

NumPy Basics

Import:

import numpy as np

Create arrays:

data = np.array([1, 2, 3, 4, 5])
data_2d = np.array([[1, 2], [3, 4], [5, 6]])

Basic statistics:

np.mean(data)      # Mean
np.median(data)    # Median
np.std(data)       # Standard deviation (population)
np.std(data, ddof=1)  # Sample std dev
np.var(data)       # Variance
np.min(data)       # Minimum
np.max(data)       # Maximum
np.percentile(data, 75)  # 75th percentile

Useful functions:

np.sum(data)       # Sum
np.cumsum(data)    # Cumulative sum
np.diff(data)      # Differences
np.corrcoef(x, y)  # 2x2 correlation matrix; r is at [0, 1]

Pandas Basics

Import:

import pandas as pd

Create DataFrame:

df = pd.DataFrame({
    'sales': [100, 110, 105, 115, 120],
    'advertising': [10, 12, 11, 13, 14],
    'region': ['North', 'South', 'North', 'West', 'South']
})

Read data:

df = pd.read_csv('data.csv')
df = pd.read_excel('data.xlsx')

Basic info:

df.head()          # First 5 rows
df.info()          # Data types, null counts
df.describe()      # Summary statistics
df.shape           # (rows, columns)
df.columns         # Column names

Select data:

df['sales']        # One column (Series)
df[['sales', 'advertising']]  # Multiple columns
df.iloc[0]         # First row by position
df.loc[0]          # First row by label
df[df['sales'] > 100]  # Filter rows

Descriptive Statistics with Pandas

Summary statistics:

df.describe()      # All numeric columns

# Individual metrics
df['sales'].mean()
df['sales'].median()
df['sales'].std()
df['sales'].var()
df['sales'].min()
df['sales'].max()
df['sales'].quantile(0.25)  # Q1
df['sales'].quantile(0.75)  # Q3

Group statistics:

df.groupby('region')['sales'].mean()
df.groupby('region').agg({
    'sales': ['mean', 'std', 'count']
})

Correlation matrix:

df.corr(numeric_only=True)  # All numeric columns (needed when df has text columns)
df[['sales', 'advertising']].corr()

Statistical Tests with SciPy

Import:

from scipy import stats

t-Tests:

# One-sample t-test
stats.ttest_1samp(data, popmean=100)

# Two-sample t-test (independent)
stats.ttest_ind(group1, group2)

# Paired t-test
stats.ttest_rel(before, after)

# Each returns a result that unpacks as (t-statistic, p-value)

Example:

group_a = [23, 25, 27, 29, 31]
group_b = [20, 22, 24, 26, 28]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("Significant difference!")
else:
    print("No significant difference")

Other tests:

# Chi-square test of independence (returns chi2, p-value, dof, expected counts)
stats.chi2_contingency(contingency_table)

# Pearson correlation (returns r, p-value)
stats.pearsonr(x, y)

# Spearman rank correlation (returns rho, p-value)
stats.spearmanr(x, y)

# Shapiro-Wilk normality test (returns W, p-value)
stats.shapiro(data)

# One-way ANOVA (returns F, p-value)
stats.f_oneway(group1, group2, group3)
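
A worked sketch of the chi-square test, using a small hypothetical contingency table (counts invented for illustration):

import pandas as pd
from scipy import stats

# Hypothetical counts: rows = region, columns = purchased (no / yes)
contingency_table = pd.DataFrame(
    [[30, 20], [25, 35]],
    index=['North', 'South'],
    columns=['no', 'yes']
)

chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print(f"chi2 = {chi2:.4f}, p-value = {p_value:.4f}, dof = {dof}")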

Linear Regression

Simple linear regression:

import numpy as np  # used below for predictions
from scipy.stats import linregress

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

slope, intercept, r_value, p_value, std_err = linregress(x, y)

print(f"Slope: {slope:.4f}")
print(f"Intercept: {intercept:.4f}")
print(f"R-squared: {r_value**2:.4f}")
print(f"p-value: {p_value:.4f}")

# Make predictions
y_pred = slope * np.array(x) + intercept

Multiple regression with statsmodels:

import statsmodels.api as sm

# Prepare data (assumes df also has a 'price' column)
X = df[['advertising', 'price']]
y = df['sales']

# Add constant (intercept)
X = sm.add_constant(X)

# Fit model
model = sm.OLS(y, X).fit()

# Summary
print(model.summary())

# Predictions
predictions = model.predict(X)

# Coefficients
print(model.params)

# R-squared
print(model.rsquared)

Regression with scikit-learn:

from sklearn.linear_model import LinearRegression

X = df[['advertising', 'price']]
y = df['sales']

model = LinearRegression()
model.fit(X, y)

print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")
print(f"R-squared: {model.score(X, y)}")

# Predict
predictions = model.predict(X)

Visualization with Matplotlib

Import:

import matplotlib.pyplot as plt

Basic plots:

# Line plot
plt.plot(x, y)
plt.xlabel('X label')
plt.ylabel('Y label')
plt.title('Title')
plt.show()

# Scatter plot
plt.scatter(x, y)
plt.show()

# Histogram
plt.hist(data, bins=10)
plt.show()

# Box plot
plt.boxplot(data)
plt.show()

# Bar chart
plt.bar(categories, values)
plt.show()

Multiple subplots:

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

axes[0, 0].plot(x, y)
axes[0, 0].set_title('Line Plot')

axes[0, 1].scatter(x, y)
axes[0, 1].set_title('Scatter Plot')

axes[1, 0].hist(data, bins=10)
axes[1, 0].set_title('Histogram')

axes[1, 1].boxplot(data)
axes[1, 1].set_title('Box Plot')

plt.tight_layout()
plt.show()

Visualization with Seaborn

Import:

import seaborn as sns
sns.set_theme()  # Better default styling

Distribution plots:

# Histogram with KDE
sns.histplot(data, kde=True)

# Box plot
sns.boxplot(data=df, x='region', y='sales')

# Violin plot
sns.violinplot(data=df, x='region', y='sales')

Relationship plots:

# Scatter with regression line
sns.regplot(x='advertising', y='sales', data=df)

# Scatter plot matrix
sns.pairplot(df)

# Correlation heatmap
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')

Categorical plots:

# Bar plot with error bars
sns.barplot(data=df, x='region', y='sales')

# Count plot
sns.countplot(data=df, x='region')

Time Series Analysis

Date handling:

df['date'] = pd.to_datetime(df['date'])
df.set_index('date', inplace=True)

Resampling:

# Monthly average ('ME' in pandas >= 2.2; older versions use 'M')
df_monthly = df.resample('ME').mean()

# Quarterly sum ('QE' in pandas >= 2.2; older versions use 'Q')
df_quarterly = df.resample('QE').sum()

Moving average:

df['MA_3'] = df['sales'].rolling(window=3).mean()

Exponential smoothing:

df['EXP'] = df['sales'].ewm(alpha=0.3, adjust=False).mean()  # higher alpha weights recent values more
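
With adjust=False this implements the recursion s_t = alpha * x_t + (1 - alpha) * s_(t-1); a quick sanity check:

import pandas as pd

s = pd.Series([10.0, 12.0, 11.0, 13.0])
alpha = 0.3

manual = [s.iloc[0]]
for x in s.iloc[1:]:
    manual.append(alpha * x + (1 - alpha) * manual[-1])

print(s.ewm(alpha=alpha, adjust=False).mean().tolist())
print(manual)  # matches (up to float rounding): [10.0, 10.6, 10.72, 11.404]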

Decomposition:

from statsmodels.tsa.seasonal import seasonal_decompose

decomposition = seasonal_decompose(df['sales'], model='additive', period=12)  # period=12 assumes monthly data

fig, axes = plt.subplots(4, 1, figsize=(10, 8))
decomposition.observed.plot(ax=axes[0])
decomposition.trend.plot(ax=axes[1])
decomposition.seasonal.plot(ax=axes[2])
decomposition.resid.plot(ax=axes[3])
plt.show()

Complete Analysis Workflow

Sales analysis example:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm

# 1. Load data
df = pd.read_csv('sales_data.csv')

# 2. Initial exploration
print(df.head())
print(df.info())
print(df.describe())

# 3. Data cleaning
df.dropna(inplace=True)  # Remove missing
df = df[df['sales'] > 0]  # Remove invalid

# 4. Descriptive statistics
print("\nSales Statistics:")
print(f"Mean: {df['sales'].mean():.2f}")
print(f"Median: {df['sales'].median():.2f}")
print(f"Std Dev: {df['sales'].std():.2f}")

# 5. Group analysis
regional_stats = df.groupby('region')['sales'].agg(['mean', 'std', 'count'])
print("\nRegional Statistics:")
print(regional_stats)

# 6. Visualizations
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Distribution
axes[0, 0].hist(df['sales'], bins=20, edgecolor='black')
axes[0, 0].set_title('Sales Distribution')
axes[0, 0].set_xlabel('Sales')

# Box plot by region
df.boxplot(column='sales', by='region', ax=axes[0, 1])
axes[0, 1].set_title('Sales by Region')

# Scatter: Sales vs Advertising
axes[1, 0].scatter(df['advertising'], df['sales'])
axes[1, 0].set_xlabel('Advertising')
axes[1, 0].set_ylabel('Sales')
axes[1, 0].set_title('Sales vs Advertising')

# Correlation heatmap
corr_data = df[['sales', 'advertising', 'price']].corr()
sns.heatmap(corr_data, annot=True, cmap='coolwarm', ax=axes[1, 1])

plt.tight_layout()
plt.show()

# 7. Statistical tests
# Compare regions
north = df[df['region'] == 'North']['sales']
south = df[df['region'] == 'South']['sales']
t_stat, p_value = stats.ttest_ind(north, south)
print(f"\nt-test North vs South: p-value = {p_value:.4f}")

# 8. Regression analysis
X = df[['advertising', 'price']]
y = df['sales']
X = sm.add_constant(X)

model = sm.OLS(y, X).fit()
print("\nRegression Results:")
print(model.summary())

# 9. Predictions
new_data = pd.DataFrame({
    'const': [1],
    'advertising': [15],
    'price': [25]
})
prediction = model.predict(new_data)
print(f"\nPredicted sales: {prediction.values[0]:.2f}")

A/B Test Analysis

Complete A/B test:

import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt

# Data
df = pd.read_csv('ab_test.csv')

# Separate groups
control = df[df['group'] == 'A']['conversion']
treatment = df[df['group'] == 'B']['conversion']

# Summary statistics
print(f"Control: n={len(control)}, mean={control.mean():.4f}")
print(f"Treatment: n={len(treatment)}, mean={treatment.mean():.4f}")
print(f"Lift: {(treatment.mean() - control.mean()) / control.mean() * 100:.2f}%")

# Statistical test
t_stat, p_value = stats.ttest_ind(control, treatment)
print(f"\nt-statistic: {t_stat:.4f}")
print(f"p-value: {p_value:.4f}")

if p_value < 0.05:
    print("Result: SIGNIFICANT")
else:
    print("Result: NOT SIGNIFICANT")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Distributions
axes[0].hist(control, alpha=0.5, label='Control', bins=20)
axes[0].hist(treatment, alpha=0.5, label='Treatment', bins=20)
axes[0].legend()
axes[0].set_xlabel('Conversion Rate')
axes[0].set_title('Distribution')

# Box plot
axes[1].boxplot([control, treatment], labels=['Control', 'Treatment'])  # use tick_labels= in Matplotlib >= 3.9
axes[1].set_ylabel('Conversion Rate')
axes[1].set_title('Comparison')

plt.tight_layout()
plt.show()
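
If conversion is recorded as 0/1, a two-proportion z-test is a common alternative to the t-test; a sketch using statsmodels (reuses the control and treatment Series from above):

from statsmodels.stats.proportion import proportions_ztest

successes = [control.sum(), treatment.sum()]   # conversions per group
nobs = [len(control), len(treatment)]          # sample sizes

z_stat, p_value = proportions_ztest(successes, nobs)
print(f"z-statistic: {z_stat:.4f}, p-value: {p_value:.4f}")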

Tips and Best Practices

1. Jupyter notebooks: Best for exploratory analysis

pip install jupyter
jupyter notebook

2. Virtual environments:

python -m venv myenv
source myenv/bin/activate  # Mac/Linux
myenv\Scripts\activate     # Windows

3. Save/load models:

import pickle

with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)    # save
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)   # load (only unpickle files you trust)

4. Suppress warnings:

import warnings
warnings.filterwarnings('ignore')  # use sparingly; this also hides useful warnings

5. Set random seed:

np.random.seed(42)  # Reproducibility (newer code: rng = np.random.default_rng(42))

Common Mistakes

1. Not checking data types:

df.dtypes  # Always check!

2. Forgetting to handle missing values:

df.isnull().sum()   # Count nulls per column
df = df.dropna()    # Option 1: drop rows with missing values (returns a copy)
df = df.fillna(0)   # Option 2: fill missing values with 0 (returns a copy)

3. Using the wrong ddof: NumPy's std() and var() default to ddof=0 (population), while Pandas defaults to ddof=1 (sample), so the same data gives different results. See the sketch below.
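
A quick demonstration of the discrepancy (same data, different defaults):

import numpy as np
import pandas as pd

values = [2, 4, 4, 4, 5, 5, 7, 9]

print(np.std(values))             # 2.0    (population, ddof=0)
print(pd.Series(values).std())    # ~2.138 (sample, ddof=1)
print(np.std(values, ddof=1))     # matches the pandas result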

4. Not resetting index after filtering:

df.reset_index(drop=True, inplace=True)
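
Why it matters (a quick sketch): after filtering, the original index labels survive, so label lookups can fail:

import pandas as pd

df = pd.DataFrame({'sales': [50, 150, 200]})
filtered = df[df['sales'] > 100]       # keeps index labels 1 and 2
# filtered.loc[0] would raise a KeyError here
filtered = filtered.reset_index(drop=True)
print(filtered.loc[0])                 # now works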

5. Mixing libraries unnecessarily: Pick one approach and stick with it

Resources

Documentation:

  • NumPy: numpy.org
  • Pandas: pandas.pydata.org
  • SciPy: scipy.org
  • Matplotlib: matplotlib.org
  • Seaborn: seaborn.pydata.org

Learning:

  • Python for Data Analysis (book by Wes McKinney)
  • Kaggle Learn courses
  • Real Python tutorials

Practice Exercise

Dataset: student scores

Create a DataFrame with columns: student_id, hours_studied, previous_score, final_score

Tasks:

  1. Load and explore data
  2. Calculate descriptive statistics
  3. Check correlation between variables
  4. Build regression model: final_score ~ hours_studied + previous_score
  5. Visualize relationships
  6. Test if hours_studied significantly affects scores

A solution template is provided in the course materials.
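
In the meantime, a minimal starter sketch (synthetic data for illustration; column names from the task list above):

import numpy as np
import pandas as pd
import statsmodels.api as sm

# Synthetic data for illustration
rng = np.random.default_rng(42)
n = 50
df = pd.DataFrame({
    'student_id': range(1, n + 1),
    'hours_studied': rng.uniform(1, 10, n).round(1),
    'previous_score': rng.uniform(40, 95, n).round(0),
})
df['final_score'] = (20 + 3 * df['hours_studied']
                     + 0.5 * df['previous_score']
                     + rng.normal(0, 5, n)).round(1)

# 1-2. Explore and describe
print(df.describe())

# 3. Correlations
print(df[['hours_studied', 'previous_score', 'final_score']].corr())

# 4. Regression: final_score ~ hours_studied + previous_score
X = sm.add_constant(df[['hours_studied', 'previous_score']])
model = sm.OLS(df['final_score'], X).fit()
print(model.summary())

# 6. The hours_studied p-value in the summary tests its significance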

Congratulations!

You've completed the Statistics course!

You now know:

  ✓ Descriptive statistics
  ✓ Probability and distributions
  ✓ Hypothesis testing
  ✓ Regression analysis
  ✓ A/B testing
  ✓ Time series forecasting
  ✓ Excel and Python implementation

Next steps:

  • Practice on real datasets
  • Kaggle competitions
  • Apply to your work
  • Keep learning advanced topics (Bayesian stats, ML, etc.)

Tip: Python is powerful but Excel is ubiquitous - know both well!
