Introduction to Machine Learning

What is Machine Learning?

Machine learning (ML) is the practice of teaching computers to learn from data.

Instead of programming explicit rules, we show the computer examples and it works out the patterns itself.

Traditional Programming vs ML

Traditional:

Rules + Data → Answers

Machine Learning:

Data + Answers → Rules (Model)
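
To make the contrast concrete, here is a minimal sketch that uses temperature conversion as a stand-in problem. The example and the helper name `fit_line` are illustrative assumptions, not part of the original lesson. The traditional version hard-codes the rule; the ML-style version recovers roughly the same rule from example inputs and answers using ordinary least squares.

```python
# Traditional programming: a person writes the rule by hand.
def celsius_to_fahrenheit(celsius):
    return celsius * 9 / 5 + 32


# ML-style: recover the rule from examples (data + answers).
def fit_line(xs, ys):
    """Fit y = slope * x + intercept with ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


celsius = [-10, 0, 10, 20, 30]                            # data
fahrenheit = [celsius_to_fahrenheit(c) for c in celsius]  # answers

slope, intercept = fit_line(celsius, fahrenheit)
print(f"Learned rule: F = {slope:.2f} * C + {intercept:.2f}")
```

The fitted slope and intercept play the role of the "Rules (Model)" on the right-hand side of the arrow: nobody typed them in, they were estimated from the examples.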

Types of Machine Learning

1. Supervised Learning

Learn from labeled data (we know the answers):

Classification: Predict categories
- Is this email spam or not?
- Is this tumor benign or malignant?

Regression: Predict numbers
- What will the house price be?
- How many sales next month?
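
The difference between the two shows up in what the model returns. The sketch below uses small made-up datasets (hours studied and house sizes are assumptions chosen for illustration) with scikit-learn's `LogisticRegression` for the classification case and `LinearRegression` for the regression case.

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: hours studied -> pass (1) or fail (0). Toy data.
hours = [[1], [2], [3], [4], [5], [6], [7], [8]]
passed = [0, 0, 0, 0, 1, 1, 1, 1]
classifier = LogisticRegression().fit(hours, passed)
class_predictions = classifier.predict([[1], [9]])
print(f"Predicted categories: {class_predictions}")

# Regression: house size in square metres -> price in thousands. Toy data.
sizes = [[50], [80], [100], [120], [150]]
prices = [150, 240, 300, 360, 450]
regressor = LinearRegression().fit(sizes, prices)
price_prediction = regressor.predict([[110]])
print(f"Predicted number: {price_prediction[0]:.1f}")
```

The classifier can only answer with one of the labels it was trained on, while the regressor can return any number on a continuous scale.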

2. Unsupervised Learning

Find patterns in unlabeled data:

Clustering: Group similar items
- Customer segments
- Similar documents
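
As a rough sketch of clustering, the example below groups a handful of made-up customers with scikit-learn's `KMeans`. The data and the choice of two clusters are assumptions for illustration. Note that no labels are passed to the algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy customer data: [annual spend in thousands, visits per month].
# No labels are provided; the algorithm only sees the features.
customers = np.array([
    [2, 1], [3, 2], [2, 2],        # low spend, few visits
    [20, 12], [22, 15], [21, 13],  # high spend, many visits
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
segments = kmeans.fit_predict(customers)
print(f"Segment per customer: {segments}")
```

The cluster numbers themselves are arbitrary. What matters is which customers end up sharing a number.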

3. Reinforcement Learning

Learn by trial and error, guided by rewards and penalties:

- Game-playing AI
- Self-driving cars
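
Trial-and-error learning can be sketched without any ML library. The toy problem below is an assumption chosen for brevity, not something from the original lesson: an agent repeatedly picks one of three hypothetical slot machines, sees a reward of 0 or 1, and gradually learns which machine pays best. It mostly uses the best machine found so far but sometimes tries a random one (a simple strategy often called epsilon-greedy).

```python
import random

random.seed(0)  # fixed seed so repeated runs behave the same

# Hypothetical slot machines: each pays a reward of 1 with this probability.
# The agent does not know these values and must discover them.
win_probability = [0.1, 0.3, 0.9]

estimates = [0.0, 0.0, 0.0]  # agent's running estimate of each machine's payout
pulls = [0, 0, 0]            # how often each machine has been tried
epsilon = 0.1                # fraction of steps spent exploring at random

for step in range(2000):
    if random.random() < epsilon:
        arm = random.randrange(3)              # explore: try a random machine
    else:
        arm = estimates.index(max(estimates))  # exploit: use the best so far
    reward = 1 if random.random() < win_probability[arm] else 0
    pulls[arm] += 1
    # Incremental average: nudge the estimate toward the observed reward.
    estimates[arm] += (reward - estimates[arm]) / pulls[arm]

best_arm = estimates.index(max(estimates))
print(f"Estimated payouts: {[round(e, 2) for e in estimates]}")
print(f"Best machine found: {best_arm}")
```

Real reinforcement learning problems such as games and driving add states and long-term consequences, but the loop of act, observe a reward, and update is the same basic idea.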

The ML Workflow

1. Collect Data
2. Prepare Data (clean, transform)
3. Split Data (train/test)
4. Choose Model
5. Train Model
6. Evaluate Model
7. Improve & Repeat
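
The steps above can be sketched in a few lines of scikit-learn. This is one possible arrangement under stated assumptions: the data is synthetic (`make_classification`), and `StandardScaler` and `LogisticRegression` are arbitrary stand-ins for "prepare" and "choose model". The scaling step is fitted after the split here, since a transform that learns from the data should only see the training set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. Collect data (synthetic here, standing in for a real dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# 3. Split data first, so the test set stays unseen during preparation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2. Prepare data: learn the scaling from the training set only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# 4. Choose model and 5. Train model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# 6. Evaluate model on data it has not seen
test_accuracy = model.score(X_test_scaled, y_test)
print(f"Test accuracy: {test_accuracy:.0%}")

# 7. Improve & repeat: adjust the data, model, or settings and rerun
```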

Scikit-Learn Basics

Scikit-learn is one of the most widely used ML libraries for Python:


# Install: pip install scikit-learn

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Sample data
X = np.array([[1], [2], [3], [4], [5]])  # Features
y = np.array([2, 4, 6, 8, 10])           # Target

# Split data (random_state makes the split reproducible)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Create and train model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)
print(predictions)
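
After fitting, it can be useful to look at what the model actually learned. The sketch below refits the same toy data on all five points (skipping the split is a simplification for this illustration) and reads the fitted slope and intercept from `coef_` and `intercept_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])  # Features
y = np.array([2, 4, 6, 8, 10])           # Target (always 2 * x)

model = LinearRegression()
model.fit(X, y)

# Attributes ending in an underscore are set during fit()
slope = model.coef_[0]
intercept = model.intercept_
print(f"Learned rule: y = {slope:.2f} * x + {intercept:.2f}")
```

Because the target is exactly twice the feature, the learned slope should come out very close to 2 and the intercept very close to 0.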

Key ML Concepts

Features (X)

The input data used to make predictions:

- Age, income, location (for loan approval)
- Pixels (for image classification)
- Words (for text classification)

Target (y)

What we want to predict:

- Loan approved/rejected
- Cat/dog
- Spam/not spam
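
In practice, raw records usually have to be separated into features and a target before training. The sketch below does this for a few hypothetical loan applications. The field names and values are made up for illustration.

```python
# Hypothetical loan applications; the field names are made up for this sketch.
applications = [
    {"age": 25, "income": 40, "approved": 0},
    {"age": 35, "income": 60, "approved": 1},
    {"age": 45, "income": 80, "approved": 1},
    {"age": 22, "income": 30, "approved": 0},
]

feature_names = ["age", "income"]

# X: one row per example, one column per feature
X = [[row[name] for name in feature_names] for row in applications]

# y: one answer per example, in the same order as the rows of X
y = [row["approved"] for row in applications]

print(f"X = {X}")
print(f"y = {y}")
```

The row order must stay aligned: the first entry of y is the answer for the first row of X, and so on.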

Training

Showing the model examples so it learns patterns:


model.fit(X_train, y_train)

Prediction

Using the trained model on new data:


predictions = model.predict(X_new)
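
A common stumbling block is the shape of the new data. The sketch below reuses the toy linear data from earlier, with `X_new` values invented for illustration, to show that new inputs must be laid out like the training features: a 2D array with one row per sample and one column per feature.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.array([[1], [2], [3], [4], [5]])
y_train = np.array([2, 4, 6, 8, 10])
model = LinearRegression().fit(X_train, y_train)

# New inputs need the same layout as the training features:
# a 2D array with one row per sample and one column per feature.
X_new = np.array([[6], [7.5]])
predictions = model.predict(X_new)
print(predictions)

# A single value still has to be wrapped as one row with one column.
single_prediction = model.predict(np.array([[10]]))
print(single_prediction[0])
```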

Simple Classification Example


from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
import numpy as np

# Simple dataset: predict whether someone buys a product
# Features: age, income (in thousands)
X = np.array([
    [25, 40], [30, 50], [35, 60], [40, 70],
    [45, 80], [50, 90], [22, 30], [28, 35],
    [55, 95], [60, 100]
])

# Target: 1 = bought, 0 = didn't buy
y = np.array([0, 0, 1, 1, 1, 1, 0, 0, 1, 1])

# Split (random_state makes the split reproducible)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# Predict
predictions = model.predict(X_test)
print(f"Predictions: {predictions}")
print(f"Actual: {y_test}")

# Accuracy
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.0%}")
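
For classifiers, `score` reports accuracy: the fraction of test samples whose predicted label matches the actual label. A minimal hand-written version, with a helper name and sample labels that are assumptions for illustration, looks like this:

```python
def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    correct = sum(
        1 for actual, predicted in zip(y_true, y_pred) if actual == predicted
    )
    return correct / len(y_true)


# Hypothetical labels for three test samples
actual = [1, 0, 1]
predicted = [1, 1, 1]

result = accuracy(actual, predicted)
print(f"Accuracy: {result:.0%}")
```

Two of the three predictions match here, so the result is about 67%. With a test set this small, a single wrong prediction moves the score a lot, so treat accuracy on tiny datasets as a rough signal only.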

Common ML Algorithms

Algorithm	Type	Typical Use or Strength
Linear Regression	Regression	Price prediction
Logistic Regression	Classification	Yes/No decisions
Decision Tree	Both	Easy to interpret
Random Forest	Both	Often high accuracy with little tuning
KNN	Both	Simple, no explicit training phase
SVM	Both	Complex boundaries
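
Because scikit-learn estimators share the same `fit` and `score` methods, trying several of these algorithms on one dataset takes only a loop. The sketch below compares the classifiers from the table on the iris dataset. The dataset choice and the settings (such as `max_iter=1000`) are assumptions for illustration, and the resulting scores say little about how the algorithms rank on other problems.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "SVM": SVC(),
}

# Every model is trained and evaluated through the same two calls.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.0%}")
```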

Overfitting vs Underfitting

Overfitting

- Model learns the training data too well, including its noise
- Memorizes instead of generalizing
- Performs well on training data but poorly on new data

Underfitting

- Model is too simple
- Doesn't capture the underlying patterns
- Performs poorly on both training and new data

Goal: Find the right balance!
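
One way to see both problems is to compare training and test accuracy as model complexity changes. The sketch below is an illustration under assumptions: it uses noisy synthetic data (`make_moons`) and decision trees of different depths. A score that is low on both sets suggests underfitting, while a large gap between training and test accuracy is the usual sign of overfitting. Exact numbers depend on the data and the split.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data, so there is random variation a model can memorize
X, y = make_moons(n_samples=300, noise=0.35, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = {}
# max_depth=None lets the tree keep growing until it fits the training data
for depth in [1, 3, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    results[depth] = (train_acc, test_acc)
    print(f"max_depth={depth}: train={train_acc:.0%}, test={test_acc:.0%}")
```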

Complete Example


from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the classic iris dataset
iris = load_iris()
X = iris.data
y = iris.target

print(f"Features: {iris.feature_names}")
print(f"Classes: {iris.target_names}")
print(f"Data shape: {X.shape}")

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Evaluate
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

print(f"\nTraining accuracy: {train_acc:.0%}")
print(f"Test accuracy: {test_acc:.0%}")
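
Once a model like this is trained, it can classify a flower it has never seen. The sketch below is a follow-up under assumptions: the measurements in `new_flower` are invented, and for brevity the model is trained on the full dataset rather than on a training split.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(iris.data, iris.target)

# Hypothetical measurements in the dataset's feature order:
# sepal length, sepal width, petal length, petal width (all in cm)
new_flower = [[5.0, 3.4, 1.5, 0.2]]

# predict() returns a class index; target_names maps it to a species name
predicted_index = model.predict(new_flower)[0]
predicted_name = iris.target_names[predicted_index]
print(f"Predicted species: {predicted_name}")
```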

Key Points

- ML learns patterns from data
- Supervised: Has labels (classification, regression)
- Unsupervised: No labels (clustering)
- Split data into train and test sets
- Use scikit-learn for ML in Python
- Watch out for overfitting

What's Next?

Learn how to properly split data for training and testing.