Machine Learning Basics
| Term | Definition | Example |
|------|------------|---------|
| Machine Learning | Algorithms that learn patterns from data without explicit rules | A spam filter learns from examples instead of a hardcoded rule like "if it contains 'free' → spam" |
| Training | The process of teaching a model using historical data | Feed 10,000 labeled emails to learn spam patterns |
| Model | Mathematical representation of learned patterns | An equation, decision tree, or neural network |
| Feature | Input variable used for prediction | For an email: sender, subject length, number of links, word frequency |
| Label/Target | The output you want to predict | Spam/not spam, house price, customer churn |
| Prediction | The model's output for new data | "This email is 92% likely to be spam" |
| Overfitting | The model memorizes the training data and fails on new data | Like memorizing answers instead of understanding concepts |
| Underfitting | The model is too simple to capture the patterns | Fitting a straight line to a curved relationship |
| Generalization | The model performs well on unseen data | The true measure of model quality |
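The overfitting row above can be made concrete with a small numerical sketch: a degree-9 polynomial has enough flexibility to memorize 10 noisy points, driving its training error to (near) zero, while a simpler quadratic fit captures the true pattern. The data and polynomial degrees here are illustrative, not from the source:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a noisy quadratic relationship (values are illustrative).
x_train = np.linspace(0.0, 1.0, 10)
y_train = x_train**2 + rng.normal(0.0, 0.05, 10)
x_test = np.linspace(0.05, 0.95, 10)
y_test = x_test**2 + rng.normal(0.0, 0.05, 10)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

simple = np.polyfit(x_train, y_train, 2)        # matches the true pattern
complex_fit = np.polyfit(x_train, y_train, 9)   # flexible enough to memorize all 10 points

# The degree-9 fit scores (near) zero error on the training data it memorized,
# but typically does worse than the quadratic on the held-out points.
print("simple:  train", mse(simple, x_train, y_train), "test", mse(simple, x_test, y_test))
print("complex: train", mse(complex_fit, x_train, y_train), "test", mse(complex_fit, x_test, y_test))
```

Comparing the train and test columns of the two fits is exactly the "memorizing answers vs. understanding concepts" contrast from the table.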
Types of Machine Learning
| Type | Description | Common Algorithms | Use Cases |
|------|-------------|-------------------|-----------|
| Supervised Learning | Learn from labeled data (input + correct answer) | Linear/Logistic Regression, Random Forest, XGBoost | Predict sales, classify emails, forecast demand |
| Unsupervised Learning | Find patterns in unlabeled data | K-Means, PCA, DBSCAN | Customer segmentation, anomaly detection |
| Regression | Predict a continuous number (supervised) | Linear Regression, XGBoost | House prices, sales revenue, temperature |
| Classification | Predict a category (supervised) | Logistic Regression, Random Forest, Neural Networks | Spam detection, loan approval, image recognition |
| Clustering | Group similar items (unsupervised) | K-Means, Hierarchical, DBSCAN | Customer segments, product categorization |
| Dimensionality Reduction | Reduce the number of features while keeping information | PCA, t-SNE, UMAP | Visualize high-dimensional data, compress features |
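The supervised vs. unsupervised distinction can be sketched in a few lines with scikit-learn (assumed available); the four toy points and their labels are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Two well-separated groups of 2-D points (illustrative data).
X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [5.2, 4.8]])
y = np.array([0, 0, 1, 1])  # labels available -> supervised learning

# Supervised: learn the mapping from X to y, then classify a new point.
clf = LogisticRegression().fit(X, y)
print(clf.predict([[5.0, 5.0]]))  # point near the second group

# Unsupervised: no labels given; group the same points into 2 clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # two groups; which group gets id 0 vs 1 is arbitrary
```

Note the asymmetry: the classifier needed `y` to train, while K-Means recovered the same grouping from `X` alone (up to an arbitrary relabeling of the clusters).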
Common Algorithms
| Algorithm | Type | Best For | Pros | Cons |
|-----------|------|----------|------|------|
| Linear Regression | Regression | Simple relationships | Fast, interpretable | Captures only linear patterns |
| Logistic Regression | Classification | Binary outcomes | Probabilistic output, fast | Linear decision boundary |
| Decision Tree | Both | Explainable rules | Easy to visualize | Overfits easily |
| Random Forest | Both | Tabular data | Accurate, robust | Black box, slower |
| XGBoost | Both | Competitions, high accuracy | State of the art for tabular data | Needs tuning |
| K-Means | Clustering | Customer segmentation | Fast, simple | Must choose K |
| Neural Network | Both | Images, text, complex patterns | Very flexible function approximator | Needs lots of data, slow to train |
| KNN | Both | Small datasets | No training phase, simple | Slow predictions |
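The "overfits easily" vs. "accurate, robust" rows can be seen side by side on scikit-learn's built-in Iris dataset (scikit-learn assumed available; the split and hyperparameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# An unpruned tree grows until it fits its training data perfectly (memorization).
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# A random forest averages many randomized trees for a more robust fit.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

print("tree   train accuracy:", tree.score(X_tr, y_tr))    # 1.0
print("tree   test  accuracy:", tree.score(X_te, y_te))
print("forest test  accuracy:", forest.score(X_te, y_te))
```

The tree's perfect training score is the warning sign from the table: judge models by their test score, not their training score.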
Model Evaluation Metrics
Regression Metrics:
| Metric | Full Name | Interpretation |
|--------|-----------|----------------|
| MAE | Mean Absolute Error | Average error magnitude (same units as target) |
| RMSE | Root Mean Squared Error | Penalizes large errors more than MAE |
| R² | Coefficient of Determination | Fraction of variance explained, typically 0 to 1 (higher = better) |
| MAPE | Mean Absolute Percentage Error | Error as a percentage of the actual value |
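The four regression metrics above can be computed with scikit-learn and NumPy (both assumed available); the true/predicted values below are illustrative:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 380.0])

mae = mean_absolute_error(y_true, y_pred)           # (10+10+30+20)/4 = 17.5
rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # sqrt(375) ≈ 19.36
r2 = r2_score(y_true, y_pred)                       # 1 - 1500/50000 = 0.97
mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # 7.5 (percent)

print(mae, rmse, r2, mape)
```

Note how RMSE (≈19.4) exceeds MAE (17.5) here: the single 30-unit error is squared before averaging, so it dominates the RMSE, which is exactly the "penalizes large errors" behavior from the table.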
Classification Metrics:
| Metric | When to Use |
|--------|-------------|
| Accuracy | Roughly balanced classes (e.g., near 50-50 split) |
| Precision | Minimize false positives (spam filter: don't block real emails) |
| Recall | Minimize false negatives (fraud: catch all fraudulent transactions) |
| F1-Score | Balance precision and recall |
| ROC-AUC | Overall performance across all classification thresholds |
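These metrics are all one call each in scikit-learn (assumed available); the label vectors below are made up so the counts are easy to check by hand:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # 2 TP, 2 FN, 1 FP, 5 TN

print(accuracy_score(y_true, y_pred))   # (TP+TN)/total = 7/10 = 0.7
print(precision_score(y_true, y_pred))  # TP/(TP+FP)   = 2/3 ≈ 0.667
print(recall_score(y_true, y_pred))     # TP/(TP+FN)   = 2/4 = 0.5
print(f1_score(y_true, y_pred))         # 2PR/(P+R)    = 4/7 ≈ 0.571
```

The gap between precision (0.667) and recall (0.5) shows why accuracy alone can mislead: this model misses half the positives even though it is right 70% of the time overall.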
Confusion Matrix:
- True Positive (TP): Correctly predicted positive
- False Positive (FP): Incorrectly predicted positive (Type I error)
- True Negative (TN): Correctly predicted negative
- False Negative (FN): Incorrectly predicted negative (Type II error)
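The four cells above can be read straight out of scikit-learn's `confusion_matrix` (scikit-learn assumed available); the labels are illustrative:

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For labels [0, 1], scikit-learn orders the matrix as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 5 1 2 2
```

Watch the ordering: the negative class comes first, so the top-left cell is TN, not TP as some textbook diagrams draw it.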
Key Concepts
| Term | Definition | Example |
|------|------------|---------|
| Train-Test Split | e.g., 80% train, 20% test; evaluate on unseen data | Prevents an overly optimistic evaluation |
| Cross-Validation | K-fold: split data into K parts, train K times | More robust than a single train-test split |
| Feature Engineering | Create new features from existing data | From a date: day_of_week, month, is_weekend |
| One-Hot Encoding | Convert categories to binary columns | City: Mumbai = [1,0,0], Delhi = [0,1,0] |
| Normalization | Scale features to the 0-1 range | Helps many algorithms converge faster |
| Hyperparameter | A setting you choose (not learned from data) | Learning rate, tree depth, K in KNN |
| Ensemble | Combine multiple models | Random Forest = an ensemble of decision trees |
| Bias-Variance | Tradeoff between model simplicity and complexity | Low bias + low variance = ideal |
| Regularization | Penalty for model complexity | Helps prevent overfitting |
| Learning Curve | Plot of train/test performance vs. training-set size | Diagnoses overfitting/underfitting |
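Three of the concepts above (train-test split, K-fold cross-validation, one-hot encoding) can be sketched with scikit-learn and pandas (both assumed available), again using the built-in Iris dataset:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

# Train-test split: hold out 20% of the 150 Iris rows for evaluation.
X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_tr), len(X_te))  # 120 30

# 5-fold cross-validation: train 5 times, each fold serving once as the test set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())

# One-hot encoding: each category becomes its own binary column.
cities = pd.DataFrame({"city": ["Mumbai", "Delhi", "Mumbai"]})
print(pd.get_dummies(cities, columns=["city"]))
```

The cross-validation mean summarizes 5 separate evaluations, which is why it is a steadier estimate than any single split.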