Machine Learning Glossary: ML Terms for Data Analysts

Machine Learning sounds intimidating, but analysts use ML daily — forecasting demand, recommending products, detecting fraud. This glossary demystifies the jargon.

📚 Beginner · ⏱️ 7 min
🤖

Machine Learning Basics

| Term | Definition | Example |
|------|------------|---------|
| Machine Learning | Algorithms that learn patterns from data without explicit rules | Email spam filter learns from examples vs hardcoded "if contains 'free' → spam" |
| Training | Process of teaching a model using historical data | Feed 10,000 labeled emails to learn spam patterns |
| Model | Mathematical representation of learned patterns | Equation, decision tree, neural network |
| Feature | Input variable used for prediction | Email: sender, subject length, # of links, word frequency |
| Label/Target | Output you want to predict | Spam/Not Spam, House Price, Customer Churn |
| Prediction | Model's output for new data | "This email is 92% likely spam" |
| Overfitting | Model memorizes training data, fails on new data | Like memorizing answers vs understanding concepts |
| Underfitting | Model too simple to capture patterns | Using a straight line for a curved relationship |
| Generalization | Model performs well on unseen data | True measure of model quality |
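To make the vocabulary concrete, here is a minimal sketch (on made-up numbers) of a one-feature least-squares fit. "Training" computes two numbers, those two numbers *are* the model, the x-values are the feature, the y-values are the labels, and calling the model on a new input is a prediction:

```python
# Toy illustration of training / model / feature / label / prediction.
# The data is made up; y happens to follow y = 2x + 1 exactly.

def train(xs, ys):
    """Learn slope w and intercept b from (feature, label) pairs
    by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - w * mean_x
    return w, b              # the "model" is just these two numbers

def predict(model, x):
    w, b = model
    return w * x + b

xs = [1, 2, 3, 4]            # feature: e.g. ad spend
ys = [3, 5, 7, 9]            # label:   e.g. units sold
model = train(xs, ys)
print(predict(model, 5))     # prediction for unseen input → 11.0
```

Real models have thousands of parameters instead of two, but the train-then-predict shape is the same.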

🎯

Types of Machine Learning

| Type | Description | Common Algorithms | Use Cases |
|------|-------------|-------------------|-----------|
| Supervised Learning | Learn from labeled data (input + correct answer) | Linear/Logistic Regression, Random Forest, XGBoost | Predict sales, classify emails, forecast demand |
| Unsupervised Learning | Find patterns in unlabeled data | K-Means, PCA, DBSCAN | Customer segmentation, anomaly detection |
| Regression | Predict continuous number (supervised) | Linear Regression, XGBoost | House prices, sales revenue, temperature |
| Classification | Predict category (supervised) | Logistic Regression, Random Forest, Neural Networks | Spam detection, loan approval, image recognition |
| Clustering | Group similar items (unsupervised) | K-Means, Hierarchical, DBSCAN | Customer segments, product categorization |
| Dimensionality Reduction | Reduce features while keeping info | PCA, t-SNE, UMAP | Visualize high-dim data, compress features |
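Unsupervised learning is the least intuitive entry in the table, so here is a toy sketch of k-means on 1-D data (points and starting centers are invented for illustration; real k-means runs on many features and usually picks starting centers randomly):

```python
# 1-D k-means with k=2: alternate between assigning each point to its
# nearest center and moving each center to the mean of its cluster.

def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: each point joins its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to its cluster's mean
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1, 2, 3, 10, 11, 12]          # two obvious groups, no labels
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
print(centers)                          # → [2.0, 11.0]
```

Note that no labels were given: the algorithm discovered the two groups on its own, which is exactly what "unsupervised" means.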

⚙️

Common Algorithms

| Algorithm | Type | Best For | Pros | Cons |
|-----------|------|----------|------|------|
| Linear Regression | Regression | Simple relationships | Fast, interpretable | Only linear patterns |
| Logistic Regression | Classification | Binary outcomes | Probabilistic output, fast | Linear decision boundary |
| Decision Tree | Both | Explainable rules | Easy to visualize | Overfits easily |
| Random Forest | Both | Tabular data | Accurate, robust | Black box, slower |
| XGBoost | Both | Competitions, high accuracy | State-of-the-art for tabular data | Needs tuning |
| K-Means | Clustering | Customer segmentation | Fast, simple | Must choose K |
| Neural Network | Both | Images, text, complex patterns | Can approximate very complex functions | Needs big data, slow |
| KNN | Both | Small datasets | No training phase, simple | Slow predictions |
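KNN is simple enough to write out in full, and doing so shows why the table lists "no training phase" as a pro and "slow predictions" as a con: every prediction scans the whole dataset. A toy sketch on made-up 2-D points:

```python
# k-nearest neighbours: classify a query point by majority vote
# among the k closest labelled points. Toy data, two classes.
from collections import Counter
import math

def knn_predict(train_data, query, k=3):
    """train_data: list of ((x, y), label) pairs."""
    by_distance = sorted(train_data,
                         key=lambda item: math.dist(item[0], query))
    top_labels = [label for _, label in by_distance[:k]]
    return Counter(top_labels).most_common(1)[0][0]

data = [((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
        ((8, 8), "blue"), ((8, 9), "blue"), ((9, 8), "blue")]
print(knn_predict(data, (2, 2)))   # → red
```

There is no `train()` step at all; the "model" is the data itself, which is why KNN predictions get slower as the dataset grows.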


📊

Model Evaluation Metrics

Regression Metrics:

| Metric | Full Name | Interpretation |
|--------|-----------|----------------|
| MAE | Mean Absolute Error | Average error magnitude (same units as target) |
| RMSE | Root Mean Squared Error | Penalizes large errors more than MAE |
| R² | Coefficient of determination (0 to 1) | % of variance explained (higher = better) |
| MAPE | Mean Absolute Percentage Error | Error as percentage of actual value |
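Computing all four metrics by hand on toy numbers makes the differences concrete (the actual/predicted values below are invented for illustration):

```python
# Regression metrics worked out in plain Python on made-up data.
import math

actual    = [100, 200, 300, 400]
predicted = [110, 190, 310, 390]

errors = [p - a for p, a in zip(predicted, actual)]
mae  = sum(abs(e) for e in errors) / len(errors)            # 10.0
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # 10.0 here,
                                                            # larger than MAE
                                                            # if errors vary
mean_a = sum(actual) / len(actual)
ss_res = sum(e * e for e in errors)
ss_tot = sum((a - mean_a) ** 2 for a in actual)
r2   = 1 - ss_res / ss_tot                                  # 0.992
mape = sum(abs(e) / a for e, a in zip(errors, actual)) \
       / len(actual) * 100                                  # ≈ 5.21%
print(mae, rmse, r2, mape)
```

Because every error here has the same magnitude (10), MAE and RMSE coincide; with one large outlier error, RMSE would jump well above MAE.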

Classification Metrics:

| Metric | When to Use |
|--------|-------------|
| Accuracy | Balanced classes (roughly 50-50 split) |
| Precision | Minimize false positives (spam filter: don't block real emails) |
| Recall | Minimize false negatives (fraud: catch all fraudulent transactions) |
| F1-Score | Balance precision and recall |
| ROC-AUC | Overall performance across thresholds |

Confusion Matrix:

  • True Positive (TP): Correctly predicted positive
  • False Positive (FP): Incorrectly predicted positive (Type I error)
  • True Negative (TN): Correctly predicted negative
  • False Negative (FN): Incorrectly predicted negative (Type II error)
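The four confusion-matrix cells are just counts, and the classification metrics above are ratios of those counts. A toy sketch on made-up labels (1 = positive):

```python
# Count confusion-matrix cells, then derive the classification metrics.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # 3
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # 2
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # 4
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # 1

accuracy  = (tp + tn) / (tp + fp + tn + fn)   # 0.7
precision = tp / (tp + fp)                    # 0.6  (of flagged, how many real?)
recall    = tp / (tp + fn)                    # 0.75 (of real, how many caught?)
f1        = 2 * precision * recall / (precision + recall)
print(accuracy, precision, recall, f1)
```

Notice precision and recall disagree (0.6 vs 0.75): this model catches most positives but raises false alarms, which is exactly the tradeoff F1 summarizes.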

🔑

Key Concepts

| Term | Definition | Example |
|------|------------|---------|
| Train-Test Split | 80% train, 20% test (evaluate on unseen data) | Guards against overly optimistic evaluation |
| Cross-Validation | K-fold: split into K parts, train K times | More robust than a single train-test split |
| Feature Engineering | Create new features from existing data | From date: day_of_week, month, is_weekend |
| One-Hot Encoding | Convert categories to binary columns | City: Mumbai=[1,0,0], Delhi=[0,1,0] |
| Normalization | Scale features to 0-1 range | Makes algorithms converge faster |
| Hyperparameter | Settings you choose (not learned) | Learning rate, tree depth, K in KNN |
| Ensemble | Combine multiple models | Random Forest = ensemble of trees |
| Bias-Variance Tradeoff | Balance between model simplicity and complexity | Low bias + low variance = ideal |
| Regularization | Penalty for model complexity | Prevents overfitting |
| Learning Curve | Plot train/test accuracy vs data size | Diagnoses overfitting/underfitting |
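Two of the table's entries, train-test split and one-hot encoding, fit in a few lines of plain Python each (toy data, fixed random seed so the split is reproducible; real pipelines typically use library helpers for both):

```python
# Train-test split: shuffle, then cut at the 80% mark.
import random

rows = list(range(10))           # stand-in for 10 data rows
random.seed(42)                  # fixed seed for reproducibility
random.shuffle(rows)
split = int(len(rows) * 0.8)
train, test = rows[:split], rows[split:]   # 8 rows train, 2 rows test

# One-hot encoding: one binary column per known category.
cities = ["Mumbai", "Delhi", "Pune"]
def one_hot(city):
    return [1 if city == c else 0 for c in cities]

print(one_hot("Delhi"))          # → [0, 1, 0]
```

The key discipline with the split is that the test rows are never touched during training; they exist only to estimate how the model generalizes.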
