2.3 Classification Models
Key Takeaways
- Classification models predict discrete categories (spam/not spam, disease/no disease) — the output is a class label, not a number.
- Binary classification has two possible outcomes (yes/no); multi-class classification has three or more possible outcomes.
- Key evaluation metrics: accuracy (overall correctness), precision (quality of positive predictions), recall (ability to find all positives), F1-score (balance of precision and recall).
- A confusion matrix shows true positives, true negatives, false positives, and false negatives — it is the foundation for calculating all classification metrics.
- The AI-900 tests conceptual understanding of classification — know what the metrics mean, not how to calculate them.
Classification Models
Quick Answer: Classification models predict discrete categories like spam/not spam or disease/no disease. Binary classification has two outcomes; multi-class has three or more. Evaluation metrics include accuracy, precision, recall, and F1-score. The confusion matrix shows true/false positives and negatives.
What Is Classification?
Classification is a supervised machine learning technique that predicts a discrete category (class) for each input. Unlike regression, which outputs a continuous number, classification outputs a label.
How to Identify a Classification Problem
Ask yourself: Is the predicted output a category or label?
- Predicting if an email is spam or not → Classification (two categories)
- Predicting a house price → Regression (continuous number)
- Predicting the species of a flower (setosa, versicolor, virginica) → Classification (three categories)
- Grouping customers by behavior → Clustering (unsupervised)
Binary vs. Multi-Class Classification
Binary Classification
Predicts one of two possible outcomes:
| Use Case | Class 0 (Negative) | Class 1 (Positive) |
|---|---|---|
| Spam detection | Not Spam | Spam |
| Medical diagnosis | No Disease | Disease |
| Fraud detection | Legitimate | Fraudulent |
| Loan approval | Deny | Approve |
| Defect detection | Good | Defective |
Multi-Class Classification
Predicts one of three or more possible outcomes:
| Use Case | Possible Classes |
|---|---|
| Image recognition | Cat, Dog, Bird, Fish |
| Sentiment analysis | Positive, Negative, Neutral |
| Customer segmentation | Premium, Standard, Basic |
| Document classification | Invoice, Receipt, Contract, Letter |
| Support ticket routing | Billing, Technical, Account, General |
The Confusion Matrix
The confusion matrix is a table that shows how well a classification model performs by comparing predictions to actual values. For binary classification:
| | Predicted: Positive | Predicted: Negative |
|---|---|---|
| Actual: Positive | True Positive (TP) | False Negative (FN) |
| Actual: Negative | False Positive (FP) | True Negative (TN) |
- True Positive (TP): Model correctly predicts positive (detected spam that was actually spam)
- True Negative (TN): Model correctly predicts negative (correctly identified a legitimate email)
- False Positive (FP): Model incorrectly predicts positive (flagged a legitimate email as spam) — also called a "Type I error"
- False Negative (FN): Model incorrectly predicts negative (missed actual spam) — also called a "Type II error"
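The four cells of the confusion matrix can be tallied directly from paired actual/predicted labels. A minimal sketch with made-up spam-filter labels (1 = spam/positive, 0 = not spam/negative; the data is hypothetical, not from any real model):

```python
# Tally a binary confusion matrix from actual vs. predicted labels.
# Hypothetical spam-filter output: 1 = spam (positive), 0 = not spam (negative).
actual    = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
predicted = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # spam caught
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # legit passed
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # legit flagged (Type I)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # spam missed (Type II)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # TP=3 TN=5 FP=1 FN=1
```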
On the Exam: Questions may present a scenario and ask which type of error is more concerning. For medical screening, false negatives (missing a disease) are typically worse than false positives (incorrectly flagging someone for further testing).
Classification Evaluation Metrics
Accuracy
The proportion of all predictions that are correct.
Formula: (TP + TN) / (TP + TN + FP + FN)
Interpretation: "What percentage of all predictions were correct?"
Limitation: Accuracy can be misleading with imbalanced datasets. If 95% of emails are not spam, a model that always predicts "not spam" has 95% accuracy but catches zero spam.
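The imbalanced-dataset limitation can be demonstrated in a few lines. A sketch with invented counts (95 legitimate emails, 5 spam) showing the "always predict the majority class" trap:

```python
# Accuracy paradox sketch: with 5 spam and 95 legitimate emails,
# a model that always predicts "not spam" is 95% accurate yet useless.
actual = ["spam"] * 5 + ["not spam"] * 95
predicted = ["not spam"] * 100  # the do-nothing majority-class model

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
spam_caught = sum(1 for a, p in zip(actual, predicted)
                  if a == "spam" and p == "spam")

print(accuracy)      # 0.95 -> looks great
print(spam_caught)   # 0    -> catches zero spam
```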
Precision
The proportion of positive predictions that are actually correct.
Formula: TP / (TP + FP)
Interpretation: "Of all the items I flagged as positive, how many actually were positive?"
When precision matters: When false positives are costly. Example: A spam filter with high precision rarely flags legitimate emails as spam.
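Plugging illustrative counts into the formula makes the interpretation concrete. The numbers below are invented for a hypothetical spam filter:

```python
# Precision sketch: of everything flagged as spam, how much really was spam?
tp = 40  # real spam correctly flagged (hypothetical count)
fp = 10  # legitimate emails wrongly flagged (hypothetical count)

precision = tp / (tp + fp)
print(precision)  # 0.8 -> 20% of flagged mail was actually legitimate
```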
Recall (Sensitivity)
The proportion of actual positives that the model correctly identifies.
Formula: TP / (TP + FN)
Interpretation: "Of all the actual positive items, how many did I find?"
When recall matters: When false negatives are costly. Example: A medical screening with high recall catches most actual disease cases.
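The same plug-in exercise for recall, with invented counts for a hypothetical disease screening:

```python
# Recall sketch: of the patients who actually have the disease, how many were caught?
tp = 90  # cases correctly identified (hypothetical count)
fn = 10  # cases the screening missed (hypothetical count)

recall = tp / (tp + fn)
print(recall)  # 0.9 -> 10% of real cases slipped through
```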
F1-Score
The harmonic mean of precision and recall, giving a single balanced measure.
Formula: 2 × (Precision × Recall) / (Precision + Recall)
Interpretation: A single metric that balances precision and recall. Useful when you need both to be high.
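Because it is a harmonic mean, F1 sits close to the *lower* of the two inputs, so a model cannot hide a weak recall behind a strong precision (or vice versa). A small sketch with made-up values:

```python
# F1 sketch: the harmonic mean punishes imbalance between precision and recall.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(f1(0.8, 0.8))   # balanced inputs -> 0.8
print(f1(0.99, 0.2))  # lopsided inputs -> ~0.33, far below the 0.595 simple average
```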
| Metric | Question It Answers | When It Matters |
|---|---|---|
| Accuracy | "How often is the model correct overall?" | Balanced datasets |
| Precision | "When the model says positive, is it right?" | False positives are costly |
| Recall | "Does the model find all positives?" | False negatives are costly |
| F1-Score | "Is there a good balance of precision and recall?" | Both errors are costly |
Precision vs. Recall Trade-off
There is often a trade-off between precision and recall:
- Increasing precision (fewer false positives) often decreases recall (more false negatives)
- Increasing recall (fewer false negatives) often decreases precision (more false positives)
Example: Medical screening
- High recall priority: You want to catch EVERY case of a disease, even if some healthy people are flagged for follow-up tests (accept more false positives to minimize false negatives)
- High precision priority: You only want to diagnose someone as sick when you are very confident (accept missing some cases to minimize false alarms)
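The trade-off usually comes from the decision threshold: most classifiers output a confidence score, and raising the cutoff for calling something "positive" trades recall for precision. A sketch with hypothetical scores and labels (not from any real model):

```python
# Threshold trade-off sketch: raising the decision cutoff for "positive"
# increases precision but decreases recall on this hypothetical data.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]  # model confidence per item
actual = [1,    1,   0,   1,   0,   1,   0,   0]     # 1 = actually positive

def precision_recall(threshold: float) -> tuple[float, float]:
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp / (tp + fp), tp / (tp + fn)

print(precision_recall(0.5))   # (0.6, 0.75): permissive cutoff, better recall
print(precision_recall(0.85))  # (1.0, 0.5):  strict cutoff, better precision
```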
On the Exam: Questions may describe a scenario and ask which metric is most important. Medical screening = recall (catch every case). Spam filter = precision (don't block legitimate emails). When both matter equally = F1-score.
A hospital is building an AI model to screen patients for a rare disease. Missing a positive case could be life-threatening. Which evaluation metric should be prioritized?
In a confusion matrix, what is a "false positive"?
A dataset has 950 legitimate transactions and 50 fraudulent ones. A model that always predicts "legitimate" has what accuracy?
Which classification type would you use to categorize customer support tickets into "Billing", "Technical", "Account", and "General"?