2.3 Classification Models
Key Takeaways
- Classification models predict discrete categories (spam/not spam, disease/no disease) — the output is a class label, not a number.
- Binary classification has two possible outcomes (yes/no); multi-class classification has three or more possible outcomes.
- Key evaluation metrics: accuracy (overall correctness), precision (quality of positive predictions), recall (ability to find all positives), F1-score (balance of precision and recall).
- A confusion matrix shows true positives, true negatives, false positives, and false negatives — it is the foundation for calculating all classification metrics.
- The AI-900 tests conceptual understanding of classification — know what the metrics mean, not how to calculate them.
Classification Models
Quick Answer: Classification models predict discrete categories like spam/not spam or disease/no disease. Binary classification has two outcomes; multi-class has three or more. Evaluation metrics include accuracy, precision, recall, and F1-score. The confusion matrix shows true/false positives and negatives.
What Is Classification?
Classification is a supervised machine learning technique that predicts a discrete category (class) for each input. Unlike regression, which outputs a continuous number, classification outputs a label.
How to Identify a Classification Problem
Ask yourself: Is the predicted output a category or label?
- Predicting if an email is spam or not → Classification (two categories)
- Predicting a house price → Regression (continuous number)
- Predicting the species of a flower (setosa, versicolor, virginica) → Classification (three categories)
- Grouping customers by behavior → Clustering (unsupervised)
Binary vs. Multi-Class Classification
Binary Classification
Predicts one of two possible outcomes:
| Use Case | Class 0 (Negative) | Class 1 (Positive) |
|---|---|---|
| Spam detection | Not Spam | Spam |
| Medical diagnosis | No Disease | Disease |
| Fraud detection | Legitimate | Fraudulent |
| Loan approval | Deny | Approve |
| Defect detection | Good | Defective |
Multi-Class Classification
Predicts one of three or more possible outcomes:
| Use Case | Possible Classes |
|---|---|
| Image recognition | Cat, Dog, Bird, Fish |
| Sentiment analysis | Positive, Negative, Neutral |
| Customer segmentation | Premium, Standard, Basic |
| Document classification | Invoice, Receipt, Contract, Letter |
| Support ticket routing | Billing, Technical, Account, General |
The Confusion Matrix
The confusion matrix is a table that shows how well a classification model performs by comparing predictions to actual values. For binary classification:
| | Predicted: Positive | Predicted: Negative |
|---|---|---|
| Actual: Positive | True Positive (TP) | False Negative (FN) |
| Actual: Negative | False Positive (FP) | True Negative (TN) |
- True Positive (TP): Model correctly predicts positive (detected spam that was actually spam)
- True Negative (TN): Model correctly predicts negative (correctly identified a legitimate email)
- False Positive (FP): Model incorrectly predicts positive (flagged a legitimate email as spam) — also called a "Type I error"
- False Negative (FN): Model incorrectly predicts negative (missed actual spam) — also called a "Type II error"
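The four cells of the confusion matrix can be tallied directly from paired actual/predicted labels. A minimal sketch with made-up spam-filter labels (1 = spam/positive, 0 = not spam/negative; the data is hypothetical, not from any real model):

```python
# Tally a binary confusion matrix from actual vs. predicted labels.
# Hypothetical spam-filter output: 1 = spam (positive), 0 = not spam (negative).
actual    = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
predicted = [1, 1, 0, 0, 0, 1, 0, 1, 0, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # spam caught
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # legit passed
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # legit flagged (Type I)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # spam missed (Type II)

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # TP=3 TN=5 FP=1 FN=1
```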
On the Exam: Questions may present a scenario and ask which type of error is more concerning. For medical screening, false negatives (missing a disease) are typically worse than false positives (incorrectly flagging someone for further testing).
Classification Evaluation Metrics
Accuracy
The proportion of all predictions that are correct.
Formula: (TP + TN) / (TP + TN + FP + FN)
Interpretation: "What percentage of all predictions were correct?"
Limitation: Accuracy can be misleading with imbalanced datasets. If 95% of emails are not spam, a model that always predicts "not spam" has 95% accuracy but catches zero spam.
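The imbalanced-dataset limitation can be demonstrated in a few lines. A sketch with invented counts (95 legitimate emails, 5 spam) showing the "always predict the majority class" trap:

```python
# Accuracy paradox sketch: with 5 spam and 95 legitimate emails,
# a model that always predicts "not spam" is 95% accurate yet useless.
actual = ["spam"] * 5 + ["not spam"] * 95
predicted = ["not spam"] * 100  # the do-nothing majority-class model

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
spam_caught = sum(1 for a, p in zip(actual, predicted)
                  if a == "spam" and p == "spam")

print(accuracy)      # 0.95 -> looks great
print(spam_caught)   # 0    -> catches zero spam
```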
Precision
The proportion of positive predictions that are actually correct.
Formula: TP / (TP + FP)
Interpretation: "Of all the items I flagged as positive, how many actually were positive?"
When precision matters: When false positives are costly. Example: A spam filter with high precision rarely flags legitimate emails as spam.
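Plugging illustrative counts into the formula makes the interpretation concrete. The numbers below are invented for a hypothetical spam filter:

```python
# Precision sketch: of everything flagged as spam, how much really was spam?
tp = 40  # real spam correctly flagged (hypothetical count)
fp = 10  # legitimate emails wrongly flagged (hypothetical count)

precision = tp / (tp + fp)
print(precision)  # 0.8 -> 20% of flagged mail was actually legitimate
```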
Recall (Sensitivity)
The proportion of actual positives that the model correctly identifies.
Formula: TP / (TP + FN)
Interpretation: "Of all the actual positive items, how many did I find?"
When recall matters: When false negatives are costly. Example: A medical screening with high recall catches most actual disease cases.
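The same plug-in exercise for recall, with invented counts for a hypothetical disease screening:

```python
# Recall sketch: of the patients who actually have the disease, how many were caught?
tp = 90  # cases correctly identified (hypothetical count)
fn = 10  # cases the screening missed (hypothetical count)

recall = tp / (tp + fn)
print(recall)  # 0.9 -> 10% of real cases slipped through
```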
F1-Score
The harmonic mean of precision and recall, giving a single balanced measure.
Formula: 2 × (Precision × Recall) / (Precision + Recall)
Interpretation: A single metric that balances precision and recall. Useful when you need both to be high.
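Because it is a harmonic mean, F1 sits close to the *lower* of the two inputs, so a model cannot hide a weak recall behind a strong precision (or vice versa). A small sketch with made-up values:

```python
# F1 sketch: the harmonic mean punishes imbalance between precision and recall.
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

print(f1(0.8, 0.8))   # balanced inputs -> 0.8
print(f1(0.99, 0.2))  # lopsided inputs -> ~0.33, far below the 0.595 simple average
```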
| Metric | Question It Answers | When It Matters |
|---|---|---|
| Accuracy | "How often is the model correct overall?" | Balanced datasets |
| Precision | "When the model says positive, is it right?" | False positives are costly |
| Recall | "Does the model find all positives?" | False negatives are costly |
| F1-Score | "Is there a good balance of precision and recall?" | Both errors are costly |
Precision vs. Recall Trade-off
There is often a trade-off between precision and recall:
- Increasing precision (fewer false positives) often decreases recall (more false negatives)
- Increasing recall (fewer false negatives) often decreases precision (more false positives)
Example: Medical screening
- High recall priority: You want to catch EVERY case of a disease, even if some healthy people are flagged for follow-up tests (accept more false positives to minimize false negatives)
- High precision priority: You only want to diagnose someone as sick when you are very confident (accept missing some cases to minimize false alarms)
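The trade-off usually comes from the decision threshold: most classifiers output a confidence score, and raising the cutoff for calling something "positive" trades recall for precision. A sketch with hypothetical scores and labels (not from any real model):

```python
# Threshold trade-off sketch: raising the decision cutoff for "positive"
# increases precision but decreases recall on this hypothetical data.
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2]  # model confidence per item
actual = [1,    1,   0,   1,   0,   1,   0,   0]     # 1 = actually positive

def precision_recall(threshold: float) -> tuple[float, float]:
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp / (tp + fp), tp / (tp + fn)

print(precision_recall(0.5))   # (0.6, 0.75): permissive cutoff, better recall
print(precision_recall(0.85))  # (1.0, 0.5):  strict cutoff, better precision
```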
On the Exam: Questions may describe a scenario and ask which metric is most important. Medical screening = recall (catch every case). Spam filter = precision (don't block legitimate emails). When both matter equally = F1-score.
A hospital is building an AI model to screen patients for a rare disease. Missing a positive case could be life-threatening. Which evaluation metric should be prioritized?
In a confusion matrix, what is a "false positive"?
A dataset has 950 legitimate transactions and 50 fraudulent ones. A model that always predicts "legitimate" has what accuracy?
Which classification type would you use to categorize customer support tickets into "Billing", "Technical", "Account", and "General"?