2.2 Regression Models
Key Takeaways
- Regression models predict continuous numerical values — house prices, temperatures, sales volumes, stock prices.
- Linear regression finds the straight-line relationship between features and the label; it is the simplest regression algorithm.
- Key evaluation metrics for regression: MAE (average error magnitude), RMSE (penalizes large errors), R-squared (proportion of variance explained, where 1.0 = perfect).
- R-squared (coefficient of determination) typically ranges from 0 to 1 (it can fall below 0 for models that fit worse than simply predicting the mean) — higher values indicate better model fit. An R-squared of 0.85 means the model explains 85% of the variance in the data.
- The AI-900 tests conceptual understanding of regression — you will not be asked to calculate metrics or write regression code.
Regression Models
Quick Answer: Regression models predict continuous numerical values like prices, temperatures, and quantities. Linear regression finds the straight-line relationship between features and the label. Key evaluation metrics are MAE (Mean Absolute Error), RMSE (Root Mean Squared Error), and R-squared (proportion of variance explained). Higher R-squared values indicate better model performance.
What Is Regression?
Regression is a supervised machine learning technique that predicts a continuous numerical value based on input features. The output is a number on a continuous scale, not a category.
How to Identify a Regression Problem
Ask yourself: Is the predicted output a number on a continuous scale?
- Predicting a house price ($450,000) → Regression (continuous number)
- Predicting if an email is spam → Classification (category)
- Predicting tomorrow's temperature (72.5°F) → Regression (continuous number)
- Grouping customers into segments with no predefined labels → Clustering (unsupervised grouping, not prediction)
Linear Regression
Linear regression is the simplest and most fundamental regression algorithm. It finds the straight-line relationship between features and the label.
Simple Linear Regression (One Feature)
With one feature, linear regression fits a straight line through the data:
Formula: y = mx + b
- y = predicted value (label)
- x = input value (feature)
- m = slope (how much y changes when x changes by 1)
- b = y-intercept (value of y when x = 0)
Example: Predicting house price (y) based on square footage (x)
- If m = 200 and b = 50,000
- A 1,500 sq ft house: y = 200(1,500) + 50,000 = $350,000
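The arithmetic above can be sketched in a few lines of Python. The slope and intercept are the example values from the text, not parameters fitted from real data:

```python
def predict_price(sqft: float, slope: float = 200.0, intercept: float = 50_000.0) -> float:
    """Simple linear regression prediction: y = m*x + b."""
    return slope * sqft + intercept

# A 1,500 sq ft house with m = 200, b = 50,000
print(predict_price(1_500))  # 200 * 1500 + 50000 = 350000.0
```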
Multiple Linear Regression (Multiple Features)
With multiple features, the model accounts for several input variables:
Formula: y = m₁x₁ + m₂x₂ + m₃x₃ + ... + b
Example: Predicting house price based on square footage (x₁), bedrooms (x₂), and age (x₃)
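A multiple-regression prediction is the same idea with one coefficient per feature. The coefficients below are made-up illustrative values (not fitted from data), just to show how the formula extends:

```python
def predict(features, coefficients, intercept):
    """Multiple linear regression: y = m1*x1 + m2*x2 + ... + b."""
    return sum(m * x for m, x in zip(coefficients, features)) + intercept

# Hypothetical coefficients: $150 per sq ft, +$10,000 per bedroom, -$1,000 per year of age
house = [1_500, 3, 20]                 # square footage, bedrooms, age (years)
coeffs = [150.0, 10_000.0, -1_000.0]
print(predict(house, coeffs, 50_000.0))  # 225000 + 30000 - 20000 + 50000 = 285000.0
```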
Regression Evaluation Metrics
After training a regression model, you evaluate its performance using these metrics:
Mean Absolute Error (MAE)
The average absolute difference between predicted and actual values.
- Interpretation: On average, how far off are the predictions?
- Example: MAE of $15,000 means predictions are off by $15,000 on average
- Lower is better
Root Mean Squared Error (RMSE)
The square root of the average squared differences between predicted and actual values.
- Interpretation: Similar to MAE but penalizes larger errors more heavily
- Example: RMSE of $20,000 for a model whose MAE is $15,000 — RMSE is never smaller than MAE, and the gap widens when a few predictions are badly off
- Lower is better
R-squared (Coefficient of Determination)
The proportion of variance in the label that the model explains.
- Range: 0 to 1 (can be negative for very poor models)
- Interpretation: How much of the variation in the output does the model explain?
- Example: R² = 0.85 means the model explains 85% of the variation in house prices
- Higher is better (1.0 = perfect, 0 = no better than predicting the average)
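All three metrics can be computed directly from their definitions. Here is a minimal sketch in plain Python, using a tiny made-up set of actual and predicted values:

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error: average magnitude of the prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Squared Error: squaring penalizes large errors more heavily."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def r_squared(actual, predicted):
    """Proportion of variance in the actual values explained by the predictions."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 320.0, 390.0]
print(mae(actual, predicted))                # 12.5
print(round(rmse(actual, predicted), 2))     # 13.23 (larger than MAE: the 20-unit error dominates)
print(round(r_squared(actual, predicted), 3))  # 0.986
```

Note how RMSE exceeds MAE on the same data: the single 20-unit error contributes four times as much to the squared sum as each 10-unit error.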
| Metric | What It Measures | Good Values | Direction |
|---|---|---|---|
| MAE | Average prediction error | Close to 0 | Lower is better |
| RMSE | Average error (penalizes large errors) | Close to 0 | Lower is better |
| R-squared | Proportion of variance explained | Close to 1.0 | Higher is better |
On the Exam: You will NOT be asked to calculate these metrics. You need to know what each metric means, how to interpret it, and which direction is "better." Common question: "An R-squared of 0.92 indicates that..." → the model explains 92% of the variance in the data.
Common Regression Use Cases
| Use Case | Features | Label |
|---|---|---|
| House price prediction | Size, bedrooms, location | Price ($) |
| Sales forecasting | Month, promotions, weather | Sales volume |
| Temperature prediction | Date, location, humidity | Temperature (°F) |
| Stock price prediction | Volume, market index, news | Stock price ($) |
| Delivery time estimation | Distance, traffic, weather | Delivery time (minutes) |
| Insurance premium pricing | Age, health history, coverage | Premium ($/month) |
| Energy consumption | Time, weather, building size | Energy (kWh) |
Regression vs. Classification: The Key Difference
| Aspect | Regression | Classification |
|---|---|---|
| Output type | Continuous number | Discrete category |
| Examples | $450,000, 72.5°F, 3.7 hours | Spam/Not Spam, Cat/Dog, Yes/No |
| Question answered | "How much?" or "How many?" | "Which category?" or "Is it X?" |
| Evaluation | MAE, RMSE, R-squared | Accuracy, Precision, Recall, F1 |
Review Questions
What does an R-squared value of 0.92 indicate about a regression model?
Which machine learning technique would you use to predict the temperature in a city tomorrow?
Which regression evaluation metric penalizes larger prediction errors more heavily?
A model predicts house prices. The MAE is $25,000 and the R-squared is 0.78. What does this tell you?