5.2 Statistics & Data Analysis
Key Takeaways
- Mean and median measure center; the median resists outliers while the mean is pulled toward them, so skewed data is better summarized by the median.
- Spread is measured by range, interquartile range (IQR = Q3 − Q1), and standard deviation; a larger standard deviation means data is more spread from the mean.
- The correlation coefficient r always lies between −1 and 1: the sign gives the direction, and |r| near 1 indicates a strong linear relationship.
- Correlation does not imply causation — a strong r can come from a lurking variable or coincidence.
- In a two-way frequency table, joint frequencies are single cells, marginal frequencies are row/column totals, and a conditional relative frequency divides a cell by the total of its own row or column.
Why Statistics Matters
Statistics and Probability is 7%–15% of Algebra I Regents credits, and the questions reward interpretation. You must read a graph or table, pick the right summary number, and explain what it means in context. The two big jobs are summarizing one variable (center, spread, shape) and analyzing the relationship between two variables (scatter plots and correlation).
Measures of Center and Spread
Center tells you the typical value:
- Mean — the average; add all values and divide by the count. It is pulled toward outliers.
- Median — the middle value when data is ordered. It resists outliers.
Spread tells you how scattered the data is:
- Range = maximum − minimum.
- Interquartile range (IQR) = Q3 − Q1 (the spread of the middle 50% of the data). IQR resists outliers.
- Standard deviation — the typical distance of values from the mean. A larger standard deviation means more spread; a value of 0 means every data point is identical.
Worked example. For 4, 5, 6, 7, 8 the mean is 30/5 = 6 and the median is 6. Now change the last value to 38: 4, 5, 6, 7, 38. The mean jumps to 60/5 = 12, but the median stays 6. The outlier dragged the mean but not the median — which is why skewed data is summarized by the median.
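The worked example above can be checked with a few lines of Python. This is a minimal sketch that uses the standard library's `statistics` module; the variable names `original` and `with_outlier` are chosen here for illustration only.

```python
from statistics import mean, median

# The two data sets from the worked example.
original = [4, 5, 6, 7, 8]
with_outlier = [4, 5, 6, 7, 38]  # last value replaced by an outlier

for label, data in (("original", original), ("with outlier", with_outlier)):
    # The mean uses every value, so one extreme value shifts it.
    # The median depends only on the middle position, so it stays put.
    print(f"{label}: mean = {mean(data)}, median = {median(data)}")
```

The mean moves from 6 to 12 while the median stays at 6, matching the hand calculation.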
Shape, Outliers, and Comparing Distributions
The shape of a distribution can be roughly symmetric, skewed right (a long tail of large values), or skewed left (a long tail of small values). In a right-skewed distribution the mean is typically greater than the median; in a left-skewed one the mean is typically less than the median.
An outlier is a value far from the rest of the data. A common rule flags any value below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR.
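The 1.5 × IQR rule can be sketched in Python as follows. The helpers `quartiles` and `find_outliers` and the sample data are hypothetical, written only for this illustration. The sketch assumes the median-of-halves method for quartiles (the overall median is left out of both halves when the count is odd); calculators and software that use a different quartile convention can give slightly different fences.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) using the median-of-halves method.

    Assumption: with an odd count, the overall median belongs to neither half.
    """
    if len(values) < 2:
        raise ValueError("need at least two values to compute quartiles")
    ordered = sorted(values)
    half = len(ordered) // 2
    lower = ordered[:half]    # values below the median position
    upper = ordered[-half:]   # values above the median position
    return median(lower), median(upper)

def find_outliers(values):
    """Return the values that lie more than 1.5 * IQR beyond Q1 or Q3."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [v for v in values if v < low_fence or v > high_fence]

# Made-up data set for illustration.
data = [2, 4, 5, 6, 7, 8, 9, 30]
q1, q3 = quartiles(data)
print("Q1 =", q1, "Q3 =", q3, "IQR =", q3 - q1)
print("outliers:", find_outliers(data))
```

For this data Q1 = 4.5 and Q3 = 8.5, so the IQR is 4 and the fences sit at −1.5 and 14.5; only 30 falls outside them.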
Three displays let you compare groups:
| Display | Best for |
|---|---|
| Dot plot | Small data sets; shows every value and clusters |
| Histogram | Larger data; shows shape across intervals (bins) |
| Box plot | Five-number summary (min, Q1, median, Q3, max); compares spread and skew between groups |
When comparing two box plots, compare the medians for center and the box lengths (IQR) for spread: a longer box means more variability in the middle 50% of the data.
A data set of home prices includes one mansion priced far above the rest. Which pair of statistics is least affected by that outlier?
Scatter Plots, Line of Best Fit, and the Correlation Coefficient
For bivariate (two-variable) data, a scatter plot graphs ordered pairs to reveal a relationship. If the points trend along a line, a line of best fit (linear regression line) summarizes the pattern, and a graphing calculator computes it as y = ax + b.
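To show roughly what the calculator does, here is a minimal sketch that assumes the line of best fit is the ordinary least-squares line. The function name `best_fit_line` and the study-hours data are made up for illustration.

```python
def best_fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # The least-squares line always passes through (mean_x, mean_y).
    b = mean_y - a * mean_x
    return a, b

# Hypothetical data: hours studied and test scores.
hours = [1, 2, 3, 4, 5]
scores = [62, 70, 71, 80, 87]

a, b = best_fit_line(hours, scores)
print(f"y = {a:.1f}x + {b:.1f}")
```

For this made-up data the slope is 6 and the intercept is 56, so each extra hour of study is associated with about 6 more points.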
The correlation coefficient r measures the strength and direction of a linear relationship and always satisfies −1 ≤ r ≤ 1:
- The sign gives direction: r > 0 means as x increases y tends to increase (positive); r < 0 means y tends to decrease (negative).
- The magnitude gives strength: |r| near 1 is a strong linear fit, |r| near 0 is weak or no linear fit.
| r value | Interpretation |
|---|---|
| r = 1 | Perfect positive linear |
| r ≈ 0.85 | Strong positive |
| r ≈ 0.3 | Weak positive |
| r = 0 | No linear relationship |
| r ≈ −0.9 | Strong negative |
| r = −1 | Perfect negative linear |
A value like r = 1.4 is impossible — it lies outside [−1, 1] and signals an error.
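The following sketch computes r from its standard definition so the range −1 ≤ r ≤ 1 can be seen in practice. The function name `correlation` and the data are hypothetical, chosen for illustration; on the exam, r comes from the calculator's regression output.

```python
from math import sqrt

def correlation(xs, ys):
    """Return the correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: hours studied and test scores.
hours = [1, 2, 3, 4, 5]
scores = [62, 70, 71, 80, 87]

r = correlation(hours, scores)
print(f"r = {r:.2f}")  # positive and close to 1: strong positive linear fit

# Points that lie exactly on a line give the extreme values of r.
r_up = correlation([1, 2, 3], [2, 4, 6])    # perfectly increasing
r_down = correlation([1, 2, 3], [6, 4, 2])  # perfectly decreasing
print(r_up, r_down)
```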
Residuals and Correlation vs Causation
A residual is the gap between an actual data value and the value the line of best fit predicts: residual = actual − predicted. A positive residual means the point sits above the line; a negative residual means below. A residual plot with no clear pattern (randomly scattered around zero) suggests the linear model is appropriate; a curved pattern suggests a line is the wrong model.
Correlation does not imply causation. A strong r tells you two variables move together, not that one causes the other. Ice-cream sales and drowning rates both rise in summer — a lurking variable (hot weather) drives both. On the Regents, a high r never lets you conclude one variable causes the change in the other; only a controlled experiment supports causation.
Worked example. A best-fit line predicts a test score of 78 for a student who studied 4 hours, but the student actually scored 85. The residual is 85 − 78 = +7, meaning the point lies 7 points above the prediction line.
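The same residual calculation as a short Python sketch. The helper names `residual` and `describe` are hypothetical and exist only for this illustration.

```python
def residual(actual, predicted):
    """Residual = actual value minus the value the line predicts."""
    return actual - predicted

def describe(res):
    """Say where the data point sits relative to the line of best fit."""
    if res > 0:
        return "above the line"
    if res < 0:
        return "below the line"
    return "on the line"

# Worked example: predicted score 78, actual score 85.
worked = residual(85, 78)
print(worked, describe(worked))
```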
Two-Way Frequency Tables
A two-way frequency table sorts data by two categorical variables. Consider 100 students surveyed about owning a pet and playing a sport:
| | Plays sport | No sport | Total |
|---|---|---|---|
| Owns pet | 30 | 20 | 50 |
| No pet | 25 | 25 | 50 |
| Total | 55 | 45 | 100 |
Three kinds of frequency come from this table:
- Joint frequency — a single inner cell. Students who own a pet and play a sport = 30.
- Marginal frequency — a row or column total. Total pet owners = 50; total athletes = 55.
- Conditional relative frequency — a cell divided by a row or column total. Among pet owners, the fraction who play a sport = 30/50 = 0.60. Among athletes, the fraction who own a pet = 30/55 ≈ 0.55.
The key Regents skill is reading the condition: “among pet owners” means divide by the pet-owner row total (50), not by the grand total (100).
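The three kinds of frequency can be reproduced from the joint counts alone. This is a minimal sketch; the nested-dictionary layout and the helper names `given_row` and `given_col` are assumptions made for illustration.

```python
# Joint frequencies from the survey table (rows: pet status, columns: sport status).
table = {
    "Owns pet": {"Plays sport": 30, "No sport": 20},
    "No pet": {"Plays sport": 25, "No sport": 25},
}

# Marginal frequencies: row totals and column totals.
row_totals = {row: sum(cells.values()) for row, cells in table.items()}
col_totals = {}
for cells in table.values():
    for col, count in cells.items():
        col_totals[col] = col_totals.get(col, 0) + count
grand_total = sum(row_totals.values())

def given_row(row, col):
    """Conditional relative frequency of `col` among everyone in `row`."""
    return table[row][col] / row_totals[row]

def given_col(col, row):
    """Conditional relative frequency of `row` among everyone in `col`."""
    return table[row][col] / col_totals[col]

print(row_totals, col_totals, grand_total)
print("Among pet owners, plays a sport:", given_row("Owns pet", "Plays sport"))
print("Among athletes, owns a pet:", round(given_col("Plays sport", "Owns pet"), 2))
```

The denominator is always the total of the group named in the condition, which is why `given_row` divides by a row total and `given_col` by a column total, and neither divides by the grand total.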
Using the table above (100 students), what is the conditional relative frequency that a student plays a sport, given that the student has no pet?