5.1 Classifying data & data displays
Key Takeaways
- Categorical data label groups (blood type, eye color); quantitative data measure or count and allow arithmetic.
- Quantitative data are discrete (counted, with gaps) or continuous (measured, any value in a range).
- Bar graphs and two-way tables display categorical data; dot plots, histograms, and box plots display quantitative data.
- Histogram bars touch because the axis is numeric and ordered; bar-graph bars are separated and can be reordered.
- A box plot's five-number summary splits data into four ~25% sections; a wider section means more spread, not more values.
Two Big Questions About Any Dataset
Statistics begins with two questions: What kind of data do I have, and what is the best way to display it? The TSIA2 expects you to classify variables correctly and read common graphs without being fooled by their appearance.
Categorical vs. Quantitative Data
Categorical (qualitative) data sort items into groups or labels. Examples: eye color, favorite subject, blood type, or a yes/no survey answer. Even when categories are written as numbers (like jersey numbers or zip codes), you cannot do meaningful arithmetic with them — averaging jersey numbers is nonsense.
Quantitative (numerical) data measure or count something, so arithmetic makes sense. Quantitative data split further:
- Discrete — countable values with gaps (number of pets: 0, 1, 2, ...).
- Continuous — any value in a range (height, weight, elapsed time).
A quick test: if "What is the average?" makes sense, the data are quantitative; if only "How many are in each group?" makes sense, the data are categorical.
| Variable | Type | Why |
|---|---|---|
| Blood type (A, B, AB, O) | Categorical | Labels, no arithmetic |
| Number of siblings | Quantitative (discrete) | Counted whole numbers |
| Race finish time | Quantitative (continuous) | Measured, any value |
| T-shirt size (S, M, L) | Categorical (ordered) | Ranked labels |
Displays for Categorical Data
Bar graph — separated bars, one per category; each bar's height shows a frequency or percent. Because categories have no fixed numeric order, the bars can be rearranged. Do not confuse a bar graph with a histogram.
Two-way table — cross-classifies two categorical variables. Rows and columns give counts, and the bottom-right corner is the grand total. To read one, find the cell where a row meets a column.
Worked example: A survey of 200 students records grade level and whether they ride the bus.
| Grade | Bus | No Bus | Total |
|---|---|---|---|
| 9th | 45 | 30 | 75 |
| 10th | 35 | 90 | 125 |
| Total | 80 | 120 | 200 |
The fraction of 10th graders who ride the bus is 35/125 = 0.28 = 28%. Notice the denominator is the 10th-grade total (125), not the whole 200 — mixing up the denominator is a common trap.
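The denominator choice can be checked with a short Python sketch. The counts come from the table above; the dictionary layout and the function names `row_total` and `fraction_within_row` are illustrative choices, not part of any standard method.

```python
# Counts copied from the bus survey table; the dictionary layout is illustrative.
table = {
    "9th": {"Bus": 45, "No Bus": 30},
    "10th": {"Bus": 35, "No Bus": 90},
}

def row_total(grade):
    """Add the cells in one row to get that grade's total."""
    return sum(table[grade].values())

def fraction_within_row(grade, column):
    """Share of one grade that falls in the given column (denominator = row total)."""
    return table[grade][column] / row_total(grade)

grand_total = sum(row_total(g) for g in table)

tenth_bus = fraction_within_row("10th", "Bus")          # 35 / 125
wrong_denominator = table["10th"]["Bus"] / grand_total  # 35 / 200, a different question

print(row_total("10th"), grand_total)  # 125 200
print(tenth_bus, wrong_denominator)    # 0.28 0.175
```

The second value, 0.175, is the fraction of all 200 students who are 10th-grade bus riders. It is a valid number, but it answers a different question from the one asked.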
Displays for Quantitative Data
Dot plot — each value is a dot above a number line, and stacked dots show repeats. It works well for small datasets because you can see clusters and gaps directly.
Histogram — the bars touch because the horizontal axis is a continuous number line divided into equal intervals (bins). Each bar's height is the count of values in that bin. Since the x-axis is numeric, the bars cannot be reordered. Shape matters: a tail stretching to the right is skewed right, a tail to the left is skewed left, and a balanced mound is symmetric.
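As a rough sketch of how a histogram's bins are filled, the Python below sorts values into equal-width intervals and prints one row of marks per bin. The scores are invented for illustration, and the rule that a boundary value such as 70 belongs to the bin that starts there is an assumption; conventions differ between textbooks.

```python
from collections import Counter

# Hypothetical quiz scores, invented for illustration only.
scores = [52, 58, 61, 64, 67, 70, 71, 73, 75, 78, 79, 82, 85, 88, 94]
bin_width = 10  # equal-width bins: 50-60, 60-70, ...

# Assumed convention: a value on a boundary goes in the bin that starts there,
# so 70 is counted in 70-80, not 60-70.
counts = Counter((s // bin_width) * bin_width for s in scores)

for start in range(min(counts), max(counts) + bin_width, bin_width):
    # Every bin is printed in numeric order, even an empty one,
    # because the number line has no gaps and cannot be reordered.
    print(f"{start}-{start + bin_width}: {'#' * counts[start]}")
```

Each row of `#` marks plays the role of one bar, turned sideways. With these made-up scores the 70-80 bin holds the most values (6 of the 15).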
Box plot (box-and-whisker) — built from the five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. The box spans Q1 to Q3, the line inside marks the median, and the whiskers reach the extremes. The box holds the middle 50% of the data.
Worked example: For the five-number summary min = 12, Q1 = 20, median = 26, Q3 = 35, max = 50, the box runs from 20 to 35 with a line at 26. The interquartile range is Q3 - Q1 = 35 - 20 = 15, so the middle half of the values spans 15 units.
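The same arithmetic can be written as a minimal Python sketch, using the five-number summary from the worked example. The variable names are illustrative.

```python
# Five-number summary from the worked example.
summary = {"min": 12, "Q1": 20, "median": 26, "Q3": 35, "max": 50}

iqr = summary["Q3"] - summary["Q1"]  # width of the box

# Width of each of the four sections, left to right:
# min-Q1, Q1-median, median-Q3, Q3-max.
points = list(summary.values())
section_widths = [b - a for a, b in zip(points, points[1:])]

print(iqr)             # 15
print(section_widths)  # [8, 6, 9, 15]
```

The four sections have different widths (8, 6, 9, and 15 units), yet each holds about a quarter of the data. The upper whisker is widest because the top quarter of the values is the most spread out.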
Scatterplot — plots paired (x, y) data as points to reveal a relationship. Read the direction (positive: points rise; negative: points fall), the form (linear or curved), and the strength (tightly clustered vs. widely scattered). A rising cloud of points signals a positive association, but it does not by itself prove that one variable causes the other.
Frequency vs. Relative Frequency
A display can show frequency (a raw count) or relative frequency (a proportion or percent of the total). To convert a count to a relative frequency, divide it by the grand total. If a histogram shows 8 students scoring in the 70-80 bin out of 40 students in all, the relative frequency is 8 / 40 = 0.20 = 20%. Relative frequencies let you compare groups of different sizes fairly, because a count of 8 means much more in a class of 20 than in a class of 200.
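A small Python sketch of the conversion, using the numbers from the paragraph above. The function name `relative_frequency` is an illustrative choice.

```python
def relative_frequency(count, total):
    """Convert a raw count to a proportion of the total."""
    return count / total

print(relative_frequency(8, 40))   # 0.2  -> 20% of a class of 40
print(relative_frequency(8, 20))   # 0.4  -> 40% of a class of 20
print(relative_frequency(8, 200))  # 0.04 -> 4% of a class of 200
```

The raw count is 8 in all three cases, but the share of the group it represents ranges from 4% to 40%.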
Worked example (dot plot): Suppose a dot plot has 3 dots above 5, 2 dots above 6, and 1 dot above 8. There are 3 + 2 + 1 = 6 data values in all, the tallest stack is over 5 so the mode is 5, and the lone value at 8 sits apart from the cluster as a possible outlier. Reading a dot plot is just counting dots.
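The counting in this example can be sketched in Python with the standard library's `Counter`. The list of values is read off the dot plot described above; the text rendering is only a rough stand-in for a real plot.

```python
from collections import Counter

# Values read off the dot plot: 3 dots above 5, 2 above 6, 1 above 8.
values = [5, 5, 5, 6, 6, 8]
stacks = Counter(values)

total = sum(stacks.values())             # number of dots
mode, height = stacks.most_common(1)[0]  # value with the tallest stack

# Text version of the plot, turned sideways; 7 is printed with no dots
# so the gap before 8 stays visible.
for x in range(min(values), max(values) + 1):
    print(x, "o" * stacks[x])

print(total, mode)  # 6 5
```

The empty row at 7 is what makes the value at 8 stand apart from the cluster at 5 and 6.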
Reading Displays Carefully
- Always read the axis labels and scale first — a truncated y-axis exaggerates differences.
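- For a histogram, add the heights of adjacent bars to count the values in a wider range (for example, add the 10-20 and 20-30 bars to find how many values fall between 10 and 30), and add all the bars to get the total number of values.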
- For a box plot, remember each section (min-Q1, Q1-median, median-Q3, Q3-max) contains about 25% of the data, even when the sections have different widths. A wide section means the data are spread out there, not that more values live there.
Match the display to the data type: categorical data go with a bar graph or two-way table; quantitative data go with a dot plot, histogram, or box plot; and paired quantitative data go with a scatterplot. Choosing the wrong display, such as a histogram for eye color, is a classic error the TSIA2 likes to test.
Which of the following variables is categorical?
In the two-way table of 200 students (9th: 45 bus, 30 no bus; 10th: 35 bus, 90 no bus), what fraction of 10th graders ride the bus?
Which display is best for showing the relationship between two quantitative variables measured on the same individuals?