2.3 Data Analysis and Graphing
Key Takeaways
- Mean is the arithmetic average; median is the middle value; mode is the most frequent value; range is the spread (max minus min); standard deviation quantifies how tightly data cluster around the mean.
- A p-value below 0.05 means that a result at least as extreme as the one observed would occur less than 5% of the time if the null hypothesis were true. By convention, results below this threshold are called 'statistically significant.'
- A t-test compares two group means for continuous data; a chi-square goodness-of-fit test compares observed counts to expected counts in categorical data.
- Correlation (Pearson r between -1 and +1) measures the strength and direction of a linear relationship but does NOT establish causation.
- Bar graphs compare categorical groups, line graphs show change over a continuous variable (time, temperature, concentration), and scatter plots reveal the relationship between two continuous variables.
Why This Section Matters
The Praxis exam does not require you to crunch large data sets, but it does demand that you interpret standard descriptive and inferential statistics and select the right graph type for a data set. Questions appear both in the Nature of Science subarea and embedded in content questions across genetics, ecology, and physiology.
Descriptive Statistics
For a small data set, the measures of central tendency are:
- Mean = sum of values / number of values. Most useful for symmetric, continuous data.
- Median = middle value when data are ordered (or average of the two middle values). Robust to outliers.
- Mode = most frequent value. Useful for categorical data (e.g., most common eye color).
The measures of spread are:
- Range = maximum minus minimum.
- Variance (s^2) = sum of squared deviations from the mean, divided by n - 1 for a sample (roughly the average squared deviation).
- Standard deviation (s) = square root of variance. Reports spread in the original units.
Worked Example
Seven leaf lengths in cm: 6, 7, 7, 8, 9, 10, 14.
- Mean = 61 / 7 = 8.71 cm
- Median = 4th value = 8 cm
- Mode = 7 cm (the only repeated value)
- Range = 14 - 6 = 8 cm
- Standard deviation = approximately 2.69 cm (sample standard deviation, dividing by n - 1 = 6)
Notice that the outlier (14 cm) pulls the mean higher than the median. When a distribution is skewed, the median is a more representative "typical value."
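The worked example can be checked with a short script. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the original example.

```python
import statistics

# Leaf lengths (cm) from the worked example
leaf_lengths = [6, 7, 7, 8, 9, 10, 14]

mean = statistics.mean(leaf_lengths)      # arithmetic average
median = statistics.median(leaf_lengths)  # middle value of the ordered data
mode = statistics.mode(leaf_lengths)      # most frequent value
data_range = max(leaf_lengths) - min(leaf_lengths)
sample_sd = statistics.stdev(leaf_lengths)  # sample SD: divides by n - 1

print(f"mean   = {mean:.2f} cm")
print(f"median = {median} cm")
print(f"mode   = {mode} cm")
print(f"range  = {data_range} cm")
print(f"SD     = {sample_sd:.2f} cm")
```

Note that `statistics.stdev` is the sample standard deviation; `statistics.pstdev` would divide by n instead and give a smaller value (about 2.49 cm for these data).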
Error Bars
Error bars on a graph represent variability: standard deviation, standard error, or a 95% confidence interval. The label matters:
- Bars of +/- 1 standard deviation show how spread out individual data points are.
- Bars of +/- 1 standard error of the mean (SEM) show how precisely the mean is estimated; SEM = s / sqrt(n) and shrinks with larger sample size.
- Overlapping error bars between two groups loosely suggest no significant difference, but how much the overlap tells you depends on which type of bar is plotted, so a formal test (such as a t-test) is needed to confirm.
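To make the SEM formula concrete, here is a small sketch that reuses the seven leaf lengths from the worked example. The "four times the sample size" comparison is a hypothetical illustration that assumes the standard deviation stays the same.

```python
import math
import statistics

# Leaf lengths (cm) from the worked example
leaf_lengths = [6, 7, 7, 8, 9, 10, 14]

n = len(leaf_lengths)
sd = statistics.stdev(leaf_lengths)  # sample standard deviation
sem = sd / math.sqrt(n)              # SEM = s / sqrt(n)

# Hypothetical: same SD, but four times as many leaves measured.
# Quadrupling n halves the SEM because sqrt(4) = 2.
sem_4x = sd / math.sqrt(4 * n)

print(f"SD  = {sd:.2f} cm")
print(f"SEM = {sem:.2f} cm (n = {n})")
print(f"SEM = {sem_4x:.2f} cm (n = {4 * n}, same SD assumed)")
```

The standard deviation describes the leaves themselves and does not shrink as more are measured; only the uncertainty in the mean does.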
Inferential Statistics for the Praxis
The p-value
A p-value is the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis (no effect) is true.
- p < 0.05 -> conventionally statistically significant; we reject the null hypothesis.
- p >= 0.05 -> we fail to reject the null hypothesis. (Note: failing to reject is not the same as proving there is no effect.)
A p-value is not the probability that the treatment works, nor the size of the effect. Praxis stems frequently include a wrong answer that says exactly that.
The t-test
A two-sample t-test compares the means of two groups on a continuous measurement (heights, enzyme rates, heart rates). Output is a t-statistic and a p-value. Use a t-test when:
- Data are continuous.
- You have two groups (e.g., control vs. treatment).
- Data are approximately normally distributed.
For more than two groups, use ANOVA.
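The arithmetic behind a t-statistic can be sketched from summary statistics alone. The example below assumes Welch's version of the two-sample t-test (which does not require equal variances); the section does not say which variant is intended, and the function name `welch_t` is hypothetical. The numbers are the oak leaf widths from the practice question at the end of this section. Converting t to an exact p-value requires the t distribution, which this stdlib-only sketch does not compute.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    v1 = sd1 ** 2 / n1  # squared standard error of group 1's mean
    v2 = sd2 ** 2 / n2  # squared standard error of group 2's mean
    se_diff = math.sqrt(v1 + v2)  # standard error of the difference in means
    t = (mean1 - mean2) / se_diff
    # Welch-Satterthwaite approximation for degrees of freedom
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Species A: mean 6.5 cm, SD 0.3 cm, n = 30; Species B: mean 6.8 cm, SD 0.4 cm, n = 30
t_stat, df = welch_t(6.5, 0.3, 30, 6.8, 0.4, 30)
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

The magnitude of t (about 3.3) is well beyond roughly 2.0, the approximate two-tailed critical value at the 0.05 level for this many degrees of freedom, which is consistent with the small p-value reported in that question.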
The chi-square goodness-of-fit test
A chi-square (X^2) goodness-of-fit test compares observed counts to the counts expected under a hypothesis, which makes it well suited to testing Mendelian ratios.
X^2 = Sum [ (O - E)^2 / E ]
where O is observed count and E is expected count.
Example: A monohybrid cross predicts a 3:1 ratio of yellow : green peas. Out of 80 offspring you observe 58 yellow and 22 green. Expected: 60 yellow, 20 green.
X^2 = (58 - 60)^2 / 60 + (22 - 20)^2 / 20 = 4/60 + 4/20 = 0.067 + 0.200 = 0.27.
Degrees of freedom = number of categories - 1, so two phenotype classes give 1 degree of freedom. With 1 degree of freedom, the critical X^2 value at p = 0.05 is 3.84. Since 0.27 < 3.84, we fail to reject the 3:1 hypothesis: the observed data are consistent with Mendel's prediction.
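The pea example can be reproduced in a few lines. This is a minimal sketch; the helper name `chi_square_gof` is illustrative, and the critical value 3.84 is taken from the text rather than computed.

```python
def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [58, 22]  # yellow, green
total = sum(observed)
expected = [total * 3 / 4, total * 1 / 4]  # 3:1 ratio -> 60 yellow, 20 green

chi2 = chi_square_gof(observed, expected)
CRITICAL_DF1_P05 = 3.84  # critical value for 1 degree of freedom at p = 0.05

reject_null = chi2 > CRITICAL_DF1_P05
print(f"X^2 = {chi2:.2f}, reject 3:1 hypothesis: {reject_null}")
```

Always compute expected counts from the total number of offspring actually observed, not from an idealized sample size.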
Correlation vs. Causation
Correlation measures whether two continuous variables move together.
- Pearson r ranges from -1 to +1.
- r near +1 = strong positive correlation; r near -1 = strong negative correlation; r near 0 = no linear relationship.
- The square of r, r^2 (coefficient of determination), tells you the proportion of variance in one variable explained by the other.
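As an illustration of how r and r^2 are computed, the sketch below applies the standard Pearson formula to a small made-up data set. The x and y values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired measurements (not from the text)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_squared = r ** 2
print(f"r = {r:.2f}, r^2 = {r_squared:.2f}")
```

Here r is about 0.77, a fairly strong positive correlation, and r^2 is 0.60, meaning about 60% of the variance in y is accounted for by its linear relationship with x.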
Causation means changes in one variable directly produce changes in the other. Correlation alone never proves causation, because of:
- Reverse causation - the effect actually causes the supposed cause.
- Confounding variables - a third factor causes both.
- Chance - a coincidence in the sample.
The most reliable way to establish causation is a controlled experiment with random assignment, which balances potential confounding variables across the groups.
Choosing the Right Graph
| Graph | Use When | Example |
|---|---|---|
| Bar graph | Comparing values across discrete categories | Mean leaf length by tree species |
| Histogram | Showing the distribution of a single continuous variable | Frequency of leaf lengths across all trees |
| Line graph | Showing change in one variable over a continuous independent variable (often time) | Population size over 10 years; enzyme activity vs. temperature |
| Scatter plot | Showing the relationship between two continuous variables | Height vs. weight; CO2 vs. photosynthesis rate |
| Pie chart | Showing parts of a whole for a single category | Composition of a community by phylum |
| Box-and-whisker plot | Comparing distributions including median and IQR across groups | Test scores by classroom |
Interpreting Trends
- A plateau on a rate curve indicates that another factor has become limiting (Liebig's Law of the Minimum).
- A sigmoidal (S-shaped) curve is typical of logistic population growth.
- A bell-shaped (normal) distribution results from many small independent factors and underlies most parametric tests.
- An exponential curve describes unconstrained growth or first-order decay (radioactive isotopes used in dating).
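The difference between exponential, logistic, and decay curves is easier to see with numbers. The sketch below uses the standard closed-form equations; the starting population, growth rate, and carrying capacity are hypothetical values chosen for illustration, and 5,730 years is the commonly cited half-life of carbon-14.

```python
import math

def exponential_growth(n0, r, t):
    """Unconstrained growth: N(t) = N0 * e^(r*t)."""
    return n0 * math.exp(r * t)

def logistic_growth(n0, r, k, t):
    """Logistic growth toward carrying capacity K (S-shaped curve)."""
    return k / (1 + ((k - n0) / n0) * math.exp(-r * t))

def fraction_remaining(t, half_life):
    """First-order decay: fraction of the original isotope left after time t."""
    return 0.5 ** (t / half_life)

# Hypothetical parameters: N0 = 10 individuals, r = 0.5 per year, K = 1000
for year in (0, 5, 10, 20, 40):
    exp_n = exponential_growth(10, 0.5, year)
    log_n = logistic_growth(10, 0.5, 1000, year)
    print(f"year {year:>2}: exponential = {exp_n:,.0f}, logistic = {log_n:,.0f}")

# Carbon-14 after two half-lives (2 x 5,730 years)
print(fraction_remaining(11460, 5730))
```

The two growth curves start out close together, but the logistic curve levels off near K while the exponential curve keeps climbing.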
Praxis Trap to Avoid
When a question reports a correlation in an observational study (e.g., "students who eat breakfast score higher on tests"), the answer that calls it proof of causation is wrong. The answer that flags a possible confound (socioeconomic status, parent involvement) is the scientifically correct one.
Practice Questions
In a study of leaf widths from two oak species, Species A has a mean width of 6.5 cm (SD = 0.3 cm, n = 30) and Species B has a mean width of 6.8 cm (SD = 0.4 cm, n = 30). A two-sample t-test returns p = 0.002. Which conclusion is best supported?
A class crosses heterozygous purple-flowered pea plants (Pp x Pp) and counts 168 purple and 72 white offspring among 240 F2 plants. They want to test whether the data fit Mendel's predicted 3:1 ratio. Which statistical test is the appropriate choice and what are the expected counts?