7.3 Measurement, Reliability, and Validity

Key Takeaways

  • Reliability concerns consistency of measurement; validity concerns the meaning and appropriate use of scores.
  • A measure can be reliable without being valid for a particular decision, but it cannot be valid without being reliable.
  • Validity evidence comes from content, criterion-related (predictive/concurrent), and construct sources (convergent and discriminant).
  • Sensitivity, specificity, and base rates govern classification accuracy and the meaning of positive screens.
Last updated: June 2026

Scores Are Evidence, Not Magic

Psychological research depends on measurement; a weak measure can sink an elegant design. EPPP measurement items ask whether a test, rating scale, observation system, or outcome measure is consistent enough and meaningful enough for the proposed use. The safest reasoning links the score to the decision being made.

Reliability is consistency. The four classic forms each estimate a different source of error:

Reliability type     | What it checks                        | Typical statistic / cue
Test-retest          | Stability across time (stable trait)  | Correlation between two administrations
Interrater           | Agreement among observers/coders      | Cohen's kappa, intraclass correlation
Internal consistency | Do items hang together?               | Cronbach's alpha, split-half
Alternate-forms      | Equivalence of two versions           | Correlation between Form A and Form B
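
To make the internal-consistency row concrete, here is a minimal sketch of Cronbach's alpha using the standard formula (k / (k - 1)) × (1 − sum of item variances / variance of total scores). The function name and the response data are hypothetical and exist only for illustration; real analyses would normally rely on an established statistics package.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Estimate Cronbach's alpha.

    responses: list of respondents, each a list of k item scores
    (hypothetical data layout chosen for this sketch).
    """
    k = len(responses[0])
    # Sample variance of each item across respondents
    item_vars = [variance([person[i] for person in responses]) for i in range(k)]
    # Sample variance of each respondent's total score
    total_var = variance([sum(person) for person in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Made-up responses: 5 people answering 4 Likert-type items
example = [
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]
print(round(cronbach_alpha(example), 3))
```

A high alpha only says that the items covary. It says nothing about whether they cover the construct, which is a content-validity question.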

A reliability coefficient is evidence about consistency under specified conditions, not a moral rating of a test. Random measurement error lowers reliability, and lower reliability attenuates observed correlations and effect estimates, which is why an unreliable measure can hide a real effect.
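
The attenuation idea can be sketched with Spearman's classic formula, in which the observed correlation equals the true correlation multiplied by the square root of the product of the two measures' reliabilities. The function names and example values below are assumptions made for illustration.

```python
import math


def attenuated_r(true_r, rel_x, rel_y):
    """Correlation expected after random error in both measures."""
    return true_r * math.sqrt(rel_x * rel_y)


def disattenuated_r(observed_r, rel_x, rel_y):
    """Spearman's correction: estimate of the error-free correlation."""
    return observed_r / math.sqrt(rel_x * rel_y)


# Hypothetical case: true correlation .50, reliabilities .60 and .70
observed = attenuated_r(0.50, 0.60, 0.70)
print(round(observed, 3))
print(round(disattenuated_r(observed, 0.60, 0.70), 3))
```

With a true correlation of .50, a predictor reliability of .64, and a perfectly reliable criterion, the expected observed correlation drops to .40.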

Validity concerns whether evidence and theory support the interpretation and use of scores. Validity is not a permanent property of a test; it is conditional on population and purpose. A depression screener may be valid for symptom screening in primary care yet inadequate for diagnosis, disability determination, or forensic conclusions. The EPPP frequently rewards the option that asks whether the measure was validated for this population and this decision.

Validity Evidence and Classification Accuracy

Modern psychometrics treats validity as a unified judgment supported by several sources of evidence:

  • Content validity evidence — do the items adequately sample the construct's domain? Subject-matter experts review coverage and relevance.
  • Criterion-related evidence — does the score relate to an external criterion? It is predictive when the score forecasts a later outcome (admissions test predicting GPA) and concurrent when it relates to a present criterion (a new screener correlated with an established one).
  • Construct validity evidence — does the measure behave as the construct predicts? Convergent evidence means it correlates with similar constructs; discriminant evidence means it does not correlate strongly with unrelated constructs (the multitrait-multimethod logic).

Reliability is necessary but not sufficient for validity. A bathroom scale that is always five pounds heavy is perfectly consistent yet inaccurate. A psychometrically tight scale can still fail if items misrepresent the construct, language is inappropriate for the client group, or the score is stretched beyond its validation evidence.

Classification measures add sensitivity and specificity:

  • Sensitivity = proportion of true cases correctly identified (true-positive rate). Screeners prioritize sensitivity because missing a serious condition is costly.
  • Specificity = proportion of true non-cases correctly identified (true-negative rate). Confirmatory decisions prioritize specificity to limit false positives.

Base rates determine what a test result means. Sensitivity and specificity describe the test itself, but positive predictive value (PPV), the chance that a positive screen reflects a true case, depends on prevalence and falls sharply when the disorder is rare, even with excellent sensitivity. Screening a low-prevalence population can produce many false positives, which is why a positive screen warrants confirmatory assessment rather than immediate diagnosis. For exam questions, slow down when an option says valid without saying valid for what; the stronger option names the construct, population, decision, and evidence.
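
The base-rate effect is easiest to see with numbers. The sketch below applies Bayes' rule to compute positive and negative predictive values from sensitivity, specificity, and prevalence. The function name and the screener figures are hypothetical, and the sketch assumes at least some positive and some negative results (otherwise the ratios are undefined).

```python
def predictive_values(sensitivity, specificity, base_rate):
    """Return (PPV, NPV) as proportions for a given prevalence."""
    true_pos = sensitivity * base_rate
    false_neg = (1 - sensitivity) * base_rate
    true_neg = specificity * (1 - base_rate)
    false_pos = (1 - specificity) * (1 - base_rate)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Hypothetical screener: sensitivity .90, specificity .90
for prevalence in (0.50, 0.10, 0.02):
    ppv, npv = predictive_values(0.90, 0.90, prevalence)
    print(f"base rate {prevalence:.2f}: PPV {ppv:.2f}, NPV {npv:.2f}")
```

With these made-up values, PPV is .90 when half the population has the disorder but only about .16 at a 2% base rate, so most positive screens in the rare-disorder setting are false positives.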

Score Interpretation and Standardized Metrics

Measurement on the EPPP also covers how raw scores become interpretable. Standardization places a score within a reference distribution. A z score has a mean of 0 and standard deviation of 1. A T score has a mean of 50 and SD of 10 (used by the MMPI and many clinical inventories), so a T of 70 is two SD above the mean. A standard IQ/index score typically has a mean of 100 and SD of 15, so a score of 130 sits two SD above the mean, around the 98th percentile.

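Percentile ranks report the percentage of the norm group scoring at or below a value. They are not equal-interval: on a normal curve, a given percentile gap corresponds to a small score difference near the middle of the distribution and a much larger one in the tails, so percentiles exaggerate differences among average scorers and understate differences among extreme scorers.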

Metric                        | Mean | SD | Score 2 SD above mean
z score                       | 0    | 1  | +2.0
T score                       | 50   | 10 | 70
IQ / index score              | 100  | 15 | 130
Wechsler subtest scaled score | 10   | 3  | 16
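
As a rough illustration of how these metrics relate, the sketch below converts any of the four scores to a z score and back out to another metric, and estimates a percentile rank from the normal curve. The `SCALES` dictionary and function names are hypothetical, and the percentile calculation assumes normally distributed scores, which real norm tables may only approximate.

```python
from statistics import NormalDist

# (mean, SD) for each metric in the table above
SCALES = {"z": (0, 1), "T": (50, 10), "IQ": (100, 15), "scaled": (10, 3)}


def convert(score, from_scale, to_scale):
    """Re-express a score on another standardized metric via z."""
    mean_in, sd_in = SCALES[from_scale]
    mean_out, sd_out = SCALES[to_scale]
    z = (score - mean_in) / sd_in
    return mean_out + z * sd_out


def percentile_rank(score, scale):
    """Percent of a normal norm group scoring at or below the score."""
    mean, sd = SCALES[scale]
    return 100 * NormalDist(mean, sd).cdf(score)


print(convert(70, "T", "IQ"))
print(round(percentile_rank(130, "IQ"), 1))

# The same 5-point IQ gap, in the middle versus in the tail
middle_gap = percentile_rank(105, "IQ") - percentile_rank(100, "IQ")
tail_gap = percentile_rank(135, "IQ") - percentile_rank(130, "IQ")
print(round(middle_gap, 1), round(tail_gap, 1))
```

Under the normality assumption, the 5-point gap from 100 to 105 spans about 13 percentile points, while the same gap from 130 to 135 spans little more than 1, which is why percentile ranks should not be treated as equal-interval.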

Two concepts govern the precision of an individual score. The standard error of measurement (SEM) estimates the band of error around an observed score; it shrinks as reliability rises, and a confidence interval built from the SEM (observed score plus or minus about 1.96 SEM for 95%) communicates how much a score might fluctuate on retest. The EPPP rewards reporting a score with its confidence band rather than as a fixed point, because two clients whose scores differ by a few points may not differ reliably.
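
A short sketch of the arithmetic, using the standard formula SEM = SD × sqrt(1 − reliability). The function names and the example values (SD 15, reliability .91) are assumptions for illustration, and the band is centered on the observed score as described above; some test manuals center it on an estimated true score instead.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM shrinks toward 0 as reliability approaches 1."""
    return sd * math.sqrt(1 - reliability)


def confidence_band(observed, sd, reliability, z=1.96):
    """Approximate 95% band (by default) around an observed score."""
    margin = z * standard_error_of_measurement(sd, reliability)
    return observed - margin, observed + margin


# Hypothetical index score: observed 100, SD 15, reliability .91
sem = standard_error_of_measurement(15, 0.91)
low, high = confidence_band(100, 15, 0.91)
print(round(sem, 2), round(low, 1), round(high, 1))
```

Here the SEM is 4.5 points, so the 95% band runs from roughly 91 to 109. Two clients scoring 100 and 104 on this hypothetical measure would have heavily overlapping bands.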

Finally, distinguish norm-referenced from criterion-referenced interpretation. A norm-referenced score (percentile, T score) says how a person compares to others; a criterion-referenced score says whether the person met an absolute standard (mastered the skill, exceeded a cutoff). A measure can be excellent for one purpose and inappropriate for the other. Outdated norms are a recurring fairness and validity concern: because of the Flynn effect, the gradual rise in population IQ over decades, scores computed against old norms come out inflated.

The exam-ready habit is to tie every score to its norm group, its purpose, and the error around it before drawing a conclusion.

Test Your Knowledge

Which statement best distinguishes reliability from validity?

Test Your Knowledge

A scale shows Cronbach's alpha of .94, but expert review finds its items omit major facets of the construct. What is the main concern?

Test Your Knowledge

A highly sensitive screener is applied to a population where the disorder is rare. What is the most likely consequence?
