7.3 Measurement, Reliability, and Validity

Key Takeaways

  • Reliability concerns consistency of measurement, while validity concerns the meaning and use of scores.
  • A measure can be reliable without being valid for a particular decision.
  • Construct, criterion-related, content, convergent, and discriminant evidence help evaluate whether scores support the intended inference.
  • Measurement quality affects research conclusions, assessment decisions, and evidence-based practice.

Last updated: May 2026

Scores Are Evidence, Not Magic

Psychological research depends on measurement. If the measure is weak, even an elegant design can produce a shaky conclusion. On the EPPP, measurement questions often ask whether a test, rating scale, observation system, interview code, or outcome measure is consistent enough and meaningful enough for the proposed use. The safest reasoning is to link the score to the decision being made.

Reliability refers to consistency. Test-retest reliability concerns stability over time. Interrater reliability concerns agreement among observers or coders. Internal consistency concerns whether items on a scale are measuring related content. Alternate-forms reliability concerns whether different versions produce comparable scores. A reliability coefficient is not a moral rating of a test; it is evidence about consistency under specified conditions.
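As a rough numeric illustration of what an internal consistency coefficient summarizes, here is a minimal sketch of Cronbach's alpha; the item responses below are hypothetical:

```python
# Minimal sketch of Cronbach's alpha, a common internal consistency index.
# Rows are respondents, columns are items on the same scale (hypothetical data).

def cronbach_alpha(rows):
    k = len(rows[0])  # number of items

    def variance(xs):  # sample variance with n - 1 denominator
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = [variance([r[i] for r in rows]) for i in range(k)]
    total_var = variance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Five hypothetical respondents answering four related items (1-5 ratings)
responses = [
    [4, 4, 5, 4],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 3, 3, 3],
    [1, 2, 2, 1],
]
print(round(cronbach_alpha(responses), 2))
```

Items that rise and fall together across respondents push alpha upward; note that a high alpha says nothing by itself about whether the items represent the intended construct.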

Measurement concept        | What it asks                                                       | Common EPPP cue
Test-retest reliability    | Are scores stable across time when the construct should be stable? | Re-administering a measure after a short interval.
Interrater reliability     | Do observers score the same behavior similarly?                    | Multiple clinicians code recorded sessions.
Internal consistency       | Do items on a scale hang together?                                 | Items are intended to assess one construct.
Content validity evidence  | Does the measure cover the domain adequately?                      | Subject matter experts review item coverage.
Criterion-related evidence | Does the score relate to an outcome or criterion?                  | Scores predict later functioning or correlate with an established measure.

Validity concerns whether evidence and theory support the interpretation and use of scores. A test does not have one permanent validity status for every setting. A depression screener may be valid for initial symptom screening in one population but insufficient for diagnosis, disability determination, or high-stakes forensic conclusions. The EPPP often rewards the answer that asks whether the test was validated for the population and purpose in the vignette.

Construct validity evidence asks whether a measure behaves as expected if it truly reflects the construct. Convergent evidence means the measure correlates with measures of similar constructs or with established instruments for the same construct. Discriminant evidence means it does not correlate too strongly with measures of distinct constructs. Criterion-related evidence can be predictive, when the score forecasts a later outcome, or concurrent, when it relates to a present criterion.
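To make the convergent and discriminant patterns concrete, here is a hypothetical sketch: all scores and instrument names below are invented, and Pearson correlation stands in for the validity evidence:

```python
# Hypothetical scores for six examinees on a new anxiety scale, an
# established anxiety measure (convergent check), and a vocabulary
# test (discriminant check). Pearson r summarizes each relationship.

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

new_scale   = [10, 14, 9, 18, 12, 16]
established = [11, 15, 10, 17, 13, 15]  # similar construct
vocabulary  = [28, 31, 25, 27, 30, 24]  # different construct

print(round(pearson_r(new_scale, established), 2))  # high -> convergent evidence
print(round(pearson_r(new_scale, vocabulary), 2))   # near zero -> discriminant evidence
```

A strong correlation with the established anxiety measure and a near-zero correlation with vocabulary is the pattern construct validation looks for; the reverse pattern would be a red flag.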

Reliability is necessary but not sufficient for validity. A bathroom scale that is always five pounds off is consistent but inaccurate for actual weight. In psychology, a highly consistent measure can still fail if the items do not represent the construct, the language is inappropriate for the client group, or the score is used for a decision beyond the validation evidence.

Measurement also includes sensitivity and specificity for classification. Sensitivity is the ability to detect true cases; specificity is the ability to identify non-cases. A screening tool usually prioritizes sensitivity because missing a serious condition can be costly, while a confirmatory decision may require higher specificity to avoid false positives. Base rates also matter: when a condition is rare, even an accurate test can yield more false positives than true positives among those who screen positive.
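The base-rate point lends itself to a short worked example; the 90% sensitivity and specificity figures and the two base rates below are hypothetical:

```python
# Positive predictive value (PPV): of those who screen positive,
# what fraction are true cases? Computed from Bayes' rule.

def ppv(sensitivity, specificity, base_rate):
    true_pos = sensitivity * base_rate               # P(positive and case)
    false_pos = (1 - specificity) * (1 - base_rate)  # P(positive and non-case)
    return true_pos / (true_pos + false_pos)

# Same hypothetical test (90% sensitivity, 90% specificity), two base rates
for base_rate in (0.30, 0.01):
    print(f"base rate {base_rate:.0%}: PPV = {ppv(0.90, 0.90, base_rate):.2f}")
```

At a 30% base rate the PPV is about 0.79, but at a 1% base rate it falls to roughly 0.08: most positive screens are false positives even though the test is 90% accurate in both directions, which is why confirmatory follow-up matters for rare conditions.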

For exam questions, slow down when an answer option says "valid" without specifying valid for what. The stronger option identifies the intended construct, population, decision, and supporting evidence. That habit aligns research methods with assessment ethics and clinical judgment.

Test Your Knowledge

Which statement best distinguishes reliability from validity?

A measure has high internal consistency but its items do not cover the construct being assessed. What is the main concern?

What does sensitivity describe in a classification measure?
