5.2 Psychometric Foundations: Reliability, Validity, Norms, and Error
Key Takeaways
- Reliability concerns consistency of measurement, while validity concerns whether evidence supports the intended interpretation and use.
- A test can be reliable without being valid for a specific referral question, population, language, or decision.
- Norms, standard error, base rates, sensitivity, specificity, and predictive values all affect interpretation.
- EPPP items often ask for the best interpretation of imperfect assessment data rather than a memorized definition.
Psychometrics as Clinical Risk Management
Psychometrics is the science that keeps assessment from becoming impressionistic. In practice, every score is a sample of behavior obtained under specific conditions. A psychologist must ask whether the score is consistent enough to matter, whether the interpretation is supported, and whether the score should influence the decision at hand.
Reliability refers to consistency. Test-retest reliability concerns stability over time. Interrater reliability concerns agreement among raters. Internal consistency concerns whether items on a scale hang together in measuring the same construct. Alternate-form reliability concerns consistency across equivalent versions of a test. Low reliability increases measurement error and weakens confidence in decisions about individuals.
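To make internal consistency concrete, here is a minimal sketch of Cronbach's alpha, the most commonly reported internal-consistency coefficient. The item scores below are invented for illustration and do not come from any actual instrument:

```python
from statistics import pvariance

def cronbach_alpha(items):
    """Cronbach's alpha from item-score columns.

    items: list of equal-length lists, one list per item,
    each containing one score per examinee.
    """
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]          # total score per examinee
    item_var = sum(pvariance(col) for col in items)           # sum of item variances
    return k / (k - 1) * (1 - item_var / pvariance(totals))

# Hypothetical 4-item scale, five examinees (illustrative data only).
items = [
    [3, 4, 2, 5, 4],
    [3, 5, 2, 4, 4],
    [2, 4, 3, 5, 5],
    [3, 4, 2, 5, 3],
]
print(round(cronbach_alpha(items), 2))  # → 0.91
```

Alpha rises when items covary strongly relative to their separate variances, which is exactly the "items hang together" idea; a high alpha still says nothing about what the scale is valid for.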
Validity refers to evidence for interpretation and use. Content evidence asks whether the test samples the relevant domain. Criterion-related evidence asks whether scores relate to an outcome or external standard. Construct evidence asks whether the test behaves as theory predicts. Consequential considerations ask whether use of the test creates predictable harms or benefits in a setting.
A test can be highly reliable and still invalid for a particular purpose. A depression scale may consistently measure current distress, but it may not validly determine parenting capacity, malingering, neurocognitive impairment, or workplace safety by itself. The EPPP answer is often the one that limits conclusions to the available evidence.
| Concept | Meaning | Practice implication |
|---|---|---|
| Reliability | Consistency of measurement | Lower reliability means wider uncertainty |
| Validity | Support for interpretation and use | Evidence must match the referral question |
| Norms | Comparison group for scores | Norm group must fit age, language, culture, and context |
| Standard error | Expected score imprecision | Interpret ranges and confidence, not only point scores |
| Base rate | How common a condition is in the population | Low base rates lower positive predictive value, so more positive results are false |
Norms are not decoration. A score is meaningful only against an appropriate reference group or criterion. Age, education, language proficiency, disability, acculturation, medical status, and setting can change the meaning of performance. When norms are mismatched, the psychologist should qualify the interpretation, seek better instruments, consult, or use converging evidence.
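Norm-referenced scoring is mechanically simple, which is why the choice of reference group carries so much weight. A minimal sketch, using hypothetical norm-group values, shows how the same raw score yields different standard scores under different norms:

```python
def t_score(raw, norm_mean, norm_sd):
    # Convert a raw score to a T score (mean 50, SD 10)
    # relative to a chosen norm group.
    z = (raw - norm_mean) / norm_sd
    return 50 + 10 * z

# Hypothetical norm groups (illustrative values only).
print(t_score(24, norm_mean=18, norm_sd=4))  # → 65.0 against one norm group
print(t_score(24, norm_mean=22, norm_sd=5))  # → 54.0 against a different group
```

The raw score never changed; only the comparison group did. A T of 65 versus 54 can be the difference between "clinically elevated" and "unremarkable," which is why a mismatched norm group warrants qualified interpretation.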
The standard error of measurement reminds candidates that observed scores are imperfect estimates. A single score should not be treated as exact, especially near a cut point or when the decision has high stakes. Confidence intervals, behavioral observations, history, and collateral data help keep the conclusion proportional to the evidence.
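The standard error of measurement follows directly from the score scale's standard deviation and the test's reliability. The sketch below uses a standard-score metric (mean 100, SD 15) and an assumed reliability of .91, both illustrative:

```python
import math

def sem(sd, reliability):
    # Standard error of measurement: SD * sqrt(1 - r_xx)
    return sd * math.sqrt(1 - reliability)

def confidence_interval(observed, sd, reliability, z=1.96):
    # Approximate 95% band around an observed score.
    margin = z * sem(sd, reliability)
    return (observed - margin, observed + margin)

print(round(sem(15, 0.91), 2))  # → 4.5
lo, hi = confidence_interval(72, 15, 0.91)
print(round(lo, 1), round(hi, 1))  # → 63.2 80.8
```

Note that the interval around an observed score of 72 straddles a common cut point of 70, which is precisely the situation where treating a point score as exact is most hazardous.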
Sensitivity and specificity are core concepts in screening and diagnostic assessment. Sensitivity concerns the ability to identify people who have the condition. Specificity concerns the ability to identify people who do not have it. Positive predictive value and negative predictive value depend on base rates, so the same test can perform differently in specialty clinics, community samples, and forensic settings.
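The base-rate effect on positive predictive value falls out of a one-line Bayes calculation. The sensitivity, specificity, and base rates below are invented for illustration:

```python
def ppv(sensitivity, specificity, base_rate):
    # Positive predictive value via Bayes' rule:
    # P(condition | positive) = TP / (TP + FP)
    tp = sensitivity * base_rate              # true-positive probability
    fp = (1 - specificity) * (1 - base_rate)  # false-positive probability
    return tp / (tp + fp)

# The same hypothetical screen (sensitivity = specificity = .90)
# in two settings with different base rates.
print(round(ppv(0.90, 0.90, 0.02), 2))  # → 0.16 in a low-base-rate community sample
print(round(ppv(0.90, 0.90, 0.30), 2))  # → 0.79 in a specialty clinic
```

With identical test properties, roughly five of six positives are false in the community sample, while most positives are true in the clinic. This is the arithmetic behind the exam point that a "strong" screen can still mislead when the condition is rare.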
Validity indicators and response style measures also require careful interpretation. Overreporting, underreporting, inconsistency, defensiveness, random responding, fatigue, low literacy, misunderstanding, and cultural mismatch can all affect scores. The answer is not automatically to discard all data. The answer is to interpret cautiously, document limits, and seek additional evidence when the referral question remains important.
For exam scenarios, follow this checklist:
- Identify the construct and the decision being made.
- Ask what reliability evidence matters for that decision.
- Ask what validity evidence supports the proposed interpretation.
- Check whether the norm group and language fit the examinee.
- Consider error, base rates, and response style.
- Integrate scores with interview, observation, records, and collateral data.
Psychometric competence protects clients and institutions from overconfidence. It also improves communication. A clear report explains what a score supports, what it does not support, how much uncertainty remains, and what additional data would change the conclusion.
Practice Questions
- A test produces very similar scores across two administrations but has little evidence for predicting the referral outcome. Which statement is best?
- A screening tool has strong sensitivity, but the condition is uncommon in the referral population. What should the psychologist remember?
- Which factor most directly weakens the interpretation of a norm-referenced score?