5.2 Psychometric Foundations: Reliability, Validity, Norms, and Error
Key Takeaways
- Reliability concerns consistency of measurement; validity concerns whether evidence supports the intended interpretation and use of a score.
- A test can be reliable without being valid for a specific referral question, population, language, or decision.
- Standard error of measurement, base rates, sensitivity, specificity, and predictive values all change how a score should be interpreted.
- EPPP items usually ask for the best interpretation of imperfect assessment data, not a memorized definition.
Psychometrics as Clinical Risk Management
Psychometrics keeps assessment from becoming impressionistic. Every score is a sample of behavior obtained under specific conditions, so the psychologist must ask whether the score is consistent enough to matter, whether the interpretation is supported, and whether the score should drive the decision at hand.
Reliability is consistency. Test-retest reliability concerns stability over time; interrater reliability concerns agreement among scorers; internal consistency (often summarized by Cronbach's alpha) concerns whether items hang together as measures of the same construct; and alternate-form reliability concerns equivalence across versions. As a rough EPPP heuristic, reliability coefficients around .90 or higher are expected for high-stakes individual decisions, while values near .70 to .80 are acceptable for group research but shaky for clinical calls about one person.
Low reliability widens error and weakens confidence in any individual conclusion.
Validity is the body of evidence supporting a particular interpretation and use. Content validity asks whether the test samples the relevant domain; criterion-related validity (concurrent or predictive) asks whether scores relate to an outcome; construct validity asks whether the test behaves as theory predicts, with convergent evidence (correlating with similar measures) and discriminant evidence (not correlating with unrelated ones). Consequential considerations ask whether use of the test predictably helps or harms in a setting.
A test can be highly reliable yet invalid for a given purpose. A depression inventory may consistently measure current distress and still tell you nothing valid about parenting capacity, malingering, or neurocognitive decline. The EPPP answer is frequently the one that limits the conclusion to the available evidence.
| Concept | Meaning | Practice implication |
|---|---|---|
| Reliability | Consistency of measurement | Lower reliability means wider uncertainty around a score |
| Validity | Evidence for interpretation and use | Evidence must match the specific referral question |
| Norms | The comparison sample for scores | Norm group must fit age, language, culture, education |
| Standard error of measurement | Expected imprecision of an observed score | Report confidence intervals, not bare point scores |
| Base rate | How common a condition is in the population | For rare conditions, a larger share of positive results on broad screens are false positives |
Norms are not decoration. A score is meaningful only against an appropriate reference group or criterion. Most cognitive and personality measures report scores in familiar metrics: a standard score with mean 100 and standard deviation 15 (so 85 to 115 is the average range, plus or minus one SD), T-scores with mean 50 and SD 10 (so a T-score of 65 or above, 1.5 SDs up, often flags clinical elevation), scaled scores with mean 10 and SD 3, and percentile ranks.
Age, education, language proficiency, disability, acculturation, and medical status can all shift the meaning of performance, so mismatched norms must be qualified, replaced, or supplemented with converging evidence.
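The score metrics above are all linear rescalings of a z-score, so converting between them is simple arithmetic. The following is a minimal sketch, not a scoring tool: the names `METRICS`, `to_z`, `convert`, and `percentile_rank` are hypothetical, and the percentile calculation assumes scores are normally distributed in the norm group.

```python
from statistics import NormalDist

# (mean, SD) for each metric named in the text
METRICS = {
    "z": (0.0, 1.0),
    "T": (50.0, 10.0),
    "standard": (100.0, 15.0),
    "scaled": (10.0, 3.0),
}

def to_z(score, metric):
    """Express a score as SDs above or below the mean of its metric."""
    mean, sd = METRICS[metric]
    return (score - mean) / sd

def convert(score, source, target):
    """Re-express a score from one metric in another metric."""
    mean, sd = METRICS[target]
    return mean + to_z(score, source) * sd

def percentile_rank(score, metric):
    """Percentile rank, ASSUMING a normal distribution of scores."""
    return 100 * NormalDist().cdf(to_z(score, metric))

print(to_z(65, "T"))                 # SDs above the mean for T = 65
print(convert(65, "T", "standard"))  # the same position as a standard score
print(percentile_rank(85, "standard"))
```

Converting to z first makes it easy to see where a score sits before comparing it with a score reported in a different metric.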
The standard error of measurement (SEM) reminds candidates that observed scores are estimates. A single score should never be treated as exact, especially near a cut point or in a high-stakes decision. A 95% confidence interval spans roughly plus or minus two SEMs; an IQ of 98 with an SEM of 3 has a confidence band of about 92 to 104, which can straddle a classification boundary. Regression to the mean further warns that extreme scores tend to move toward average on retest.
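The confidence-band arithmetic can be sketched in a few lines. This is an illustration under stated assumptions: the classical test theory formula SEM = SD × sqrt(1 − reliability) is not given in the text above and is added here for context, the function names are hypothetical, and the interval is centered on the observed score rather than on an estimated true score, which is a simplification.

```python
import math

def sem_from_reliability(sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

def confidence_interval(observed, sem, z=1.96):
    """Band around an observed score; z = 1.96 gives about 95%."""
    margin = z * sem
    return observed - margin, observed + margin

# The example from the text: IQ of 98 with an SEM of 3
low, high = confidence_interval(98, 3)
print(f"{low:.1f} to {high:.1f}")  # 92.1 to 103.9, i.e. roughly 92 to 104

# Lower reliability means a larger SEM on the same SD-15 scale
for r in (0.96, 0.90, 0.80):
    print(r, round(sem_from_reliability(15, r), 2))
```

Note how quickly the band widens as reliability drops, which is why the same observed score supports weaker conclusions on a less reliable instrument.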
Sensitivity and specificity appear constantly in screening items. Sensitivity is the proportion of true cases the test correctly flags; specificity is the proportion of non-cases it correctly clears. Positive predictive value (PPV) and negative predictive value (NPV) depend on the base rate, so the same instrument performs differently in a specialty clinic (high base rate, high PPV) than in a low-prevalence community screen (low PPV, many false positives).
When you see a sensitive screen used in a low-base-rate population, expect the EPPP to reward the answer that says a positive result needs confirmatory follow-up before diagnosis.
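To see why the base rate drives predictive value, it helps to work the numbers. The sketch below applies the standard definitions of PPV and NPV; the function name `predictive_values` and the example figures (90% sensitivity, 90% specificity, base rates of 2% and 50%) are hypothetical and chosen only for illustration.

```python
def predictive_values(sensitivity, specificity, base_rate):
    """Return (PPV, NPV) for a test used at a given base rate.

    All arguments are proportions between 0 and 1. Assumes the test
    produces at least some positive and some negative results.
    """
    true_pos = sensitivity * base_rate
    false_neg = (1 - sensitivity) * base_rate
    true_neg = specificity * (1 - base_rate)
    false_pos = (1 - specificity) * (1 - base_rate)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv

# Same hypothetical instrument, two settings
for label, base_rate in [("community screen", 0.02), ("specialty clinic", 0.50)]:
    ppv, npv = predictive_values(0.90, 0.90, base_rate)
    print(f"{label}: PPV = {ppv:.2f}, NPV = {npv:.2f}")
```

With these illustrative values, PPV is roughly .16 at a 2% base rate and .90 at a 50% base rate, even though sensitivity and specificity never change.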
Response-style and validity indicators require the same restraint. Overreporting, underreporting, inconsistency, defensiveness, random responding, fatigue, low literacy, or cultural mismatch can all distort scores. The correct move is rarely to discard every datum; it is to interpret cautiously, document the limit, and seek additional evidence when the referral question still matters.
For exam scenarios, run this checklist:
- Identify the construct and the decision being made.
- Ask what reliability evidence the decision demands.
- Ask what validity evidence supports the proposed interpretation.
- Confirm the norm group, language, and education fit the examinee.
- Account for SEM, base rates, and response style.
- Integrate the score with interview, observation, records, and collateral data.
Psychometric competence protects clients and institutions from overconfidence and improves communication: a clear report states what a score supports, what it does not support, how much uncertainty remains, and what additional data would change the conclusion.
Common Psychometric Traps the EPPP Sets
Several psychometric distinctions appear so often that they deserve explicit drilling. The first is reliability versus validity: a coefficient near .95 tells you the instrument is consistent, never that it measures the right thing for the present decision. Whenever an option says a test is valid 'because it is reliable,' that option is wrong. A second trap concerns the ceiling on validity: a test's validity is bounded by its reliability (in classical test theory, a validity coefficient cannot exceed the square root of the reliability coefficient), because you cannot validly measure a construct you cannot measure consistently.
This is why low reliability is fatal for individual high-stakes decisions even when group-level validity studies look acceptable.
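The ceiling can be made concrete with the classical test theory bound, under which an observed validity coefficient cannot exceed the square root of the product of the test's and the criterion's reliabilities. The sketch below is illustrative only; `max_validity` is a hypothetical name, and the criterion is assumed to be perfectly reliable unless stated otherwise.

```python
import math

def max_validity(test_reliability, criterion_reliability=1.0):
    """Upper bound on the observed test-criterion correlation
    under classical test theory assumptions."""
    return math.sqrt(test_reliability * criterion_reliability)

# The ceiling drops as reliability drops
for r in (0.95, 0.81, 0.49):
    print(f"reliability {r:.2f} -> validity ceiling {max_validity(r):.2f}")
```

The bound is only a ceiling: a highly reliable test can still have a validity coefficient near zero for a given purpose.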
A third frequent trap involves base rates and predictive value. Candidates who memorize that a test has '90% sensitivity' may forget that in a low-prevalence setting most positive results can still be false positives. The EPPP rewards the reasoning that prevalence drives positive predictive value, so a positive screen for a rare condition warrants a confirmatory evaluation rather than an immediate diagnosis. A fourth trap is treating an observed score as exact. The presence of a confidence interval or standard error of measurement in an option usually signals the more defensible answer, because it acknowledges imprecision near a cut score.
A fifth area is score metrics and their relationships. Know that a percentile rank is not an interval scale, so the difference between the 50th and 60th percentile is not the same amount of ability as the difference between the 90th and 99th percentile; scores bunch in the middle of a normal distribution. Standard scores, T-scores, scaled scores, and z-scores are interval and can be compared with one another once converted to a common metric.
Roughly 68% of a normal distribution falls within one standard deviation of the mean, about 95% within two, and about 99.7% within three, which is the logic behind flagging a T-score of 70 (two SDs up) or a standard score of 70 (two SDs down) as clinically notable. Anchoring these conversions early prevents arithmetic errors under exam time pressure, where a single misread metric can flip an entire answer.
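Both points, the 68/95/99.7 rule and the unequal spacing of percentile ranks, can be checked numerically. This is a minimal sketch assuming a perfectly normal distribution and the mean-100, SD-15 standard score metric; the helper names are hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, SD 1

def proportion_within(k):
    """Share of a normal distribution within k SDs of the mean."""
    return STD_NORMAL.cdf(k) - STD_NORMAL.cdf(-k)

def standard_score_at(percentile, mean=100.0, sd=15.0):
    """Standard score at a given percentile rank (strictly between 0 and 100)."""
    return mean + sd * STD_NORMAL.inv_cdf(percentile / 100)

for k in (1, 2, 3):
    print(f"within {k} SD: {proportion_within(k):.3f}")

# Percentile gaps near the middle versus in the upper tail
gap_middle = standard_score_at(60) - standard_score_at(50)
gap_tail = standard_score_at(99) - standard_score_at(90)
print(f"50th to 60th percentile: {gap_middle:.1f} standard-score points")
print(f"90th to 99th percentile: {gap_tail:.1f} standard-score points")
```

Under these assumptions the ten-percentile step near the median covers only about 4 standard-score points, while the nine-percentile step in the tail covers more than 15, which is why percentile ranks should not be averaged or subtracted as if they were interval data.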
A test yields nearly identical scores across two administrations but has little evidence that it predicts the referral outcome. Which statement is best?
A screening tool with strong sensitivity is used where the target condition is uncommon. What should the psychologist remember?
An examinee scores an IQ of 98 with a standard error of measurement of 3. Which interpretation reflects sound psychometrics?