8.1 Assessment and Testing Overview

Key Takeaways

  • Assessment (the CACREP "Appraisal" area) is one of 8 equally weighted CPCE domains: 20 items, 17 scored, so 12.5% of the 136 scored questions.
  • Reliability = consistency of scores; validity = whether the test measures what it claims and supports the intended interpretation.
  • A test can be reliable without being valid, but it cannot be valid without first being reliable.
  • Standardized scores (z, T, standard score, percentile, stanine) all locate a person relative to a norm group; learn the conversions cold.

Last updated: June 2026

This chapter covers the CACREP common-core area officially titled Appraisal of Individuals and Groups (often labeled "Assessment and Testing"). The Counselor Preparation Comprehensive Examination (CPCE) has 160 multiple-choice items split into 8 content areas of 20 items each. Within each area, 17 items are scored and 3 are unscored pretest items, so the scored exam is 136 questions. Appraisal therefore contributes 17 scored items (12.5% of the scored total). There is no single national pass score; each program sets its own cut score, commonly near one standard deviation below the national mean.
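The exam arithmetic above can be checked in a few lines. This is a minimal sketch: the item counts come from the paragraph above, while the mean and standard deviation passed to `cut_score` are hypothetical placeholders, not actual CPCE statistics.

```python
# Item counts taken from the exam description above.
TOTAL_ITEMS = 160
CONTENT_AREAS = 8
SCORED_PER_AREA = 17

items_per_area = TOTAL_ITEMS // CONTENT_AREAS          # 20 items per area
unscored_per_area = items_per_area - SCORED_PER_AREA   # 3 pretest items per area
scored_total = CONTENT_AREAS * SCORED_PER_AREA         # 136 scored items
appraisal_share = SCORED_PER_AREA / scored_total       # 0.125, i.e. 12.5%


def cut_score(national_mean, national_sd, sd_below=1.0):
    """Cut score set a given number of SDs below the national mean."""
    return national_mean - sd_below * national_sd


# Hypothetical national statistics, for illustration only.
example_cut = cut_score(national_mean=84.0, national_sd=16.0)

print(scored_total, appraisal_share, example_cut)
```

A program that sets its cut at one standard deviation below the mean would apply `cut_score` to whatever national statistics are reported for its testing window.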

Testing time is about 3 hours 45 minutes, and the typical fee is $150.

What this domain actually tests

Appraisal is not vague "professional judgment." It is concrete measurement knowledge: defining and computing score types, distinguishing reliability from validity, knowing what each named instrument measures, and applying ethical standards for test use. Expect questions that give a number (a z-score, a percentile, a reliability coefficient) and ask what it means, or that name an instrument and ask what category it belongs to.
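Because many items hand you one score type and ask for another, it helps to see the standard conversions side by side. The sketch below assumes a normal distribution and the conventional scales (T: mean 50, SD 10; deviation-style standard score: mean 100, SD 15; stanine: mean 5, SD 2). The function names are illustrative, not from any testing library.

```python
import math
from statistics import NormalDist


def z_score(raw, mean, sd):
    """Distance from the norm-group mean in SD units."""
    return (raw - mean) / sd


def t_score(z):
    """T-score scale: mean 50, SD 10."""
    return 50 + 10 * z


def standard_score(z):
    """Deviation-style standard score: mean 100, SD 15."""
    return 100 + 15 * z


def percentile_rank(z):
    """Percent of a normal norm group scoring below this z."""
    return 100 * NormalDist().cdf(z)


def stanine(z):
    """Stanine scale: mean 5, SD 2, rounded and limited to 1 through 9."""
    return max(1, min(9, math.floor(2 * z + 5.5)))


# Example: raw score 60 in a norm group with mean 50 and SD 10.
z = z_score(60, 50, 10)
print(z, t_score(z), standard_score(z), round(percentile_rank(z), 1), stanine(z))
```

A score one standard deviation above the mean lands at T = 60, standard score 115, roughly the 84th percentile, and stanine 7.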

The two pillars: reliability and validity

Concept | Question it answers | Key indicator
Reliability | Are scores consistent and repeatable? | Reliability coefficient (0 to 1); .80+ is acceptable, .90+ for high-stakes
Validity | Does the test measure the right thing and support the decision? | Evidence from content, criterion, and construct sources

A bathroom scale that reads 5 lb too high every time is perfectly reliable (consistent) but not valid (wrong value). This is the single most tested relationship in the domain: reliability is necessary but not sufficient for validity. A test must be reliable to be valid, but reliability alone never guarantees validity.
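The bathroom-scale example can be made concrete with a few numbers. In this sketch the true weights are made up for illustration and the scale adds a constant 5 lb on every reading.

```python
# Hypothetical true weights (lb) for five people.
true_weights = [120.0, 150.0, 165.0, 180.0, 210.0]
BIAS = 5.0  # the scale reads 5 lb high every time

# Two weighings on the same biased scale.
first_reading = [w + BIAS for w in true_weights]
second_reading = [w + BIAS for w in true_weights]

# Consistency: the largest disagreement between the two occasions.
max_disagreement = max(abs(a - b) for a, b in zip(first_reading, second_reading))

# Accuracy: the average distance between reading and true weight.
mean_error = sum(r - w for r, w in zip(first_reading, true_weights)) / len(true_weights)

print(max_disagreement, mean_error)
```

The readings never disagree with each other (reliable), yet every one is off by the same 5 lb (not valid).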

Norm-referenced versus criterion-referenced

Another distinction the CPCE tests directly is how scores are interpreted. A norm-referenced test compares a person to a norm group (a sample meant to represent the population): the WAIS, SAT, and most standardized batteries work this way, and the score answers "how does this person rank against peers?" A criterion-referenced test compares performance to a fixed standard or cutoff without reference to others, answering "did the person meet the benchmark?" A licensure exam with a pass score is criterion-referenced; a percentile-based aptitude test is norm-referenced.

Watch for stems that describe a mastery cutoff (criterion) versus a ranking against others (norm).
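The same raw score can be read both ways, which the following sketch illustrates. The norm group, the cutoff of 75, and the function names are all hypothetical; the percentile rank here counts half of any tied scores, which is one common convention.

```python
def percentile_rank(score, norm_group):
    """Norm-referenced reading: percent of the norm group scoring below
    this score, counting half of any ties."""
    below = sum(1 for s in norm_group if s < score)
    equal = sum(1 for s in norm_group if s == score)
    return 100 * (below + 0.5 * equal) / len(norm_group)


def meets_criterion(score, cutoff):
    """Criterion-referenced reading: compare to a fixed standard only."""
    return score >= cutoff


# Hypothetical norm group and mastery cutoff.
norm_group = [55, 60, 62, 68, 70, 71, 75, 80, 84, 90]
score = 71

rank = percentile_rank(score, norm_group)
passed = meets_criterion(score, cutoff=75)
print(rank, passed)
```

Here the test-taker sits slightly above the middle of the norm group but still falls short of the fixed cutoff, so the two interpretations answer different questions.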

Standardization and the norming sample

A test is standardized when administration, scoring, and interpretation follow uniform procedures, and when a representative standardization (norming) sample establishes the score distribution. The quality of any norm-referenced interpretation depends on whether the norming sample matches the test-taker on relevant variables such as age, region, and demographics. When the sample is unrepresentative or outdated, even a reliable score can produce misleading interpretations, which is why the recency and relevance of norms is a recurring exam concern.

Types of reliability

  • Test-retest — same test, same people, two occasions; estimates stability over time. Threatened by practice effects and real change between sittings.
  • Internal consistency — how well items hang together within one administration; measured by coefficient alpha (Cronbach's alpha) or, for split halves, the Spearman-Brown corrected correlation.
  • Alternate (parallel) forms — two equivalent test versions; controls for memory of specific items.
  • Inter-rater — agreement between scorers, critical for projective tests and observation scales; reported as a correlation or Cohen's kappa.
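Two of the statistics named in this list, coefficient alpha and Cohen's kappa, are short enough to compute by hand. The sketch below uses the standard formulas with small made-up data sets; the function names and the data are illustrative assumptions.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Coefficient alpha. `responses` is a list of rows, one per person,
    each row holding that person's item scores."""
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # one tuple of scores per item
    item_var_sum = sum(variance(item) for item in items)
    total_var = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


def cohen_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement."""
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    categories = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)


# Hypothetical data: 5 people answering 3 items on a 1-5 scale.
responses = [[2, 3, 3], [4, 4, 5], [1, 2, 2], [5, 5, 4], [3, 3, 3]]
alpha = cronbach_alpha(responses)

# Hypothetical yes/no ratings of 10 cases by two raters.
rater_a = ["y", "y", "n", "n", "y", "n", "y", "n", "y", "y"]
rater_b = ["y", "y", "n", "y", "y", "n", "n", "n", "y", "y"]
kappa = cohen_kappa(rater_a, rater_b)

print(round(alpha, 2), round(kappa, 2))
```

Note that kappa comes out lower than the raw 80% agreement between the raters, because part of that agreement would be expected by chance alone.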

Types of validity evidence

  • Content validity — do the items adequately sample the whole domain? Established by expert review, not statistics.
  • Criterion-related validity — does the test correlate with an outcome? Concurrent (criterion measured now) versus predictive (criterion measured later, e.g., SAT predicting freshman GPA).
  • Construct validity — does the test measure the abstract trait it claims? Supported by convergent evidence (correlates with related measures) and discriminant evidence (does not correlate with unrelated ones).
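Convergent and discriminant evidence both come down to correlations, as the sketch below shows. All three score lists are invented for illustration, and `pearson_r` is a plain hand-written Pearson correlation rather than a call into any particular statistics package.

```python
def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (ss_x * ss_y) ** 0.5


# Hypothetical scores for the same five clients.
new_scale = [10, 14, 18, 22, 26]          # the instrument being validated
established = [12, 15, 17, 23, 25]        # accepted measure of the same construct
unrelated = [7, 3, 9, 4, 8]               # measure of a different construct

convergent = pearson_r(new_scale, established)
discriminant = pearson_r(new_scale, unrelated)
print(round(convergent, 2), round(discriminant, 2))
```

A high correlation with the related measure alongside a low correlation with the unrelated one is the pattern that supports construct validity. The same `pearson_r` against an outcome measured now or later would give a concurrent or predictive validity coefficient.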

Standard error of measurement

The standard error of measurement (SEM) estimates how much an observed score would vary on retesting: as reliability rises, SEM falls. A counselor reports a confidence band rather than a single point, because no score is error-free. A band of the score plus or minus 1 SEM covers roughly 68% of the scores a person would be expected to obtain, and plus or minus 2 SEM covers roughly 95%. Recognizing that the SEM links reliability to score interpretation is a frequent exam target.
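The usual formula is SEM = SD × √(1 − reliability). Here is a minimal sketch of that formula and the resulting band. The SD of 15, reliability of .91, and observed score of 100 are example values chosen for round numbers.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def confidence_band(observed, sem, n_sem=1.0):
    """Band of n_sem standard errors around the observed score."""
    return observed - n_sem * sem, observed + n_sem * sem


# Example values: SD 15, reliability .91, observed score 100.
sem = standard_error_of_measurement(sd=15, reliability=0.91)
low, high = confidence_band(100, sem)             # about 68% band
low95, high95 = confidence_band(100, sem, 1.96)   # about 95% band

print(round(sem, 2), round(low, 2), round(high, 2))
```

With these values the SEM is about 4.5 points, so the one-SEM band runs from roughly 95.5 to 104.5. Raising reliability toward 1.0 drives the SEM toward zero.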

Classical test theory in one line

Under classical test theory, every observed score = true score + error. Reliability is the proportion of observed-score variance that is true-score variance; error is everything unsystematic. This is why improving reliability (more items, clearer wording, trained raters) shrinks the SEM and tightens the confidence band. Random error lowers reliability and is unpredictable; systematic error (bias) does not lower reliability but does threaten validity because it shifts scores consistently in one direction.
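Both claims in this paragraph can be illustrated numerically. The sketch below treats reliability as the ratio of true-score variance to total variance, then shows that a constant bias moves the mean without changing the spread of scores. The variance values and score list are hypothetical.

```python
from statistics import mean, pvariance


def reliability(true_variance, error_variance):
    """Proportion of observed-score variance that is true-score variance,
    assuming true scores and random error are uncorrelated."""
    return true_variance / (true_variance + error_variance)


# Hypothetical variance components.
baseline = reliability(true_variance=90, error_variance=10)
less_error = reliability(true_variance=90, error_variance=5)

# Systematic error: add the same 5 points to every observed score.
observed = [40, 45, 50, 55, 60]
biased = [score + 5 for score in observed]

print(baseline, round(less_error, 3))
print(mean(observed), mean(biased), pvariance(observed), pvariance(biased))
```

Cutting random error raises the reliability ratio. The constant shift leaves the variance untouched, so it does not register as unreliability even though every score is now wrong by the same amount.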

Factors that raise or lower reliability

  • Test length — adding well-written items generally raises internal consistency (the logic behind the Spearman-Brown formula).
  • Item quality and clarity — ambiguous items add random error.
  • Sample heterogeneity — a wider range of true ability inflates reliability estimates; a restricted range deflates them.
  • Scoring objectivity — objective scoring (machine-scored multiple choice) is more reliable than subjective scoring, which depends on inter-rater agreement.

Keep these levers straight: if a stem says a test was lengthened or items were clarified, expect reliability to improve, which raises the ceiling on validity without guaranteeing it; if it describes a narrow, homogeneous sample, expect a deflated reliability coefficient.
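The test-length lever is quantified by the Spearman-Brown prophecy formula, r_new = k·r / (1 + (k − 1)·r), where k is the factor by which the test is lengthened. The sketch below applies it to a starting reliability of .70, chosen only as an example; the formula assumes the added items are comparable in quality to the existing ones.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after changing test length by `length_factor`
    (2.0 doubles the test, 0.5 halves it)."""
    k = length_factor
    return (k * reliability) / (1 + (k - 1) * reliability)


# Example starting reliability of .70.
doubled = spearman_brown(0.70, 2.0)   # also the split-half correction case
halved = spearman_brown(0.70, 0.5)
unchanged = spearman_brown(0.70, 1.0)

print(round(doubled, 3), round(halved, 3), round(unchanged, 3))
```

Doubling the test lifts the predicted reliability to about .82, while halving it drops the estimate to about .54. With k = 2 this is the same correction applied to a split-half correlation.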

Test Your Knowledge

A new anxiety scale produces nearly identical scores when the same clients retake it a week later, but its scores show no relationship to any established anxiety measure or to clinical diagnosis. The scale is best described as:

Test Your Knowledge

Which type of validity is established primarily through expert judgment that the test items adequately sample the entire content domain, rather than through a correlation coefficient?
