Validity, Reliability, IOA, Procedural Integrity, and Dosage
Key Takeaways
- Accuracy is closeness to the true value; reliability is consistency; validity is measuring the right behavior/dimension; they are independent.
- IOA is a reliability index, not validity; the method must match the data (total count, mean count-per-interval, exact, trial-by-trial, duration).
- Total count IOA = smaller/larger x 100; mean count-per-interval is more conservative because over- and under-counts in different intervals cannot cancel each other out.
- A common convention treats 80%+ IOA as minimally acceptable, with 90%+ preferred, though no threshold is universal.
- Check procedural integrity and dosage before concluding a flat intervention has failed.
Accuracy, Reliability, and Validity
Three measurement-quality concepts anchor this section. Accuracy is the extent to which observed values match the true value of the behavior (often established by an independent calibrated standard or a thorough "true" count). Reliability is the consistency of measurement: the same behavior measured repeatedly yields the same value. Validity asks whether you measured the right thing, the behavior and dimension that answer the question.
These are independent. A bathroom scale that always reads 5 pounds high is reliable but inaccurate. Observers can be highly reliable yet invalid if they share a flawed definition, measuring accidental contact as "aggression" identically every time. The exam's recurring lesson: high agreement never proves validity. Always confirm the measure captures the intended response class and dimension before trusting consistent data.
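The bathroom-scale example can be put into numbers. The sketch below is illustrative only: the true weight and the four readings are hypothetical values, and using mean error for accuracy and standard deviation for consistency is one reasonable way to separate the two ideas, not the only one.

```python
from statistics import mean, pstdev

# Hypothetical values chosen for illustration; not from any real data set.
TRUE_WEIGHT = 150.0                       # the "true value", in pounds
readings = [155.0, 155.0, 155.0, 155.0]   # a scale that always reads 5 lb high

bias = mean(readings) - TRUE_WEIGHT   # systematic error: an accuracy problem
spread = pstdev(readings)             # variation across repeats: a reliability problem

print(f"bias = {bias:+.1f} lb, spread = {spread:.1f} lb")
# prints: bias = +5.0 lb, spread = 0.0 lb
```

Zero spread with a nonzero bias is the "reliable but inaccurate" pattern. Neither number says anything about validity, which depends on whether weight was the right thing to measure in the first place.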
Interobserver Agreement (IOA): Types and a Worked Calculation
Interobserver agreement (IOA) is the degree to which two independent observers report the same values for the same events. It is a reliability index, not a validity index. The method must match the measurement system:
| IOA method | Used with | How it is computed |
|---|---|---|
| Total count IOA | Event/frequency data | (smaller count / larger count) x 100 |
| Mean count-per-interval IOA | Count data split into intervals | average of per-interval (smaller/larger) percentages |
| Exact (interval-by-interval) agreement | Interval data | intervals of exact agreement / total intervals x 100 |
| Trial-by-trial IOA | Discrete-trial / opportunity data | agreements / (agreements + disagreements) x 100 |
| Total duration IOA | Duration data | (smaller duration / larger duration) x 100 |
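The formulas in the table reduce to two small computations. The following is a minimal sketch; the function names `ratio_ioa` and `trial_by_trial_ioa` and the sample values are hypothetical, and treating a 0-versus-0 comparison as 100% agreement is an assumption you may want to handle differently.

```python
def ratio_ioa(a: float, b: float) -> float:
    """Total count or total duration IOA: (smaller / larger) x 100."""
    if a == b:
        return 100.0  # assumption: identical values, including 0 vs. 0, count as full agreement
    return min(a, b) / max(a, b) * 100


def trial_by_trial_ioa(obs_a: list, obs_b: list) -> float:
    """Agreements / (agreements + disagreements) x 100, scored trial by trial."""
    if len(obs_a) != len(obs_b) or not obs_a:
        raise ValueError("observers must score the same, non-empty set of trials")
    agreements = sum(1 for x, y in zip(obs_a, obs_b) if x == y)
    return agreements / len(obs_a) * 100


# Hypothetical data: counts of 8 vs. 10, and four discrete trials.
count_ioa = ratio_ioa(8, 10)
trial_ioa = trial_by_trial_ioa(["+", "-", "+", "+"], ["+", "-", "-", "+"])
print(round(count_ioa), round(trial_ioa))  # prints: 80 75
```

The same item-by-item comparison serves for exact agreement on interval data if each list holds one score per interval instead of one per trial.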
Worked example (mean count-per-interval). Two observers count a behavior across five 1-minute intervals:
| Interval | Obs A | Obs B | smaller/larger |
|---|---|---|---|
| 1 | 4 | 4 | 4/4 = 100% |
| 2 | 5 | 4 | 4/5 = 80% |
| 3 | 3 | 3 | 3/3 = 100% |
| 4 | 2 | 3 | 2/3 = 67% |
| 5 | 6 | 5 | 5/6 = 83% |
Mean count-per-interval IOA = (100 + 80 + 100 + 67 + 83) / 5 = 86%. Total count IOA on the same data (totals 20 vs. 19) would be 19/20 = 95%, which is higher because Observer A's extra counts in intervals 2 and 5 are partly offset by Observer B's extra count in interval 4 once the session is summed. This is exactly why mean count-per-interval is the more conservative, more sensitive method: similar totals can hide interval-level disagreement. A common convention treats 80%+ as the minimum acceptable IOA, with 90%+ preferred, though no single threshold is universal.
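The arithmetic in the worked example can be checked with a few lines. This is a sketch using the five-interval counts from the table; the helper name `interval_ratio` is hypothetical.

```python
# Counts per 1-minute interval, copied from the worked example.
obs_a = [4, 5, 3, 2, 6]
obs_b = [4, 4, 3, 3, 5]


def interval_ratio(a: int, b: int) -> float:
    """(smaller / larger) x 100, with equal counts treated as 100%."""
    return 100.0 if a == b else min(a, b) / max(a, b) * 100


# Mean count-per-interval: score each interval, then average the percentages.
per_interval = [interval_ratio(a, b) for a, b in zip(obs_a, obs_b)]
mean_cpi = sum(per_interval) / len(per_interval)

# Total count: compare only the session totals (20 vs. 19).
total_count = interval_ratio(sum(obs_a), sum(obs_b))

print(round(mean_cpi), round(total_count))  # prints: 86 95
```

The nine-point gap between the two results comes entirely from where the counts fall, not from how many there are.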
Procedural Integrity, Treatment Fidelity, and Dosage
Data quality is more than the dependent-variable line on a graph. Procedural integrity (treatment fidelity / treatment integrity) measures whether the independent variable, the intervention, was implemented as written. Low integrity threatens internal validity: if outcomes are flat but the plan was not run correctly, you cannot conclude the plan failed. Dosage measures the amount of exposure: minutes, sessions, opportunities, or trials delivered.
Use this decision chain when interpreting disappointing data:
- Observers disagree -> check the definition, training, scoring rules, and IOA method first.
- Data do not match the clinical question -> check validity of the measure and dimension.
- Intervention data are flat -> check procedural integrity and dosage before rejecting the plan.
- Change appears only on some days -> check setting events, schedule, observer coverage, and representativeness.
The defensible rule: if outcomes are poor but integrity is low, improve implementation and collect more data; if integrity and dosage are adequate and outcomes are still poor, modification is warranted. Do not let a strong IOA value distract from validity or integrity; observers can agree perfectly on the wrong response class while the intervention was never delivered as designed.
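The defensible rule can be written as a small decision function. This is a sketch under stated assumptions, not a clinical standard: the function name `next_step`, the 80% integrity cutoff, and the choice to treat inadequate dosage the same way as low integrity are all hypothetical.

```python
def next_step(outcomes_adequate: bool,
              integrity_pct: float,
              dosage_adequate: bool,
              integrity_min: float = 80.0) -> str:
    """Suggest a next step from outcome, integrity, and dosage information.

    integrity_min is a hypothetical cutoff; set it to whatever criterion
    your setting actually uses.
    """
    if outcomes_adequate:
        return "continue the plan and keep monitoring"
    # Poor outcomes: rule out implementation problems before blaming the plan.
    if integrity_pct < integrity_min or not dosage_adequate:
        return "improve implementation and collect more data"
    # Poor outcomes despite adequate integrity and dosage.
    return "modify the intervention"


print(next_step(outcomes_adequate=False, integrity_pct=40.0, dosage_adequate=True))
# prints: improve implementation and collect more data
```

Note that IOA does not appear among the inputs. Agreement between observers bears on whether the outcome data are trustworthy, which is a separate check from whether the plan was delivered.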
Threats to Accuracy and Choosing the Right IOA Method
Several predictable threats degrade accuracy and reliability, and the exam expects you to name and prevent them. Observer drift is the gradual, unintentional change in how an observer applies a definition over time; for example, two observers trained together slowly diverge. Observer reactivity occurs when being observed changes the observer's scoring (e.g., scoring more carefully when a supervisor is present).
Observer bias / expectancy is scoring that is nudged toward an anticipated result, which is why observers should be blind to phase or condition when feasible. Poorly designed datasheets and complex definitions also lower accuracy. The standard safeguards are clear definitions, thorough training, periodic recalibration, and routine IOA checks across the study, not just at the start.
Matching the IOA method to the data is itself a high-yield skill. The wrong method can inflate agreement and hide real disagreement:
| Data type | Preferred IOA | Why |
|---|---|---|
| Free-operant count | Mean count-per-interval (more sensitive) over total count | Opposite errors in different intervals cancel in total count |
| Interval recording | Exact agreement; or occurrence/nonoccurrence IOA for rare/dense behavior | Exact is most stringent |
| Discrete-trial | Trial-by-trial | Preserves opportunity-level agreement |
| Duration / latency | Total duration or mean duration-per-occurrence | Matches the timed dimension |
For interval data with very low-rate behavior, occurrence-only IOA (agreement only on intervals where at least one observer scored an occurrence) prevents agreement inflation from a long run of jointly empty intervals; for very high-rate behavior, nonoccurrence IOA does the parallel job.
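The inflation problem is easy to see with a low-rate example. The sketch below uses hypothetical data (ten intervals, one or two scored occurrences) and hypothetical function names. Returning `None` when no interval qualifies is an assumption about how to handle the undefined case.

```python
def interval_by_interval_ioa(obs_a, obs_b):
    """Intervals scored identically / total intervals x 100."""
    return sum(1 for a, b in zip(obs_a, obs_b) if a == b) / len(obs_a) * 100


def occurrence_ioa(obs_a, obs_b):
    """Agreement restricted to intervals where at least one observer scored an occurrence."""
    scored = [(a, b) for a, b in zip(obs_a, obs_b) if a or b]
    if not scored:
        return None  # assumption: undefined when neither observer scored anything
    return sum(1 for a, b in scored if a and b) / len(scored) * 100


def nonoccurrence_ioa(obs_a, obs_b):
    """Agreement restricted to intervals where at least one observer scored a nonoccurrence."""
    unscored = [(a, b) for a, b in zip(obs_a, obs_b) if not a or not b]
    if not unscored:
        return None
    return sum(1 for a, b in unscored if not a and not b) / len(unscored) * 100


# Hypothetical low-rate session: A scores intervals 3 and 7, B scores only interval 3.
obs_a = [False, False, True, False, False, False, True, False, False, False]
obs_b = [False, False, True, False, False, False, False, False, False, False]

overall = interval_by_interval_ioa(obs_a, obs_b)
occurrence = occurrence_ioa(obs_a, obs_b)
nonoccurrence = nonoccurrence_ioa(obs_a, obs_b)
print(round(overall), round(occurrence), round(nonoccurrence))  # prints: 90 50 89
```

With these data the eight jointly empty intervals carry the overall figure to 90%, while the observers agreed on only one of the two intervals in which anything was scored.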
Recognizing when total count IOA overstates agreement, and when occurrence/nonoccurrence agreement is the honest index, is exactly the nuance Domain C items probe. The unifying principle: report the agreement statistic that is most stringent and most informative for the measurement system you actually used, then interpret it as a check on reliability, never as evidence of validity.
Two observers record event data for a session. Observer A counts 24 responses and Observer B counts 21. Using total count IOA, what is the agreement?
A scale used to verify a behavior count consistently reads exactly the same value every trial but that value is always 3 responses higher than the true count. How should this be described?
An intervention has produced no improvement over two weeks. A fidelity check shows staff implemented the plan correctly on only 40% of opportunities. What is the most defensible next step?
For discrete-trial data where each trial is scored correct/incorrect, which IOA method is most appropriate?