Data Quality, Integrity, Access, and Taxonomy
Key Takeaways
- AML systems are only as good as their data: garbage in produces both missed alerts (false negatives) and noise (false positives).
- Data quality has measurable dimensions — completeness, accuracy, consistency, timeliness, uniqueness, and validity.
- A common taxonomy and a single source of truth (golden record) prevent duplicate customers and fragmented monitoring.
- Access controls and data lineage protect confidentiality and create the audit trail regulators and auditors expect.
Every sanctions filter, transaction-monitoring scenario, and risk model consumes data. If that data is wrong, incomplete, or fragmented, the control fails silently — the system reports 'clear' while the risk is real. CAMS frames this as the principle that data quality is a control, not a back-office hygiene task.
The six data-quality dimensions
| Dimension | Definition | AML failure if missing |
|---|---|---|
| Completeness | All required fields populated | Beneficial owner blank → screening gap |
| Accuracy | Values reflect reality | Wrong date of birth → sanctions match missed |
| Consistency | Same value across systems | Two spellings of one name → fragmented monitoring |
| Timeliness | Data current and refreshed | Stale address → outdated geographic risk |
| Uniqueness | One record per real entity | Duplicate customers → activity split below thresholds |
| Validity | Conforms to format/rules | Free-text country field → screening can't parse |
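Several of these dimensions can be tested mechanically at the record level. The sketch below is illustrative only: the field names, the three-year review cycle, and the tiny country list are assumptions, not part of any CAMS material. Accuracy and consistency are left out because they need an external source or a second system to compare against.

```python
from datetime import date

# Hypothetical customer records; field names are illustrative only.
customers = [
    {"id": "C1", "name": "Maria Garcia", "country": "ES",
     "beneficial_owner": "Maria Garcia", "last_reviewed": date(2024, 1, 10)},
    {"id": "C2", "name": "Acme Trading", "country": "Spain",
     "beneficial_owner": "", "last_reviewed": date(2019, 6, 1)},
    {"id": "C1", "name": "Maria Garcia", "country": "ES",
     "beneficial_owner": "Maria Garcia", "last_reviewed": date(2024, 1, 10)},
]

VALID_COUNTRIES = {"ES", "US", "GB"}  # stand-in for a full ISO 3166-1 alpha-2 list
REQUIRED = ("name", "country", "beneficial_owner")
MAX_AGE_DAYS = 365 * 3  # assumed periodic-review cycle


def check_record(rec, as_of):
    """Return the data-quality dimensions this record fails."""
    issues = []
    if any(not rec.get(f) for f in REQUIRED):
        issues.append("completeness")
    if rec.get("country") not in VALID_COUNTRIES:
        issues.append("validity")
    if (as_of - rec["last_reviewed"]).days > MAX_AGE_DAYS:
        issues.append("timeliness")
    return issues


def duplicate_ids(records):
    """Uniqueness check: IDs that appear on more than one record."""
    seen, dups = set(), set()
    for rec in records:
        if rec["id"] in seen:
            dups.add(rec["id"])
        seen.add(rec["id"])
    return dups


AS_OF = date(2024, 6, 1)
report = [check_record(r, AS_OF) for r in customers]
```

Here the second record fails three dimensions at once (blank beneficial owner, free-text country, stale review date), and the repeated ID is flagged as a uniqueness defect.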
Worked example. A customer is onboarded twice, once as 'Maria Garcia' and once as 'M. Garcia-Lopez', because of a uniqueness failure. Each record shows $4,000 in monthly wires, below a $5,000 monitoring threshold, so no alert fires. Aggregated, the real customer moves $8,000 monthly, which would have alerted: activity that looks like structuring, hidden by poor data. The monitoring rule worked as designed; the data feeding it did not. The exam answer is entity resolution / deduplication into a golden record, not a new monitoring rule.
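The arithmetic of the worked example can be sketched as follows. The record IDs and the golden-record mapping are hypothetical, and the entity-resolution step itself is assumed rather than implemented; the point is only that the same rule gives a different answer once activity is aggregated per real entity.

```python
from collections import defaultdict

THRESHOLD = 5_000  # monthly monitoring threshold from the worked example

# Two customer records for one person; the record IDs are hypothetical.
monthly_wires = {"CUST-001": 4_000, "CUST-002": 4_000}

# Assumed output of an entity-resolution step (not computed here):
# both records resolve to the same golden-record ID.
golden_id = {"CUST-001": "G-100", "CUST-002": "G-100"}


def alerts(totals, threshold):
    """Return the keys whose monthly total exceeds the threshold."""
    return sorted(k for k, v in totals.items() if v > threshold)


# Monitoring per fragmented record: nothing crosses the threshold.
per_record_alerts = alerts(monthly_wires, THRESHOLD)

# Monitoring per golden record: the split activity is summed first.
per_entity = defaultdict(int)
for record_id, amount in monthly_wires.items():
    per_entity[golden_id[record_id]] += amount

per_entity_alerts = alerts(per_entity, THRESHOLD)
```

The threshold and the rule are identical in both passes; only the key the data is grouped by changes.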
Taxonomy and the golden record
A data taxonomy is the agreed classification of customers, products, jurisdictions, and risk categories. Without it, one team's 'PEP' is another's 'high-profile client', and reporting cannot roll up. The golden record (master Customer Information File) is the single, reconciled source of truth that screening and monitoring read from. Reference data — sanctions lists, country-risk ratings, high-risk-business codes — must be versioned so an alert can be reproduced against the list as it stood that day.
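One way to picture a taxonomy is as a mapping from each team's local label to a single canonical code, so that reports can roll up. The labels and the `PEP` code below are illustrative assumptions; a real taxonomy would be owned, documented, and far larger.

```python
from collections import Counter

# Hypothetical mapping from team-specific labels to one canonical taxonomy code.
CANONICAL = {
    "pep": "PEP",
    "politically exposed person": "PEP",
    "high-profile client": "PEP",
}


def to_canonical(label):
    """Translate a local label to the agreed code; unmapped labels are rejected."""
    key = label.strip().lower()
    if key not in CANONICAL:
        raise ValueError(f"unmapped label: {label!r}")
    return CANONICAL[key]


# Labels as two different teams might record the same risk category.
team_labels = ["PEP", "High-profile client"]
rollup = Counter(to_canonical(label) for label in team_labels)
```

Rejecting unmapped labels, rather than passing them through, is a design choice that surfaces taxonomy gaps instead of hiding them in reporting.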
Integrity, access, and lineage
- Integrity: data must not be altered improperly; changes are logged with who/when/why.
- Access control: apply least-privilege and need-to-know. SAR data and investigation files are highly restricted — over-broad access risks tipping off and breaches of confidentiality.
- Data lineage: the traceable path from source system → transformation → alert. Examiners and auditors expect you to reproduce why an alert fired and prove the underlying data.
- Retention: records (CDD, transactions, SARs) are generally retained for at least five years, the baseline under both the U.S. BSA and FATF Recommendation 11; the exam expects you to know that retention enables reconstruction of transactions.
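The access and integrity points above can be combined in a small sketch: every read is checked against role grants and logged with who, when, and why, whether or not it is allowed. The role names, resource names, and grant table are hypothetical and stand in for a real entitlement system.

```python
from datetime import datetime, timezone

# Hypothetical role-to-resource grants; a real entitlement model would be richer.
GRANTS = {
    "fiu_investigator": {"customer_profile", "transactions", "sar_case"},
    "relationship_manager": {"customer_profile", "transactions"},
}

audit_log = []  # in practice this would be an append-only, protected store


def can_access(role, resource):
    """Least privilege: access exists only if explicitly granted."""
    return resource in GRANTS.get(role, set())


def read_resource(user, role, resource, reason):
    """Log the attempt (who/when/why), then allow or refuse it."""
    allowed = can_access(role, resource)
    audit_log.append({
        "who": user,
        "when": datetime.now(timezone.utc).isoformat(),
        "why": reason,
        "resource": resource,
        "allowed": allowed,
    })
    if not allowed:
        raise PermissionError(f"{role} may not read {resource}")
    return f"<{resource} contents>"
```

Logging denied attempts as well as successful ones matters here: an attempt to open a SAR case by someone without a need to know is itself something a reviewer would want to see.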
Common traps
First, do not confuse false negatives (a real risk the data hid) with false positives (noise from over-matching). Poor data drives both. Second, fuzzy matching tuned too loosely floods investigators, while tuned too tightly it misses sanctioned parties; tuning is a documented, governed decision. Third, adding more rules to a system fed by bad data multiplies the noise; the disciplined fix is upstream data remediation. Fourth, broad data access is not 'efficiency': uncontrolled access to SAR and investigation data is a reportable confidentiality failure.
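The fuzzy-matching trade-off can be seen with Python's standard-library `difflib.SequenceMatcher`, used here only as a stand-in; production screening engines use their own matching algorithms. All names are invented, and the example assumes 'Ivan Petrof' is a misspelling of the listed party while 'Ivan Petrenko' is a different person.

```python
from difflib import SequenceMatcher

LISTED_NAME = "Ivan Petrov"  # invented watchlist entry

# Invented customer names; which ones are "true" matches is an assumption.
CUSTOMERS = ["Ivan Petrof", "Ivan Petrenko", "John Smith"]


def similarity(a, b):
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def screen(customers, listed, threshold):
    """Return customers whose similarity to the listed name meets the threshold."""
    return [c for c in customers if similarity(c, listed) >= threshold]


loose = screen(CUSTOMERS, LISTED_NAME, 0.80)     # also catches the different person
balanced = screen(CUSTOMERS, LISTED_NAME, 0.85)  # catches only the misspelling
tight = screen(CUSTOMERS, LISTED_NAME, 0.95)     # misses the misspelling entirely
```

With these inputs the misspelling scores about 0.91 and the different person about 0.83, so the loose setting produces a false positive and the tight setting produces a false negative. The specific thresholds are arbitrary; the point is that each choice has a cost that should be documented.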
The CAMS-correct answer treats data as a governed asset with owners, quality metrics, lineage, and least-privilege access.
Data governance roles and metrics
Good data does not happen by accident; it is governed. A data owner (a business executive) is accountable for the meaning and quality of a data domain, while a data steward operationally maintains it. Data quality metrics — completeness rates, match rates, exception counts — are tracked and reported like any other control metric, with thresholds and escalation. When a regulator asks 'how do you know your customer data is accurate?' the credible answer is a measured, governed program, not an assertion. CAMS scenario items reward identifying the owner and the metric, not just naming the defect.
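A metric with a named owner, a threshold, and an escalation rule might look like the sketch below. The owner title, the 98% and 90% thresholds, and the sample records are assumptions made for illustration.

```python
# Hypothetical metric definition: who owns it and when it escalates.
METRIC = {
    "name": "beneficial_owner_completeness",
    "owner": "Head of Retail Onboarding",  # assumed data owner
    "warn_below": 0.98,
    "escalate_below": 0.90,
}

# Illustrative sample: three of four records have a beneficial owner.
records = [
    {"id": 1, "beneficial_owner": "A. Jones"},
    {"id": 2, "beneficial_owner": ""},
    {"id": 3, "beneficial_owner": "B. Lee"},
    {"id": 4, "beneficial_owner": "C. Diaz"},
]


def completeness_rate(rows, field):
    """Share of rows where the field is populated."""
    populated = sum(1 for r in rows if r.get(field) not in (None, ""))
    return populated / len(rows)


def evaluate(value, metric):
    """Map a metric value to a status using the metric's thresholds."""
    if value < metric["escalate_below"]:
        return "escalate"
    if value < metric["warn_below"]:
        return "warn"
    return "ok"


rate = completeness_rate(records, "beneficial_owner")
status = evaluate(rate, METRIC)
```

Tying the owner to the metric definition mirrors the exam point: the answer names both who is accountable and what is measured.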
Reference data and reproducibility
Screening and monitoring depend on reference data: sanctions lists, PEP databases, high-risk country ratings, and high-risk-business codes. This data changes constantly, so it must be versioned and dated. If an alert fired on 1 March against the OFAC list as it stood that day, an investigator must be able to reproduce the match against that exact list version months later. Loading a new list silently over the old one destroys reproducibility and undermines an examiner's ability to test the control — a frequently overlooked defect.
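A minimal sketch of versioned reference data: each list load is kept as a dated snapshot, and a lookup returns the version in effect on a given day. The snapshot dates and entity names are invented, and real list versioning would also record the source file and load time.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical list snapshots, sorted by effective date; names are invented.
snapshots = [
    (date(2024, 2, 1), frozenset({"ALPHA SHIPPING LTD"})),
    (date(2024, 3, 15), frozenset({"ALPHA SHIPPING LTD", "BETA HOLDINGS SA"})),
]


def list_as_of(versions, when):
    """Return (effective_date, entries) for the version in force on `when`."""
    dates = [effective for effective, _ in versions]
    idx = bisect_right(dates, when)
    if idx == 0:
        raise LookupError("no list version in effect on that date")
    return versions[idx - 1]


# Reproduce screening as it stood on 1 March, before the later load.
effective, entries = list_as_of(snapshots, date(2024, 3, 1))
```

Because the 15 March load was added as a new snapshot rather than written over the old one, an alert (or a non-alert) from 1 March can still be explained against the list that actually applied.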
Structured versus unstructured data
Monitoring systems handle structured data (amounts, dates, account numbers in fixed fields) well, but much AML-relevant information arrives as unstructured data — free-text payment messages (the SWIFT MT103 remittance field), adverse-media articles, and scanned documents. A payment whose structured fields look benign may carry a sanctioned port name or a shell-company reference buried in free text. Controls increasingly parse unstructured fields, but free-text entry at onboarding (a country typed instead of selected from a list) is a validity defect that defeats automated screening.
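As a rough illustration of parsing free text, the sketch below normalizes a remittance-style message and looks for watch terms as whole phrases. The terms and the message are invented, and simple phrase matching is an assumption made for brevity; real controls typically add fuzzy matching and entity extraction.

```python
import re

# Invented watch terms standing in for port names or shell-company references.
WATCH_TERMS = {"port example", "shellco holdings"}


def normalize(text):
    """Lowercase and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def free_text_hits(message, terms):
    """Return the watch terms that appear as whole phrases in the message."""
    padded = f" {normalize(message)} "
    return sorted(t for t in terms if f" {normalize(t)} " in padded)


message = "INV 2291 / goods via PORT-EXAMPLE, consignee ShellCo Holdings"
hits = free_text_hits(message, WATCH_TERMS)
```

Normalizing both sides before comparing is what lets 'PORT-EXAMPLE' match 'port example'; without it, punctuation and case differences in free text would hide the reference.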
A worked data-lineage scenario
An examiner asks why a $50,000 cash deposit never alerted. Investigation shows the branch system recorded it as a check, not cash, so the cash-monitoring scenarios never evaluated it. The defect is accuracy at the source, and the fix is upstream input controls plus reconciliation, not a new scenario. Data lineage is what lets the institution trace the alert (or its absence) back to the originating field and prove the root cause.
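The scenario can be sketched with a transaction that carries its own lineage trail, so the reason an alert did not fire is recorded rather than reconstructed. The class, the field names, and the $10,000 scenario threshold are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Txn:
    txn_id: str
    amount: int
    instrument: str  # as recorded by the branch system
    lineage: list = field(default_factory=list)


def ingest(raw):
    """Load a raw branch record and note where each value came from."""
    txn = Txn(raw["id"], raw["amount"], raw["instrument"])
    txn.lineage.append(f"source=branch_system instrument={raw['instrument']}")
    return txn


def cash_scenario(txn, threshold=10_000):
    """Hypothetical cash scenario: only evaluates transactions coded as cash."""
    if txn.instrument != "cash":
        txn.lineage.append("cash_scenario=skipped (instrument is not cash)")
        return False
    fired = txn.amount >= threshold
    txn.lineage.append(f"cash_scenario={'alert' if fired else 'no_alert'}")
    return fired


# The deposit as miscoded at the branch, and as it should have been coded.
miscoded = ingest({"id": "T1", "amount": 50_000, "instrument": "check"})
correct = ingest({"id": "T2", "amount": 50_000, "instrument": "cash"})
miscoded_alert = cash_scenario(miscoded)
correct_alert = cash_scenario(correct)
```

Reading `miscoded.lineage` shows the scenario was skipped because of the instrument code, which points the investigation at the branch input rather than at the scenario logic.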
The exam-correct conclusion: when controls miss, interrogate the data feeding them before adding rules, because new rules built on bad data simply multiply false positives while leaving the true risk invisible.
A customer appears twice in the system under slightly different names, splitting their activity below monitoring thresholds so no alert fires. Which control most directly fixes this?
Tuning fuzzy matching too tightly in a sanctions filter most directly increases the risk of which outcome?
Why is least-privilege access especially important for SAR and investigation data?