Data Quality, Integrity, Access, and Taxonomy

Key Takeaways

  • AML systems are only as good as their data: garbage-in produces missed alerts (false negatives) and noise (false positives).
  • Data quality has measurable dimensions — completeness, accuracy, consistency, timeliness, uniqueness, and validity.
  • A common taxonomy and a single source of truth (golden record) prevent duplicate customers and fragmented monitoring.
  • Access controls and data lineage protect confidentiality and create the audit trail regulators and auditors expect.
Last updated: June 2026

Every sanctions filter, transaction-monitoring scenario, and risk model consumes data. If that data is wrong, incomplete, or fragmented, the control fails silently — the system reports 'clear' while the risk is real. CAMS frames this as the principle that data quality is a control, not a back-office hygiene task.

The six data-quality dimensions

Dimension    | Definition                   | AML failure if missing
-------------|------------------------------|--------------------------------------------------------
Completeness | All required fields populated | Beneficial owner blank → screening gap
Accuracy     | Values reflect reality        | Wrong date of birth → sanctions match missed
Consistency  | Same value across systems     | Two spellings of one name → fragmented monitoring
Timeliness   | Data current and refreshed    | Stale address → outdated geographic risk
Uniqueness   | One record per real entity    | Duplicate customers → activity split below thresholds
Validity     | Conforms to format/rules      | Free-text country field → screening can't parse
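
Several of these dimensions can be checked automatically on each record. The following is a minimal sketch, assuming a hypothetical customer record held as a dictionary. The field names, the short list of country codes, and the 365-day refresh window are illustrative assumptions, not a prescribed standard.

```python
from datetime import date

# Hypothetical required fields and allowed country codes (illustrative only).
REQUIRED_FIELDS = ("name", "date_of_birth", "country", "beneficial_owner")
VALID_COUNTRIES = {"US", "GB", "DE", "MX"}
MAX_AGE_DAYS = 365  # assumed refresh window for the timeliness check


def quality_defects(record: dict, today: date) -> list[str]:
    """Return the data-quality dimensions this record fails."""
    defects = []
    # Completeness: every required field must be present and non-blank.
    if any(not str(record.get(f) or "").strip() for f in REQUIRED_FIELDS):
        defects.append("completeness")
    # Validity: country must be a code from the controlled list.
    if record.get("country") not in VALID_COUNTRIES:
        defects.append("validity")
    # Timeliness: the record must have been reviewed within the window.
    last_reviewed = record.get("last_reviewed")
    if last_reviewed is None or (today - last_reviewed).days > MAX_AGE_DAYS:
        defects.append("timeliness")
    return defects


record = {
    "name": "Maria Garcia",
    "date_of_birth": "1980-04-02",
    "country": "Mexico",         # free text instead of a code
    "beneficial_owner": "",      # blank
    "last_reviewed": date(2024, 1, 15),
}
print(quality_defects(record, today=date(2026, 6, 1)))
# → ['completeness', 'validity', 'timeliness']
```

Accuracy, consistency, and uniqueness are not covered here because they need a second source to compare against, such as an identity document or another system's copy of the record.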

Worked example. A uniqueness failure lets a customer be onboarded twice, once as 'Maria Garcia' and once as 'M. Garcia-Lopez'. Each record shows $4,000 in monthly wires, below a $5,000 monitoring threshold, so no alert fires. Aggregated, the real customer moves $8,000 monthly: activity that looks like classic structuring, hidden by poor data. The monitoring rule worked as designed; the data feeding it did not. The exam answer is entity resolution / deduplication into a golden record, not a new monitoring rule.
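
The arithmetic in the worked example can be sketched in a few lines. This is an illustration only: the date of birth and ID number are hypothetical fields added so the two records can be linked, and real entity resolution typically weighs many more attributes than an exact key match.

```python
from collections import defaultdict

THRESHOLD = 5_000  # monthly wire threshold from the worked example

# Two records for one real customer. The shared date of birth and ID number
# are hypothetical fields added here so the duplicates can be linked.
records = [
    {"name": "Maria Garcia", "dob": "1980-04-02", "id_number": "X123", "monthly_wires": 4_000},
    {"name": "M. Garcia-Lopez", "dob": "1980-04-02", "id_number": "X123", "monthly_wires": 4_000},
]


def aggregate_by_entity(rows):
    """Sum monthly wires per resolved entity, keyed on (dob, id_number)."""
    totals = defaultdict(int)
    for row in rows:
        totals[(row["dob"], row["id_number"])] += row["monthly_wires"]
    return dict(totals)


# Per-record view: what the monitoring system sees with duplicates in place.
per_record_alerts = [r["name"] for r in records if r["monthly_wires"] >= THRESHOLD]

# Golden-record view: activity rolled up to the resolved entity.
entity_totals = aggregate_by_entity(records)
entity_alerts = [key for key, total in entity_totals.items() if total >= THRESHOLD]

print(per_record_alerts)  # → []
print(entity_totals)      # → {('1980-04-02', 'X123'): 8000}
print(entity_alerts)      # → [('1980-04-02', 'X123')]
```

The threshold never changes between the two views; only the grouping does, which is why the fix sits in the data rather than in the rule.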

Taxonomy and the golden record

A data taxonomy is the agreed classification of customers, products, jurisdictions, and risk categories. Without it, one team's 'PEP' is another's 'high-profile client', and reporting cannot roll up. The golden record (master Customer Information File) is the single, reconciled source of truth that screening and monitoring read from. Reference data — sanctions lists, country-risk ratings, high-risk-business codes — must be versioned so an alert can be reproduced against the list as it stood that day.

Integrity, access, and lineage

  • Integrity: data must not be altered improperly; changes are logged with who/when/why.
  • Access control: apply least-privilege and need-to-know. SAR data and investigation files are highly restricted — over-broad access risks tipping off and breaches of confidentiality.
  • Data lineage: the traceable path from source system → transformation → alert. Examiners and auditors expect you to reproduce why an alert fired and prove the underlying data.
  • Retention: records (CDD, transactions, SARs) are generally retained for at least five years under the BSA and FATF-aligned regimes; the exam expects you to know that retention enables reconstruction of transactions.
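
The integrity and access-control points above can be illustrated together. The sketch below assumes a hypothetical role-to-permission mapping and an in-memory log; an actual deployment would rely on the institution's access-management system and tamper-resistant storage.

```python
from datetime import datetime, timezone

# Hypothetical role-to-permission mapping; real entitlements would come
# from the institution's access-management system.
ROLE_PERMISSIONS = {
    "investigator": {"read_sar", "read_customer"},
    "relationship_manager": {"read_customer"},
}

audit_log = []  # append-only record of who / when / why


def request_access(user: str, role: str, permission: str, reason: str) -> bool:
    """Grant access only if the role holds the permission; log every attempt."""
    granted = permission in ROLE_PERMISSIONS.get(role, set())
    audit_log.append({
        "who": user,
        "when": datetime.now(timezone.utc).isoformat(),
        "why": reason,
        "permission": permission,
        "granted": granted,
    })
    return granted


print(request_access("alice", "investigator", "read_sar", "case review"))       # → True
print(request_access("bob", "relationship_manager", "read_sar", "client query"))  # → False
print(len(audit_log))  # → 2
```

Denied attempts are logged as well as granted ones, since an unusual pattern of refused requests for SAR data is itself worth reviewing.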

Common traps

First, do not confuse false negatives (a real risk the data hid) with false positives (noise from over-matching). Poor data drives both. Second, fuzzy matching tuned too loosely floods investigators with false positives, while fuzzy matching tuned too tightly misses sanctioned parties; tuning is therefore a documented, governed decision. Third, adding more rules to a system fed by bad data multiplies the noise; the disciplined fix is upstream data remediation. Fourth, broad data access is not 'efficiency': uncontrolled access to SAR and investigation data is a reportable confidentiality failure.
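
The tuning trade-off in the second trap can be seen with a toy similarity score. This sketch uses Python's standard `difflib.SequenceMatcher` as a stand-in for a purpose-built name-matching engine; the names and thresholds are invented for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def screen(names, listed_name, threshold):
    """Return the names whose similarity to the listed name meets the threshold."""
    return [n for n in names if similarity(n, listed_name) >= threshold]


listed = "Ivan Petrov"  # invented listed party
customers = ["Ivan Petrov", "Ivan Petrof", "Irene Peters", "John Smith"]

# Lower thresholds return more candidates; higher thresholds return fewer.
for threshold in (0.5, 0.8, 1.0):
    print(threshold, screen(customers, listed, threshold))
```

At a threshold of 1.0 only the exact spelling matches, so the one-letter variant 'Ivan Petrof' is missed. Lowering the threshold recovers it at the cost of more candidates to review.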

The CAMS-correct answer treats data as a governed asset with owners, quality metrics, lineage, and least-privilege access.

Data governance roles and metrics

Good data does not happen by accident; it is governed. A data owner (a business executive) is accountable for the meaning and quality of a data domain, while a data steward operationally maintains it. Data quality metrics — completeness rates, match rates, exception counts — are tracked and reported like any other control metric, with thresholds and escalation. When a regulator asks 'how do you know your customer data is accurate?' the credible answer is a measured, governed program, not an assertion. CAMS scenario items reward identifying the owner and the metric, not just naming the defect.
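
A tracked metric with a threshold might look like the following minimal sketch. The field names, sample records, and threshold values are hypothetical; in practice the data owner would set the thresholds and the steward would report against them.

```python
def completeness_rate(records, field):
    """Share of records in which `field` is populated (non-blank)."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if str(r.get(field) or "").strip())
    return filled / len(records)


# Hypothetical minimum completeness rates agreed by the data owner.
THRESHOLDS = {"date_of_birth": 0.99, "beneficial_owner": 0.95}

customers = [
    {"date_of_birth": "1980-04-02", "beneficial_owner": "A. Lopez"},
    {"date_of_birth": "1975-11-30", "beneficial_owner": ""},
    {"date_of_birth": "1990-01-01", "beneficial_owner": "B. Chen"},
    {"date_of_birth": "1968-07-19", "beneficial_owner": None},
]


def breaches(records, thresholds):
    """Return fields whose completeness rate falls below its threshold."""
    report = {}
    for field, minimum in thresholds.items():
        rate = completeness_rate(records, field)
        if rate < minimum:
            report[field] = rate
    return report


print(breaches(customers, THRESHOLDS))  # → {'beneficial_owner': 0.5}
```

A non-empty result is the trigger for escalation to the data owner, which is what turns a measurement into a control.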

Reference data and reproducibility

Screening and monitoring depend on reference data: sanctions lists, PEP databases, high-risk country ratings, and high-risk-business codes. This data changes constantly, so it must be versioned and dated. If an alert fired on 1 March against the OFAC list as it stood that day, an investigator must be able to reproduce the match against that exact list version months later. Loading a new list silently over the old one destroys reproducibility and undermines an examiner's ability to test the control — a frequently overlooked defect.
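
One way to keep list versions reproducible is to store each version with its effective date and look up the version in force on the alert date. The sketch below assumes that approach; the dates and entries are invented and do not reflect any real sanctions list.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical list versions, sorted by effective date. Entries are invented.
LIST_VERSIONS = [
    (date(2026, 1, 10), frozenset({"ACME TRADING"})),
    (date(2026, 2, 20), frozenset({"ACME TRADING", "NORTHSTAR SHIPPING"})),
    (date(2026, 4, 5), frozenset({"NORTHSTAR SHIPPING"})),
]


def list_as_of(day: date) -> frozenset:
    """Return the list version that was in force on the given day."""
    effective_dates = [d for d, _ in LIST_VERSIONS]
    idx = bisect_right(effective_dates, day)
    if idx == 0:
        raise LookupError(f"no list version in force on {day}")
    return LIST_VERSIONS[idx - 1][1]


# Reproduce a 1 March match months later, against the list as it stood then.
print("ACME TRADING" in list_as_of(date(2026, 3, 1)))  # → True
# The current version no longer contains the name.
print("ACME TRADING" in list_as_of(date(2026, 6, 1)))  # → False
```

Because old versions are kept rather than overwritten, the March alert remains explainable even after the name has left the list.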

Structured versus unstructured data

Monitoring systems handle structured data (amounts, dates, account numbers in fixed fields) well, but much AML-relevant information arrives as unstructured data: free-text payment message fields (for example, the remittance information field of a SWIFT MT103), adverse-media articles, and scanned documents. A payment whose structured fields look benign may carry a sanctioned port name or a shell-company reference buried in free text. Controls increasingly parse unstructured fields, but free-text entry at onboarding (a country typed instead of selected from a list) is a validity defect that defeats automated screening.
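
A simple version of free-text parsing is a term scan over the remittance text. The sketch below is illustrative only: the watch terms and the payment are invented, and production screening would use fuzzy matching and maintained lists rather than exact phrases.

```python
import re

# Invented watch terms for illustration; not taken from any real list.
WATCH_TERMS = ["Port Example", "Shellco Holdings"]


def free_text_hits(text: str, terms: list[str]) -> list[str]:
    """Return the watch terms found in the text, ignoring case and extra spaces."""
    normalized = re.sub(r"\s+", " ", text).lower()
    hits = []
    for term in terms:
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        if re.search(pattern, normalized):
            hits.append(term)
    return hits


# Structured fields look unremarkable; the risk sits in the free text.
payment = {
    "amount": 12_500,
    "currency": "USD",
    "remittance_info": "INV 2291  goods via PORT  EXAMPLE c/o shellco holdings ltd",
}
print(free_text_hits(payment["remittance_info"], WATCH_TERMS))
# → ['Port Example', 'Shellco Holdings']
```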

A worked data-lineage scenario

An examiner asks why a $50,000 cash deposit never alerted. Investigation shows the branch system recorded it as a check, not cash, so the cash-structuring scenario never evaluated it. The defect is accuracy at the source, and the fix is upstream input controls plus reconciliation — not a new scenario. Data lineage is what lets the institution trace the alert (or its absence) back to the originating field and prove the root cause.
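
The reconciliation mentioned in this scenario can be sketched as a comparison between what the system coded as cash and an independent control total. The branch codes, amounts other than the $50,000 deposit, and the idea of a per-branch counted-cash total are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical branch transactions; the second was miscoded as a check.
transactions = [
    {"branch": "B01", "instrument": "cash", "amount": 3_000},
    {"branch": "B01", "instrument": "check", "amount": 50_000},  # really cash
    {"branch": "B02", "instrument": "cash", "amount": 7_500},
]

# Hypothetical independent control total: cash physically counted per branch.
counted_cash = {"B01": 53_000, "B02": 7_500}


def reconcile_cash(rows, control_totals):
    """Return, per branch, counted cash minus system-coded cash where they differ."""
    coded = defaultdict(int)
    for row in rows:
        if row["instrument"] == "cash":
            coded[row["branch"]] += row["amount"]
    return {
        branch: total - coded[branch]
        for branch, total in control_totals.items()
        if total != coded[branch]
    }


print(reconcile_cash(transactions, counted_cash))  # → {'B01': 50000}
```

The unexplained difference at one branch points investigators at the source field, which is the starting point for tracing lineage back to the miscoded deposit.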

The exam-correct conclusion: when controls miss, interrogate the data feeding them before adding rules, because new rules built on bad data simply multiply false positives while leaving the true risk invisible.

Test Your Knowledge

A customer appears twice in the system under slightly different names, splitting their activity below monitoring thresholds so no alert fires. Which control most directly fixes this?

Test Your Knowledge

Tuning fuzzy-matching too tightly in a sanctions filter most directly increases the risk of which outcome?

Test Your Knowledge

Why is least-privilege access especially important for SAR and investigation data?
