7.1 Data analysis & digital/computer forensics
Key Takeaways
- CAATs let examiners test 100% of a population; core tests include duplicate, gap, anomaly/outlier, join/correlation, and stratification analysis.
- Benford's Law expects the digit 1 to lead ~30.1% of the time; deviations flag possibly fabricated numbers, but the test is a screening tool, not proof.
- Digital forensics preserves electronically stored information (ESI); the golden rule is never to analyze the original media.
- Forensic imaging (bit-stream copy), hashing (MD5/SHA-256), write blockers, and chain of custody protect and prove evidence integrity.
- Metadata can reveal backdating and authorship; once litigation is reasonably anticipated, a legal hold prevents spoliation of ESI.
Turning Data Into Evidence
Modern fraud examinations rarely turn on a single smoking-gun document. Instead, the CFE mines large volumes of transactional data to surface the patterns, anomalies, and outliers that no manual review could realistically catch. Data analysis can be proactive (continuous monitoring designed to detect fraud early) or reactive (targeted testing launched once a specific allegation exists). Either way, the examiner must understand how the data is structured, verify its completeness and integrity, and document every step so the results survive challenge in court.
Core data analysis techniques
Computer-assisted audit techniques (CAATs) allow examiners to test 100 percent of a population rather than relying on a sample, which is critical because fraud is usually rare and deliberately hidden. Widely used tests include:
- Duplicate testing — identical invoice numbers, amounts, dates, or vendor bank accounts that reveal double payments or split transactions.
- Gap testing — missing check or invoice numbers in an otherwise sequential series, a classic sign of voided, hidden, or diverted items.
- Anomaly and outlier analysis — transactions falling outside expected ranges, such as payments deliberately structured just under an approval threshold.
- Join and correlation tests — matching employee addresses, phone numbers, or bank accounts against the vendor master file to expose shell-company and ghost-vendor schemes.
- Stratification and summarization — grouping data by amount, user, time, or location to spot concentrations of risk.
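The tests listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a substitute for a dedicated CAAT package: the records, field names (vendor, invoice, amount, check_no, bank_account), and function names are all hypothetical, and a real engagement would load a reconciled extract and use exact decimal amounts rather than floats.

```python
from collections import Counter, defaultdict

# Hypothetical sample records; every field name here is an assumption.
payments = [
    {"vendor": "V001", "invoice": "INV-1001", "amount": 4950.00, "check_no": 501},
    {"vendor": "V001", "invoice": "INV-1001", "amount": 4950.00, "check_no": 502},
    {"vendor": "V002", "invoice": "INV-2040", "amount": 1200.00, "check_no": 503},
    {"vendor": "V003", "invoice": "INV-3300", "amount": 9990.00, "check_no": 506},
]
employees = [{"name": "A. Example", "bank_account": "111-222"}]
vendors = [
    {"vendor": "V002", "bank_account": "999-000"},
    {"vendor": "V003", "bank_account": "111-222"},
]


def find_duplicates(rows, keys):
    """Duplicate test: groups of rows sharing the same value for every key."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    return {key: grp for key, grp in groups.items() if len(grp) > 1}


def find_gaps(numbers):
    """Gap test: numbers missing between the lowest and highest in a series."""
    present = set(numbers)
    return [n for n in range(min(present), max(present) + 1) if n not in present]


def just_under(rows, threshold, margin):
    """Outlier test: amounts within `margin` below an approval threshold."""
    return [r for r in rows if threshold - margin <= r["amount"] < threshold]


def match_on(left, right, key):
    """Join test: pair each left row with every right row sharing `key`."""
    index = defaultdict(list)
    for r in right:
        index[r[key]].append(r)
    return [(l, r) for l in left for r in index.get(l[key], [])]


def stratify(rows, upper_bounds):
    """Stratification: count rows per amount band (sorted upper bounds)."""
    counts = Counter()
    for r in rows:
        label = next(
            (f"<= {b}" for b in upper_bounds if r["amount"] <= b),
            f"> {upper_bounds[-1]}",
        )
        counts[label] += 1
    return counts


print(find_duplicates(payments, ["vendor", "invoice", "amount"]))
print(find_gaps([p["check_no"] for p in payments]))
print(just_under(payments, threshold=10000, margin=100))
print(match_on(employees, vendors, "bank_account"))
print(stratify(payments, [1000, 5000]))
```

Each function returns the items to review rather than a verdict. A hit is a lead to confirm against source documents, since legitimate explanations (a reissued check, a family member's business) are common.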
Benford's Law
Benford's Law — the "first-digit law" — observes that in many naturally occurring sets of numbers the leading digit is not uniformly distributed. The digit 1 appears first about 30.1 percent of the time, 2 about 17.6 percent, and the frequencies decline steadily to roughly 4.6 percent for 9. When people fabricate figures, they tend either to spread the digits too evenly or to cluster them around psychologically comfortable numbers, so the observed distribution deviates from Benford's expected curve. A CFE applies Benford's analysis to the first digit, the first two digits, or the last digits of large data sets — invoice amounts, expense reimbursements, or reported revenues — to flag populations that deserve a closer look.
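The percentages quoted above follow from the standard Benford formula, in which the expected share of leading digit d is log10(1 + 1/d). The formula is not stated in this section, so treat the short sketch below as a supplementary check; the function name is illustrative.

```python
import math


def benford_expected(d):
    """Expected probability that digit d (1 through 9) is the leading digit."""
    return math.log10(1 + 1 / d)


# Print the expected first-digit distribution as percentages.
for d in range(1, 10):
    print(d, f"{benford_expected(d):.1%}")
```

The nine probabilities sum to 1, and the first, second, and last values round to the 30.1, 17.6, and 4.6 percent figures cited in the text.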
Benford's Law is a screening tool, not proof of fraud: it tells the examiner where to look, not who is guilty. Its assumptions break down for assigned or bounded numbers (ZIP codes, sequential invoice IDs, prices capped at a limit) and for small samples, so results must be interpreted with judgment and confirmed by direct examination of the flagged items.
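A first-digit screen along these lines might look like the sketch below. It compares observed and expected first-digit frequencies and summarizes the gap as a mean absolute deviation (one common summary statistic, chosen here for simplicity). The minimum sample size is an assumed placeholder, not an authoritative cutoff, and the function names are hypothetical.

```python
import math
from collections import Counter
from decimal import Decimal

MIN_SAMPLE = 300  # assumed for illustration only; choose a cutoff with judgment


def leading_digit(value):
    """Return the first nonzero digit of a number, or None if there is none."""
    digits = Decimal(str(value)).as_tuple().digits
    return next((d for d in digits if d != 0), None)


def benford_screen(amounts):
    """Return per-digit (digit, observed, expected) rows and their mean absolute deviation."""
    digits = [d for d in map(leading_digit, amounts) if d is not None]
    n = len(digits)
    if n < MIN_SAMPLE:
        # Small samples are one of the cases where Benford's Law is unreliable.
        raise ValueError(f"only {n} usable values; too few for a Benford screen")
    counts = Counter(digits)
    rows = []
    for d in range(1, 10):
        observed = counts[d] / n
        expected = math.log10(1 + 1 / d)
        rows.append((d, observed, expected))
    mad = sum(abs(obs - exp) for _, obs, exp in rows) / 9
    return rows, mad
```

A larger deviation only ranks a population for follow-up. The examiner still has to rule out assigned or bounded numbers and then examine the flagged items directly.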
Ratio and textual analysis
Beyond transaction tests, examiners use ratio analysis to expose distortions. The relative size factor (RSF) test compares the largest value in a group (say, a single vendor's biggest invoice) against the next largest; a ratio far above those of comparable groups can signal an error or a fabricated payment. A related test compares the maximum value in a group with the minimum, and an unusually wide spread highlights amounts that do not belong. Textual analytics extends the same logic to unstructured data, scanning emails, memos, and chat logs for fraud-indicative keywords and pressure language. Because any one test throws off false positives, examiners rank and prioritize results, then confirm the strongest leads with focused document review before drawing conclusions.
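As a rough sketch of the RSF test, the function below divides each group's largest value by its second largest. The field names and sample invoices are hypothetical, and skipping groups with fewer than two values or a non-positive second value is a simplifying assumption.

```python
from collections import defaultdict


def relative_size_factors(rows, group_key, value_key):
    """Return {group: largest value / second-largest value} for each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])

    factors = {}
    for group, values in groups.items():
        if len(values) < 2:
            continue  # no second value to compare against
        largest, second = sorted(values, reverse=True)[:2]
        if second > 0:
            factors[group] = largest / second
    return factors


# Hypothetical invoices: vendor V001 has one invoice far above its others.
invoices = [
    {"vendor": "V001", "amount": 100},
    {"vendor": "V001", "amount": 125},
    {"vendor": "V001", "amount": 1250},
    {"vendor": "V002", "amount": 500},
    {"vendor": "V002", "amount": 400},
    {"vendor": "V003", "amount": 75},
]

# Rank groups so the widest gaps are reviewed first.
ranked = sorted(
    relative_size_factors(invoices, "vendor", "amount").items(),
    key=lambda item: item[1],
    reverse=True,
)
print(ranked)
```

Sorting by factor supports the ranking step described above: the highest ratios go to document review first, and a common innocent cause (such as a misplaced decimal point) is still an error worth correcting.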
Data mining and continuous monitoring
Data mining searches large data sets for hidden relationships using clustering, classification, and predictive scoring. Continuous monitoring embeds these tests into ongoing operations so red flags surface in near real time rather than months later. The value of any analysis depends entirely on data validity: the examiner must confirm that extracts are complete, that control totals reconcile back to the source system, and that formats and fields are normalized before drawing any conclusion. Garbage in, garbage out applies with full force, and undocumented data handling can render otherwise sound findings worthless in court.
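The completeness check described above can be sketched as a comparison of record counts and control totals between the extract and figures reported by the source system. The function name, the "amount" field, and the sample values are assumptions for illustration; Decimal is used so that currency totals compare exactly.

```python
from decimal import Decimal


def reconcile(extract_rows, source_count, source_total, amount_key="amount"):
    """Compare an extract's record count and amount total to source control totals."""
    extract_count = len(extract_rows)
    extract_total = sum(
        (Decimal(str(row[amount_key])) for row in extract_rows), Decimal("0")
    )
    expected_total = Decimal(str(source_total))
    return {
        "count_matches": extract_count == source_count,
        "total_matches": extract_total == expected_total,
        "count_difference": extract_count - source_count,
        "total_difference": extract_total - expected_total,
    }


# Hypothetical extract and control totals taken from the source system.
extract = [{"amount": "100.10"}, {"amount": "200.20"}, {"amount": "0.30"}]
print(reconcile(extract, source_count=3, source_total="300.60"))
```

Saving the returned dictionary with the working papers gives a dated record that the extract was reconciled before any test was run.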
Computer and Digital Forensics
Digital forensics is the identification, preservation, extraction, analysis, and documentation of electronically stored information (ESI) so that it is reliable and admissible. The single governing principle is to protect the integrity of the original evidence — the examiner must never conduct analysis on the original media.
Preservation and imaging
- Forensic imaging — create a bit-stream (bit-by-bit) copy of the entire drive, capturing slack space, unallocated space, and deleted files, then analyze the copy, never the source.
- Hashing — compute a cryptographic hash value (such as MD5 or SHA-256) of both the original and the image; matching hashes prove the copy is identical and unaltered.
- Write blockers — hardware or software that prevents any change to the source media during acquisition.
- Chain of custody — document who handled the evidence, when, where, and why, from seizure through to the courtroom, with no unexplained gaps.
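Hash verification and a custody log entry can be sketched with the Python standard library. This is an illustration under stated assumptions, not a forensic tool: it presumes the source is read through a write blocker, the function names and log fields are hypothetical, and SHA-256 is used because MD5 is no longer considered collision-resistant.

```python
import hashlib
from datetime import datetime, timezone


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large images do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_image(source_path, image_path):
    """Return (match, source_hash, image_hash) for a source and its image."""
    source_hash = sha256_of(source_path)
    image_hash = sha256_of(image_path)
    return source_hash == image_hash, source_hash, image_hash


def custody_entry(handler, action, location, reason, evidence_hash):
    """Build one chain-of-custody record: who, when, where, why, and the hash."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "handler": handler,
        "action": action,
        "location": location,
        "reason": reason,
        "sha256": evidence_hash,
    }
```

Re-running `sha256_of` on the image before each analysis session, and logging the result, shows that the working copy still matches the value recorded at acquisition.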
Metadata and hidden data
Metadata — literally "data about data" — records when a file was created, modified, or last accessed, who authored it, and its revision history. It can prove that a document was backdated or reveal a document's true origin. Forensic tools also recover deleted files, examine slack space, parse email headers, and reconstruct internet and application activity, all of which can corroborate or refute an allegation.
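A simple backdating check can be sketched as a comparison between the date a document claims and the creation time recorded in its metadata. The sketch below assumes the embedded creation timestamp has already been extracted by a forensic tool; both function names are hypothetical. It also reads file-system timestamps from a working copy, since opening files on original media can itself alter access times.

```python
import os
from datetime import date, datetime, timezone


def fs_timestamps(path):
    """Read last-modified and last-accessed times (UTC) from a working copy."""
    st = os.stat(path)
    return {
        "modified": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
        "accessed": datetime.fromtimestamp(st.st_atime, tz=timezone.utc),
    }


def possibly_backdated(stated_date, created_at):
    """True if the metadata creation time falls after the date the document claims.

    Assumes `created_at` is timezone-aware; the comparison uses its calendar
    date, so confirm which time zone the stated date refers to before relying on it.
    """
    return created_at.date() > stated_date


# Hypothetical example: a memo dated 15 January 2023 but created on 2 March 2023.
memo_date = date(2023, 1, 15)
embedded_created = datetime(2023, 3, 2, 14, 30, tzinfo=timezone.utc)
print(possibly_backdated(memo_date, embedded_created))
```

A positive result is a lead rather than a conclusion: system clocks can be wrong, and copying or converting a file can reset its timestamps, so the finding needs corroboration from other sources such as email headers or server logs.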
E-discovery and spoliation
E-discovery is the process of identifying, collecting, reviewing, and producing ESI in litigation. Once litigation is reasonably anticipated, the organization must issue a legal hold to suspend routine deletion. Failure to preserve relevant ESI is spoliation, which can trigger monetary sanctions or an adverse-inference instruction that permits or directs the jury to infer that the destroyed evidence was unfavorable to the party that failed to preserve it. Because forensic work routinely implicates privacy, employment, and search-and-seizure law, the CFE coordinates with legal counsel and qualified technical specialists before acquiring any device.
Under Benford's Law, approximately how often is the digit 1 expected to appear as the leading digit in a large set of naturally occurring numbers?
A CFE seizes a suspect's laptop. Which approach best protects the integrity of the digital evidence?
Which CAAT test is most directly aimed at detecting double payments of the same invoice?