16.3 Data Governance, SQL, and Analytics
Key Takeaways
- Data governance assigns ownership, quality expectations, security requirements, retention rules, and permitted uses across the entire data life cycle.
- Candidates should distinguish data warehouses, data lakes, data marts, star schemas, snowflake schemas, data dictionaries, and normalized relational tables.
- Relational integrity depends on primary keys, foreign keys, referential integrity, validation rules, and metadata describing tables and fields.
- SQL questions test whether a query retrieves a relevant and complete data set for the objective, not whether a candidate can build an application.
- Analytics are reliable only when source data is sound, transformations are understood, and results reconcile back to control totals and the business question.
Data Governance and the Data Life Cycle
Data governance is the set of roles, policies, standards, and monitoring activities that keep data reliable and usable. It answers practical questions: Who owns the customer master file? Which fields are required? How long is payroll data retained? Who may extract data? What quality checks run before a report is trusted?
ISC Area I includes the data life cycle. A common five-stage model is definition, capture (creation), use, archival/storage, and destruction (disposition). Controls follow the data:
- Definition: standards, data dictionary, ownership, and classification are set.
- Capture: validation and authorization at entry.
- Use: integrity, access restriction, and change history.
- Storage/archival: security, encryption, backup, retention, and retrieval.
- Destruction: approved deletion or anonymization consistent with policy and law.
Data quality is usually described along dimensions such as completeness, accuracy, consistency, timeliness, validity, and uniqueness. A fact pattern naming one dimension is steering you to the control that protects it.
Governance also assigns clear roles. A data owner (often a business executive) is accountable for the data's classification, quality, and permitted uses; a data steward enforces standards day to day; a data custodian (typically IT) safeguards storage and access. The exam tests whether the right role is performing a control. For example, IT custodians should not unilaterally approve who may see payroll data; the business data owner sets that policy. Retention and disposition tie back to legal and regulatory drivers such as records-retention statutes and privacy laws, so deletion must be approved and documented, never ad hoc.
Storage and Schema Terms
| Term | Meaning | CPA use case |
|---|---|---|
| Data warehouse | Curated, structured data optimized for reporting | Financial dashboards and trend analysis |
| Data lake | Broad store of structured and unstructured raw data | Exploratory analytics before modeling |
| Data mart | Subject-area subset of a warehouse | Sales, payroll, or procurement reporting |
| Star schema | One fact table linked to denormalized dimension tables | Fast reporting by product, customer, date |
| Snowflake schema | More normalized dimensions (more tables, more joins) | Reduced redundancy at cost of query complexity |
| Data dictionary | Metadata about fields, formats, and owners | Understanding definitions before extraction |
| ETL / ELT | Extract, transform, load processes | Moving and reshaping data into the warehouse |
A relational database uses primary keys to uniquely identify records, foreign keys to link related tables, and referential integrity rules to prevent orphan records. Normalization stores each fact once to reduce redundancy and update anomalies; the tradeoff is that reporting then requires joins or a warehouse layer.
SQL Review for Completeness and Relevance
Candidates need not memorize every function. They must read standard SQL structure: SELECT lists fields, FROM names tables, JOIN connects tables, WHERE filters rows, GROUP BY summarizes, HAVING filters groups, and aggregates SUM, COUNT, AVG, MIN, MAX compute totals. Join type drives population:
- INNER JOIN keeps only rows matching in both tables, so unmatched records vanish.
- LEFT (OUTER) JOIN keeps all left-table rows even when no match exists on the right.
Worked example. A query to find all vendor payments uses payments p INNER JOIN vendors v ON p.vendor_id = v.id. Any payment with a null or invalid vendor_id is silently dropped, understating disbursements and masking exactly the unauthorized payments an auditor wants to see. A LEFT JOIN preserves them.
A query can be syntactically valid yet wrong for the objective. A revenue query selecting only posted invoices omits unbilled shipments. A date filter on invoice_date rather than shipment_date misses cutoff issues. Always confirm the population matches the objective before trusting the output.
Integrating Data for Analytics
Analytics often combine ERP tables, bank files, payroll systems, ticketing data, and external sources. Integration creates risk because systems use different keys, date formats, currencies, time zones, or definitions. Before relying on analytics, reconcile record counts, hash totals, control totals, and key fields back to source reports, and confirm transformations did not duplicate or drop rows.
High-value analytics for accountants:
- Duplicate vendor payments matched by invoice number, amount, and vendor.
- Round-dollar journal entries posted late at night or by unusual users.
- Sales orders shipped before credit approval.
- Payroll payments to terminated employees (joining payroll to the HR termination file).
- Inventory items with negative quantities or stale movement dates.
- Gaps or duplicates in sequential document numbers (completeness of transactions).
Exam Focus
When a question shows a query or report extract, ask three things: Does the data population match the objective? Are filters excluding needed records? Are joins or grouping creating duplicates or omissions? The best answer usually protects completeness first, then relevance, accuracy, and proper interpretation.
Finally, judge the analytic technique against the question. Descriptive analytics summarize what happened; diagnostic analytics explain why; predictive analytics estimate what will happen; and prescriptive analytics recommend action. A 100% transaction test (rather than a sample) is feasible when full populations are available, but it raises the stakes on data reliability: testing every row of a flawed extract simply produces precise-looking nonsense. Document the data source, extraction logic, and reconciliation so the work is reproducible and defensible.
An analyst wants to identify all shipments made before billing. The query selects from the invoice table and filters invoice_date before year-end, but it does not join to the shipment table. What is the main problem?
Which database design feature most directly prevents a sales-order line from referencing a product ID that does not exist in the product master table?
An auditor runs payments INNER JOIN vendors to total disbursements, but some payments have a null vendor_id. What is the most likely effect?