5.3 Data Lineage and Audit Logging
Key Takeaways
- Unity Catalog captures data lineage automatically — no configuration — down to the column level across notebooks, SQL queries, and jobs.
- Lineage is aggregated across all workspaces attached to the metastore and viewable in Catalog Explorer or via the lineage system tables and REST API.
- Column-level lineage requires both source and target to be referenced by table name; path-based references are not captured.
- System tables (the `system` catalog) expose operational data: audit logs in `system.access.audit`, lineage, query history, and billable usage.
- Audit logs record every governed action (who did what, when) and can be shipped to a SIEM via cloud-native log delivery.
Quick Answer: Unity Catalog captures data lineage automatically — with zero configuration — down to the column level, across notebooks, SQL, and jobs, aggregated across every workspace on the metastore. Governed actions are recorded in audit logs, surfaced through system tables (notably
system.access.audit), which you query in SQL or ship to a SIEM.
Automatic Data Lineage
Data lineage answers "where did this data come from, and what depends on it?" In Unity Catalog, lineage is captured automatically as queries run — you do not instrument anything. UC tracks lineage for any read or write performed through Databricks compute: notebooks, SQL queries/dashboards, and jobs. Crucially, lineage is captured to the column level — you can see that gold.revenue.net was derived from silver.orders.amount minus silver.orders.discount. Lineage is aggregated across all workspaces attached to the metastore, giving one account-wide picture.
How and Where Lineage Is Viewed — and Its Limits
You view lineage in Catalog Explorer as an interactive graph (upstream and downstream), or retrieve it programmatically via the lineage system tables and the lineage REST API. Lineage covers tables, views, columns, notebooks, jobs/workflows, dashboards, and ML models.
Know these limits for the exam:
| Aspect | Behavior |
|---|---|
| Capture trigger | Automatic, when a query executes on Databricks compute |
| Granularity | Column-level (and table-level) |
| Column lineage requirement | Both source and target must be referenced by table name |
| Path references | If a query reads/writes by file path, column lineage is not captured |
| Retention | Lineage data is retained for a rolling window (e.g., ~1 year) |
| Scope | Only operations run through Databricks are captured |
The most common trap: column lineage is lost when you reference data by path (e.g., s3://…) instead of by a UC table name.
System Tables and Audit Logging
System tables are read-only, Databricks-managed tables in the system catalog that expose operational telemetry as queryable Delta data. They must be enabled per schema by an admin. Key system tables:
system.access.audit— the audit log: every governed action with actor, action, parameters, timestamp, and result.system.access.table_lineageandsystem.access.column_lineage— lineage records.system.query.history— executed queries.system.billing.usage— billable consumption.
Audit logs capture who did what, when — grants, table creates/drops, data reads, share access, login events, and more. Because audit data is just a Delta table, you can analyze it with ordinary SQL (e.g., "who queried the PII table last month?") or deliver it to a SIEM via cloud-native log delivery (CloudWatch, Azure Monitor, Google Cloud Logging). For compliance, audit logs are the authoritative record of access to governed data.
SELECT user_identity.email, action_name, event_time
FROM system.access.audit
WHERE service_name = 'unityCatalog'
ORDER BY event_time DESC;
Why Lineage and Auditing Matter
Lineage and audit logging are the evidence layer of governance. Lineage answers impact and provenance questions: if a Bronze table is wrong, lineage shows every downstream Silver/Gold table, dashboard, and model affected; conversely, it shows the trusted upstream source of a reported number. This is essential for impact analysis, debugging, and demonstrating data provenance to auditors. Audit logging answers accountability questions: who read the PII table, who granted access, who dropped a table, and when.
Working With System Tables
System tables turn governance telemetry into ordinary, queryable Delta data, so you analyze it with the same SQL you use for business data. Because they are Delta tables, you can join them — for example, join system.access.audit to a list of sensitive tables to find unauthorized reads. Common analyses include:
| Question | System table |
|---|---|
| Who accessed our PII table last month? | system.access.audit |
| What feeds this Gold table? | system.access.table_lineage |
| Which queries scanned the most data? | system.query.history |
| What did this workload cost? | system.billing.usage |
System tables must be enabled per schema by a metastore admin before they populate, and they are read-only. For long-term retention or security tooling, audit data is also delivered to external systems (a SIEM) via cloud-native log delivery, where it can drive alerting on suspicious access patterns.
Putting Lineage and Audit Together
In practice, lineage and audit logs are complementary halves of governance observability. Lineage is structural — it maps the static relationships between datasets, columns, notebooks, jobs, dashboards, and models, captured automatically as workloads run. Audit logs are behavioral — they record the time-ordered stream of actions principals took. A typical investigation uses both: lineage to find which downstream Gold tables were built from a compromised source, then the audit log to see who queried or exported those tables afterward.
Neither requires manual instrumentation; both are enabled by Unity Catalog's centralized model, where every governed action across every attached workspace funnels into one metastore. The system.access.audit table is the authoritative record for compliance reporting, queryable from Delta system tables.
Lineage and audit quick recap
- Unity Catalog captures column-level lineage automatically for queries that read/write UC tables (the table must be named, not anonymous).
- Lineage spans tables, notebooks, jobs, and dashboards, viewable in Catalog Explorer or via the lineage system tables.
system.access.auditrecords who did what and when, supporting security investigations and compliance evidence.
How does Unity Catalog capture column-level data lineage for a Databricks SQL query?
A job writes a Delta table by reading from a raw file path (s3://bucket/raw/) rather than a Unity Catalog table name. What is the lineage impact?
Which system table holds Unity Catalog audit logs of governed actions such as grants, reads, and table drops?