5.8 Compliance and PII Data Management
Key Takeaways
- Tag tables and columns with metadata labels (e.g., "PII", "GDPR", "PHI") in Unity Catalog to track sensitive data across the organization.
- Information schema tables enable programmatic discovery of all tables and columns tagged with specific sensitivity labels.
- Right to deletion (GDPR Article 17) can be implemented using DELETE FROM or MERGE INTO with a list of user IDs to remove.
- VACUUM must be run after deletion to physically remove the data files — Delta time travel could otherwise still access deleted records.
- Data retention policies should be implemented using scheduled jobs that delete or archive data beyond the retention period.
Compliance and PII Data Management
Quick Answer: Tag sensitive columns with metadata labels in Unity Catalog for discovery. Implement right-to-deletion with DELETE + VACUUM. Use scheduled jobs for data retention. Combine column masks and row filters for access control.
Tagging Sensitive Data
-- Add tags to tables
ALTER TABLE prod.sales.customers SET TAGS ('sensitivity' = 'PII', 'regulation' = 'GDPR');
-- Add tags to columns
ALTER TABLE prod.sales.customers ALTER COLUMN email SET TAGS ('pii_type' = 'email');
ALTER TABLE prod.sales.customers ALTER COLUMN ssn SET TAGS ('pii_type' = 'ssn', 'sensitivity' = 'high');
ALTER TABLE prod.sales.customers ALTER COLUMN phone SET TAGS ('pii_type' = 'phone');
-- Remove tags
ALTER TABLE prod.sales.customers UNSET TAGS ('sensitivity');
Discovering Tagged Data
-- Find all columns tagged as PII
SELECT
table_catalog,
table_schema,
table_name,
column_name,
tag_name,
tag_value
FROM system.information_schema.column_tags
WHERE tag_name = 'pii_type';
Right to Deletion (GDPR)
-- Step 1: Delete user records
DELETE FROM prod.sales.customers
WHERE customer_id IN (SELECT customer_id FROM deletion_requests);
DELETE FROM prod.sales.orders
WHERE customer_id IN (SELECT customer_id FROM deletion_requests);
-- Step 2: VACUUM to physically remove data files
-- (without VACUUM, time travel could access deleted records)
VACUUM prod.sales.customers RETAIN 0 HOURS;
VACUUM prod.sales.orders RETAIN 0 HOURS;
-- Note: RETAIN 0 HOURS requires disabling the safety check:
-- SET spark.databricks.delta.retentionDurationCheck.enabled = false;
Important: VACUUM After Deletion
- DELETE removes records from the logical table (current version)
- But the data still exists in old Parquet files (accessible via time travel)
- VACUUM physically removes the old files, making the deletion permanent
- For GDPR compliance, VACUUM must run with a short retention period
Data Retention Policies
-- Scheduled job: delete data older than 2 years
DELETE FROM prod.logs.user_activity
WHERE event_date < date_sub(current_date(), 730);
-- Archive before deletion
INSERT INTO prod.archive.user_activity
SELECT * FROM prod.logs.user_activity
WHERE event_date < date_sub(current_date(), 730);
DELETE FROM prod.logs.user_activity
WHERE event_date < date_sub(current_date(), 730);
VACUUM prod.logs.user_activity;
Compliance Checklist
| Requirement | Implementation |
|---|---|
| Data discovery | Tag tables/columns with sensitivity labels |
| Access control | GRANT/REVOKE + column masks + row filters |
| Audit trail | Unity Catalog audit logs (system.access.audit) |
| Data lineage | Automatic Unity Catalog lineage |
| Right to deletion | DELETE + VACUUM |
| Data retention | Scheduled deletion/archival jobs |
| Encryption | Cloud-native encryption at rest + in transit |
| Cross-border | Regional metastores + location-aware catalogs |
On the Exam: Understand that DELETE alone is not sufficient for GDPR compliance — VACUUM must also run to physically remove data from storage. Know that tags help discover and classify sensitive data across the organization.
A company receives a GDPR deletion request. A data engineer runs DELETE FROM customers WHERE customer_id = 12345. Is the data fully deleted?
How can a data governance team find all columns in the Lakehouse that contain email addresses?
You've completed this section
Continue exploring other exams