5.5 Lakehouse Federation
Key Takeaways
- Lakehouse Federation lets you query external data systems (MySQL, PostgreSQL, Snowflake, Redshift, BigQuery, SQL Server, etc.) from Databricks without ingesting or copying the data.
- A connection stores the credentials and host details for an external system; a foreign catalog mirrors that system's databases as Unity Catalog objects.
- Federated tables are fully governed by Unity Catalog — the same GRANT/REVOKE, lineage, and audit apply.
- Federation is data virtualization for querying live external sources in place, distinct from Delta Sharing's outbound distribution.
- Query pushdown sends filters and aggregations to the source system when possible to reduce data movement.
Quick Answer: Lakehouse Federation lets you query external data systems directly from Databricks without ingesting or copying the data. You create a connection (host + credentials) to the external system, then a foreign catalog that mirrors its databases as Unity Catalog objects. Federated tables are fully governed by UC — same grants, lineage, and audit. It is data virtualization, not data movement.
What Federation Solves
Organizations rarely keep all data in one place. Operational data may live in PostgreSQL or MySQL, analytics in Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse, or SQL Server. Traditionally, querying it in Databricks meant building ingestion pipelines that copy and replicate the data — adding latency, cost, and drift. Lakehouse Federation removes the copy: you query the source in place, live, from Databricks SQL or notebooks. The external system's data appears as ordinary tables you can join against your lakehouse data, all under one governance umbrella.
Connections and Foreign Catalogs
Two Unity Catalog objects make federation work:
| Object | Purpose |
|---|---|
| Connection | A securable that stores the path/host and credentials for an external system (e.g., a PostgreSQL server). Reusable across catalogs. |
| Foreign catalog | A UC catalog that mirrors an external database, so its schemas and tables show up in the three-level namespace and are queryable. |
The workflow: an admin creates a connection to the source, then creates a foreign catalog from that connection. After that, users browse and query the external tables exactly like native UC tables, e.g. SELECT * FROM pg_foreign.public.customers JOIN main.sales.orders …. Privileges CREATE CONNECTION, USE CONNECTION, and CREATE FOREIGN CATALOG govern who can set this up. Catalog (HMS) federation is a related capability that mounts an external or legacy Hive metastore as a foreign catalog.
Governance, Pushdown, and Federation vs. Delta Sharing
Because a foreign catalog is a real Unity Catalog catalog, federated data inherits the full UC model: you GRANT/REVOKE on the foreign catalog's objects, lineage is tracked, and access is audited — the same controls as native tables. To stay efficient, federation uses query pushdown: filters, projections, and aggregations are pushed down to the source engine when possible, so only the needed results travel back to Databricks rather than the whole table.
Do not confuse the two interoperability features:
| Lakehouse Federation | Delta Sharing | |
|---|---|---|
| Direction | Inbound — you query others' systems | Outbound — you share your data |
| Movement | No ingest; query in place | No copy; recipient reads live |
| Use case | Unified internal access / virtualization | Cross-org / cross-platform distribution |
Mnemonic: Federation pulls in; Sharing sends out.
Supported Sources and Setup Privileges
Lakehouse Federation ships with built-in connectors for the most common external systems, so you do not write custom integration code:
| Category | Example sources |
|---|---|
| Operational databases | MySQL, PostgreSQL, SQL Server |
| Cloud warehouses | Snowflake, Amazon Redshift, Google BigQuery, Azure Synapse |
| Other | Databricks (another workspace/metastore), Hive metastore (HMS) |
Setting up federation requires admin-level privileges: CREATE CONNECTION to register the source credentials, CREATE FOREIGN CATALOG to mirror it, and then ordinary USE CATALOG/USE SCHEMA/SELECT grants so analysts can query the foreign tables. Because the connection holds credentials, only trusted admins should be able to create one; analysts merely consume the resulting foreign catalog.
When to Federate vs. Ingest
Federation is ideal for ad-hoc exploration, occasional joins, and avoiding duplication of data that already has an authoritative home elsewhere. It is not a replacement for ingestion in every case: if a table is queried constantly with heavy aggregations, repeatedly pushing those queries to a busy operational database can strain that source and add latency. , via Lakeflow Connect or Auto Loader) and serving it from optimized Delta tables performs better.
The exam-relevant judgment: choose federation when you want live access without copies and the query load is modest; choose ingestion when you need high performance, heavy transformation, or to offload a fragile source system.
How Federation Fits the Governance Story
The defining benefit of Lakehouse Federation is unified governance over data you did not move. Once an external PostgreSQL or Snowflake database is mounted as a foreign catalog, its tables behave like any other Unity Catalog object: you GRANT/REVOKE on them, their reads appear in audit logs, and lineage tracks how federated data flows into downstream lakehouse tables. This lets a data engineer give analysts a single, consistent place to query everything — native Delta tables and external systems alike — under one permission model, without the cost and staleness of replication.
The recurring exam points are that federation queries in place without ingesting, that the connection holds credentials while the foreign catalog exposes the schemas, and that query pushdown sends filters and aggregations to the source to minimize data transfer. Remember the contrast with Delta Sharing one more time: Federation pulls external data in for unified internal access; Delta Sharing sends your data out to external recipients. In short, federation extends Unity Catalog's reach beyond data physically stored in the lakehouse, making external systems first-class, governed, queryable citizens of the namespace.
A team needs to join customer data living in an external PostgreSQL database with sales data in Databricks, without building an ingestion pipeline. Which capability fits?
In Lakehouse Federation, what does a foreign catalog provide?
Which statement correctly distinguishes Lakehouse Federation from Delta Sharing?