Databricks Unity Catalog Integration Reference
Databricks Unity Catalog allows you to manage and access data in your Databricks account across all of your workspaces and introduces fine-grained access controls in Databricks.
Immuta’s integration with Unity Catalog allows you to manage multiple Databricks workspaces through Unity Catalog while protecting your data with Immuta policies. Instead of manually creating UDFs or granting access to each table in Databricks, you can author your policies in Immuta and have Immuta manage and enforce Unity Catalog access-control policies on your data in Databricks clusters or SQL warehouses:
- Subscription policies: Immuta subscription policies automatically grant and revoke access to Databricks tables.
- Data policies: Immuta data policies enforce row- and column-level security without creating views, so users can query tables as they always have without their workflows being disrupted.
Unity Catalog object model
Unity Catalog uses the following hierarchy of data objects:
- Metastore: Created at the account level and attached to one or more Databricks workspaces. The metastore contains metadata for all the catalogs, schemas, and tables available to query. All clusters on a workspace use the configured metastore, and all workspaces configured to use the same metastore share those objects.
- Catalog: A catalog sits on top of schemas (also called databases) and tables to manage permissions across a set of schemas.
- Schema: Organizes tables and views.
- Table: Tables can be managed or external tables.
For details about the Unity Catalog object model, see the Databricks Unity Catalog documentation.
Feature support
The Databricks Unity Catalog integration supports
- managing and accessing data across multiple Databricks workspaces
- enforcing Unity Catalog row-, column-, and table-level access controls on Databricks clusters and SQL warehouses:
  - applying column masking and row-redaction policies on tables
  - applying subscription policies on tables and views
  - applying NULL masking policies on ARRAY, MAP, or STRUCT type columns
- enforcing Unity Catalog access controls, even if Immuta becomes disconnected
- auditing activity of both Immuta users and non-Immuta users
- Delta and Parquet files
- allowing non-Immuta reads and writes
- using Photon
- using a proxy server
Architecture
Unity Catalog supports managing permissions at the Databricks account level through controls applied directly to objects in the metastore. To interact with the metastore and apply controls to any table, Immuta requires a personal access token (PAT) for an Immuta service principal with permissions to manage all data protected by Immuta. See the permissions requirements section for a list of specific Databricks privileges.
Immuta uses this service principal to run queries that set up all the tables, user-defined functions (UDFs), and other data necessary for policy enforcement. Upon enabling the native integration, Immuta will create a catalog named after your provided workspaceName that contains two schemas:
- immuta_system: Contains internal Immuta data.
- immuta_policies: Contains policy UDFs.
When policies require changes to be pushed to Unity Catalog, Immuta updates the internal tables in the immuta_system schema with the updated policy information. If necessary, new UDFs are pushed to replace any out-of-date policies in the immuta_policies schema and any row filters or column masks are updated to point at the new policies. Many of these operations require compute on the configured Databricks cluster or SQL endpoint, so compute must be available for these policies to succeed.
Catalog isolation
Immuta’s Databricks Unity Catalog integration manages a single metastore per integration. Prior to catalog isolation, a Unity Catalog metastore in Databricks was available unrestricted across workspaces in Databricks, which made integrating against a metastore independent of the workspace attached to that metastore possible. However, when catalog isolation is enabled on a Databricks workspace, you can assign catalogs available in the Unity Catalog metastore to that workspace. As a result, Immuta cannot see all catalogs in a metastore when integrated with that workspace. This behavior is problematic for customers who use catalog isolation to separate environments or business units, as Immuta cannot see all the data that may need to be governed.
To avoid this issue,
- Set up a dedicated Databricks workspace for Immuta that has catalog isolation disabled so that Immuta can see all data in the metastore.
- Configure Immuta’s Databricks Unity Catalog integration with that workspace to govern all data in the metastore.
Other workspaces that have catalog isolation enabled can continue to function in Databricks as they do today. For more information on catalog isolation, see the official Databricks documentation.
Policy enforcement
Immuta’s Unity Catalog integration applies Databricks table-, row-, and column-level security controls that are enforced natively within Databricks. Immuta's management of these Databricks security controls is automated and ensures that they synchronize with Immuta policy or user entitlement changes.
- Table-level security: Immuta manages REVOKE and GRANT privileges on securable objects in Databricks through subscription policies. When you create a subscription policy in Immuta, Immuta uses the Unity Catalog API to issue GRANTS or REVOKES against the catalog, schema, or table in Databricks for every user affected by that subscription policy.
- Row-level security: Immuta applies SQL UDFs to restrict access to rows for querying users.
- Column-level security: Immuta applies column-mask SQL UDFs to tables for querying users. These column-mask UDFs run for any column that requires masking.
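As a rough illustration of the row- and column-level controls described above, the sketch below builds the Databricks SQL statements that attach a row filter or a column mask UDF to a table. It is not Immuta's implementation: the helper names, the `immuta_ws` catalog, and the UDF names `rf_region` and `mask_ssn` are hypothetical, and only the `ALTER TABLE ... SET ROW FILTER` and `ALTER TABLE ... ALTER COLUMN ... SET MASK` statement forms come from Databricks SQL.

```python
import re

# Plain identifiers only; this sketch does not handle backtick-quoted names.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def qualified(*parts: str) -> str:
    """Join identifier parts with dots, rejecting anything that is not a plain identifier."""
    for part in parts:
        if not _IDENT.match(part):
            raise ValueError(f"unsupported identifier: {part!r}")
    return ".".join(parts)


def row_filter_sql(table: str, function: str, columns: list[str]) -> str:
    """Build the statement that attaches a row filter UDF to a table."""
    cols = ", ".join(qualified(c) for c in columns)
    return f"ALTER TABLE {table} SET ROW FILTER {function} ON ({cols})"


def column_mask_sql(table: str, function: str, column: str) -> str:
    """Build the statement that attaches a column mask UDF to one column."""
    return f"ALTER TABLE {table} ALTER COLUMN {qualified(column)} SET MASK {function}"


# Hypothetical names, for illustration only.
table = qualified("sales", "crm", "customers")
row_filter = qualified("immuta_ws", "immuta_policies", "rf_region")
ssn_mask = qualified("immuta_ws", "immuta_policies", "mask_ssn")

print(row_filter_sql(table, row_filter, ["region"]))
print(column_mask_sql(table, ssn_mask, "ssn"))
```

Because the controls live on the table itself, users keep querying `sales.crm.customers` directly instead of switching to a policy view.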
The Unity Catalog integration supports the following policy types:
- Subscription policies
- Masking policies
  - Conditional masking
  - Constant
  - Custom masking
  - Hashing
  - Null
  - Regex: You must use the global regex flag (g) and cannot use the case insensitive regex flag (i) when creating a regex masking policy in this integration. See the limitations section for examples.
  - Rounding (date and numeric rounding)
- Row-level policies
  - Matching (only show rows where)
    - Custom WHERE
    - Never
    - Where user
    - Where value in column
  - Minimization
  - Time-based restrictions
Project-scoped purpose exceptions for Databricks Unity Catalog
Project-scoped purpose exceptions for Databricks Unity Catalog integrations allow you to apply purpose-based policies to Databricks data sources in a project. As a result, users can only access that data when they are working within that specific project.
Databricks Unity Catalog views
If you are using views in Databricks Unity Catalog, one of the following must be true for project-scoped purpose exceptions to apply to the views in Databricks:
- The view and underlying table are registered as Immuta data sources and added to a project: If a view and its underlying table are both added as Immuta data sources, both of these assets must be added to the project for the project-scoped purpose exception to apply. If a view and underlying table are both added as data sources but the table is not added to an Immuta project, the purpose exception will not apply to the view because Databricks does not support fine-grained access controls on views.
- Only the underlying table is registered as an Immuta data source and added to a project: If only the underlying table is registered as an Immuta data source but the view is not registered, the purpose exception will apply to both the table and corresponding view in Databricks. Views are the only Databricks object that will have Immuta policies applied to them even if they're not registered as Immuta data sources (as long as their underlying tables are registered).
Masked joins for Databricks Unity Catalog
Private preview
This feature is available to select accounts. Reach out to your Immuta representative to enable this feature.
This feature allows masked columns to be joined across data sources that belong to the same project. When data sources do not belong to a project, Immuta uses a unique salt per data source for hashing to prevent masked values from being joined. (See the Why use masked joins? guide for an explanation of that behavior.) However, once you add Databricks Unity Catalog data sources to a project and enable masked joins, Immuta uses a consistent salt across all the data sources in that project to allow the join.
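The effect of the salt on joins can be seen in a small sketch. It assumes SHA-256 over salt plus value purely for illustration; the hashing algorithm Immuta actually uses is not stated here, and the salt strings and sample emails are made up.

```python
import hashlib


def mask(value: str, salt: str) -> str:
    """Hash a value with a salt (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def masked_column(values: list[str], salt: str) -> set[str]:
    """Mask every value in a column with the same salt."""
    return {mask(v, salt) for v in values}


customers = ["ada@example.com", "bob@example.com"]
orders = ["ada@example.com", "eve@example.com"]

# Outside a project: each data source has its own salt, so masked values never line up.
without_project = masked_column(customers, "salt-customers") & masked_column(orders, "salt-orders")

# Inside a project with masked joins enabled: one shared salt, so equal inputs match.
with_project = masked_column(customers, "salt-project") & masked_column(orders, "salt-project")

print(len(without_project), len(with_project))
```

The shared value `ada@example.com` joins only when both columns are masked with the same salt, which is the behavior a project-wide salt provides.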
For more information about masked joins and enabling them for your project, see the Masked joins section of documentation.
Policy exemption groups
Some users may need to be exempt from masking and row-level policy enforcement. When you add user accounts to the configured exemption group in Databricks, Immuta will not enforce policies for those users. Exemption groups are created when the Unity Catalog integration is configured, and no policies will apply to these users' queries, despite any policies enforced on the tables they query.
The principal used to register data sources in Immuta will be automatically added to this exemption group for that Databricks table. Consequently, accounts that are added to this group and used to register data sources in Immuta should be limited to service accounts.
Policy support with hive_metastore
When enabling Unity Catalog support in Immuta, the catalog for all Databricks data sources will be updated to point at the default hive_metastore catalog. Internally, Databricks exposes this catalog as a proxy to the workspace-level Hive metastore that schemas and tables were kept in before Unity Catalog. Since this catalog is not a real Unity Catalog catalog, it does not support any Unity Catalog policies. Therefore, Immuta will ignore any data sources in the hive_metastore in any Databricks Unity Catalog integration, and policies will not be applied to tables there.
However, with Databricks metastore magic you can use hive_metastore and enforce subscription and data policies with the Databricks Spark integration.
Authentication methods
The Databricks Unity Catalog integration supports the following authentication methods to configure the integration and create data sources:
- Personal access token (PAT): This is the access token for the Immuta service principal. This service principal must have the metastore privileges listed in the permissions section for the metastore associated with the Databricks workspace. If this token is configured to expire, update this field regularly for the integration to continue to function.
- OAuth machine-to-machine (M2M): Immuta uses the Client Credentials Flow to integrate with Databricks OAuth machine-to-machine authentication, which allows Immuta to authenticate with Databricks using a client secret. Once Databricks verifies the Immuta service principal’s identity using the client secret, Immuta is granted a temporary OAuth token to perform token-based authentication in subsequent requests. When that token expires (after one hour), Immuta requests a new temporary token. See the Databricks OAuth machine-to-machine (M2M) authentication page for more details.
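The token lifecycle in the OAuth M2M flow (obtain a temporary token, reuse it, request a new one when it expires) can be sketched as a small cache. This is an illustrative sketch, not Immuta's code: `TokenCache`, the injected `fetch` callable, and the 60-second refresh margin are all assumptions, and the actual request to the Databricks token endpoint is left to the caller.

```python
import time
from typing import Callable, Optional, Tuple


class TokenCache:
    """Reuse a temporary token until shortly before it expires, then fetch a new one."""

    def __init__(
        self,
        fetch: Callable[[], Tuple[str, float]],  # hypothetical: returns (token, lifetime in seconds)
        clock: Callable[[], float] = time.monotonic,
        skew: float = 60.0,  # assumed safety margin before expiry
    ) -> None:
        self._fetch = fetch
        self._clock = clock
        self._skew = skew
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._expires_at - self._skew:
            # Client credentials exchange happens inside fetch().
            token, lifetime = self._fetch()
            self._token = token
            self._expires_at = now + lifetime
        return self._token
```

In use, `fetch` would perform the client credentials request with the service principal's client ID and secret and return the token together with its lifetime (one hour, per the description above).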
Immuta data sources in Unity Catalog
The Unity Catalog data object model introduces a 3-tiered namespace, as outlined above. Consequently, your Databricks tables registered as data sources in Immuta will reference the catalog, schema (also called a database), and table.
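A minimal sketch of the three-part reference, assuming plain dot-separated names: the `TableRef` type is hypothetical, and the parser deliberately does not handle backtick-quoted identifiers that contain dots.

```python
from typing import NamedTuple


class TableRef(NamedTuple):
    """A catalog.schema.table reference in the Unity Catalog namespace."""

    catalog: str
    schema: str
    table: str

    @classmethod
    def parse(cls, name: str) -> "TableRef":
        """Split a dot-separated name into exactly three non-empty parts."""
        parts = name.split(".")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected catalog.schema.table, got {name!r}")
        return cls(*parts)

    def qualified(self) -> str:
        """Return the fully qualified name."""
        return ".".join(self)


ref = TableRef.parse("main.sales.orders")
print(ref.catalog, ref.schema, ref.table)
```

A two-part name such as `sales.orders` is rejected, which mirrors the requirement that data sources reference all three levels.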
External data connectors and query-federated tables
External data connectors and query-federated tables are preview features in Databricks. See the Databricks documentation for details about the support and limitations of these features before registering them as data sources in the Unity Catalog integration.
Native query audit
Access requirements
For Databricks Unity Catalog audit to work, Immuta must have, at minimum, the following access.
- USE CATALOG on the system catalog
- USE SCHEMA on the system.access schema
- SELECT on the following system tables:
  - system.access.audit
  - system.access.table_lineage
  - system.access.column_lineage
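The privileges listed above could be expressed as Databricks SQL GRANT statements. The helper below is a hedged sketch that only generates the statement text: `audit_grants` and the principal name `immuta-sp` are hypothetical, and who is allowed to grant on the system catalog depends on your Databricks account setup.

```python
AUDIT_TABLES = ("audit", "table_lineage", "column_lineage")


def audit_grants(principal: str) -> list[str]:
    """Return the GRANT statements covering the minimum audit access."""
    if "`" in principal:
        raise ValueError("principal must not contain backticks")
    who = f"`{principal}`"
    grants = [
        f"GRANT USE CATALOG ON CATALOG system TO {who}",
        f"GRANT USE SCHEMA ON SCHEMA system.access TO {who}",
    ]
    # One SELECT grant per system table that the audit ingest reads.
    grants += [f"GRANT SELECT ON TABLE system.access.{t} TO {who}" for t in AUDIT_TABLES]
    return grants


for statement in audit_grants("immuta-sp"):  # hypothetical service principal name
    print(statement)
```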
The Databricks Unity Catalog integration audits user queries run in clusters or SQL warehouses for deployments configured with the Databricks Unity Catalog integration. The audit ingest is set when configuring the integration and the audit logs can be scoped to only ingest specific workspaces if needed.
See the Unity Catalog native audit page for details about manually prompting ingest of audit logs and the contents of the logs.
Tag ingestion
Design partner preview
This feature is only available to select accounts. Reach out to your Immuta representative to enable this feature.
You can enable tag ingestion to allow Immuta to ingest Databricks Unity Catalog table and column tags so that you can use them in Immuta policies to enforce access controls. When you enable this feature, Immuta uses the credentials and connection information from the Databricks Unity Catalog integration to pull tags from Databricks and apply them to data sources as they are registered in Immuta. If Databricks data sources were registered before tag ingestion was enabled, those data sources will automatically sync to the catalog and tags will apply. Immuta checks for changes to tags in Databricks and syncs Immuta data sources to those changes every 24 hours.
Once external tags are applied to Databricks data sources, those tags can be used to create subscription and data policies.
To enable Databricks Unity Catalog tag ingestion, see the Configure a Databricks Unity Catalog integration page.
Syncing tag changes
After making changes to tags in Databricks, you can manually sync the catalog so that the changes immediately apply to the data sources in Immuta. Otherwise, tag changes will automatically sync within 24 hours.
When syncing data sources to Databricks Unity Catalog tags, Immuta pulls the following information:
- Table tags: These tags apply to the table and appear on the data source overview tab. Databricks tags' key and value pairs are reflected in Immuta as a hierarchy with each level separated by a . delimiter. For example, the Databricks Unity Catalog tag Location: US would be represented as Location.US in Immuta.
- Column tags: These tags are applied to data source columns and appear on the columns listed in the data dictionary tab. Databricks tags' key and value pairs are reflected in Immuta as a hierarchy with each level separated by a . delimiter. For example, the Databricks Unity Catalog tag Location: US would be represented as Location.US in Immuta.
- Table comments field: This content appears as the data source description on the data source overview tab.
- Column comments field: This content appears as dictionary column descriptions on the data dictionary tab.
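The key and value mapping for table and column tags can be sketched as follows. The function name is hypothetical, and the handling of a tag with no value (returning the key alone) is an assumption; the text above only describes key and value pairs.

```python
def to_immuta_tag(key: str, value: str = "") -> str:
    """Map a Databricks tag key and value to a dot-delimited hierarchical tag name."""
    key, value = key.strip(), value.strip()
    if not key:
        raise ValueError("tag key must not be empty")
    # Assumption: a tag without a value maps to the key alone.
    return f"{key}.{value}" if value else key


print(to_immuta_tag("Location", "US"))
```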
Limitations
- Only tags that apply to Databricks data sources in Immuta are available to build policies in Immuta. Immuta will not pull tags in from Databricks Unity Catalog unless those tags apply to registered data sources.
- Cost implications: Tag ingestion in Databricks Unity Catalog requires compute resources. Therefore, having many Databricks data sources or frequently manually syncing data sources to Databricks Unity Catalog may incur additional costs.
- Databricks Unity Catalog tag ingestion only supports tenants with fewer than 2,500 data sources registered.
Configuration requirements
See the Enable Unity Catalog guide for a list of requirements.
Supported Databricks cluster configurations
The table below outlines the integrations supported for various Databricks cluster configurations. For example, the only integration available to enforce policies on a cluster configured to run on Databricks Runtime 9.1 is the Databricks Spark integration.
Example cluster | Databricks Runtime | Unity Catalog in Databricks | Databricks Spark integration | Databricks Unity Catalog integration |
---|---|---|---|---|
Cluster 1 | 9.1 | Unavailable | Unavailable | |
Cluster 2 | 10.4 | Unavailable | Unavailable | |
Cluster 3 | 11.3 | / | Unavailable | |
Cluster 4 | 11.3 | |||
Cluster 5 | 11.3 |
Legend:
- The feature or integration is enabled.
- The feature or integration is disabled.
Unity Catalog caveats
- Unity Catalog row- and column-level security controls are unsupported for single-user clusters. See the Databricks documentation for details about this limitation.
- Row access policies with more than 1023 columns are unsupported. This is an underlying limitation of UDFs in Databricks. Immuta will only create row access policies with the minimum number of referenced columns. This limit will therefore apply to the number of columns referenced in the policy and not the total number in the table.
- If you disable table grants, Immuta revokes the grants. Therefore, if users had access to a table before enabling Immuta, they’ll lose access.
- You must use the global regex flag (g) when creating a regex masking policy in this integration, and you cannot use the case insensitive regex flag (i) when creating a regex masking policy in this integration. See the examples below for guidance:
  - regex with a global flag (supported): /^ssn|social ?security$/g
  - regex without a global flag (unsupported): /^ssn|social ?security$/
  - regex with a case insensitive flag (unsupported): /^ssn|social ?security$/gi
  - regex without a case insensitive flag (supported): /^ssn|social ?security$/g
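The regex flag rule above can be checked before a policy is created. The sketch below assumes the expression is written as a `/pattern/flags` literal, as in the examples; `regex_flags_supported` is a hypothetical helper and not part of any Immuta API.

```python
def regex_flags_supported(literal: str) -> bool:
    """Return True when a /pattern/flags literal has the g flag and lacks the i flag."""
    if not literal.startswith("/"):
        raise ValueError("expected a /pattern/flags literal")
    end = literal.rfind("/")
    if end == 0:
        raise ValueError("missing closing slash")
    flags = literal[end + 1:]
    return "g" in flags and "i" not in flags


for candidate in ("/^ssn|social ?security$/g", "/^ssn|social ?security$/", "/^ssn|social ?security$/gi"):
    print(candidate, regex_flags_supported(candidate))
```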
Feature limitations
The following features are currently unsupported:
- Databricks change data feed support
- Immuta projects (Reach out to your Immuta representative to enable support for masked joins to allow you to join masked columns of data sources within a project.)
- Multiple IAMs on a single cluster
- Column masking policies on views
- Mixing masking policies on the same column
- Row-redaction policies on views
- R and Scala cluster support
- Scratch paths
- User impersonation
- Policy enforcement on raw Spark reads
- Python UDFs for advanced masking functions
- Direct file-to-SQL reads
- Data policies (except for masking with NULL) on ARRAY, MAP, or STRUCT type columns
Known issues
- Snippets for Databricks data sources may be empty in the Immuta UI.