Data Sources in Immuta
Data owners expose their data across their organization to other users by registering that data in Immuta as a data source.
By default, data owners can register data in Immuta without affecting the existing policies on those tables in their remote system, so users who had access to a table before it was registered can continue accessing that data without interruption. If this default behavior is disabled on the App Settings page, a subscription policy that requires data owners to manually add subscribers is automatically applied to each new data source (unless a Global policy you create applies), blocking access to those tables.
For information about this default subscription policy and how to manage it, see the default subscription policy page.
Click a link below to navigate to a tutorial that details how to create that data source:
- Snowflake data sources
- Databricks data sources
- Starburst data sources
- Redshift data sources
- Azure Synapse Analytics data sources
Data Sources With Nested Columns
You can create Databricks data sources with nested columns when you enable complex data types. When complex types are enabled, Databricks data sources can have columns that are arrays, maps, or structs that can be nested. These columns get parsed into a nested data dictionary.
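To make the parsing step concrete, the sketch below shows one way nested struct columns could be expanded into a nested data dictionary. This is an illustrative assumption, not Immuta's actual implementation; the schema format and function name are hypothetical.

```python
# Illustrative sketch (not Immuta's implementation): recursively walk a
# struct-style schema and build a nested data dictionary in which struct
# columns expand into child entries.
def build_data_dictionary(schema):
    """schema: list of {"name": str, "type": str | {"struct": [...]}}."""
    dictionary = {}
    for column in schema:
        col_type = column["type"]
        if isinstance(col_type, dict) and "struct" in col_type:
            # Nested struct: recurse so child fields nest under the parent.
            dictionary[column["name"]] = build_data_dictionary(col_type["struct"])
        else:
            # Scalar, array, or map type names are recorded as-is.
            dictionary[column["name"]] = col_type
    return dictionary

schema = [
    {"name": "id", "type": "bigint"},
    {"name": "address", "type": {"struct": [
        {"name": "city", "type": "string"},
        {"name": "geo", "type": {"struct": [
            {"name": "lat", "type": "double"},
            {"name": "lon", "type": "double"},
        ]}},
    ]}},
]

print(build_data_dictionary(schema))
```

With this shape, `address.geo.lat` is reachable as a nested entry rather than a flattened column name, which mirrors how a nested data dictionary presents struct members under their parent column.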
Data Source Health Checks
When an Immuta data source is created, background jobs use the connection information provided to compute health checks, which vary depending on the type of data source created and how it was configured. These data source health checks include the following:
- blob crawl status: indicates whether the blob was successfully crawled. If this check fails, the overall health status of the data source will be unhealthy.
- column detection status: indicates whether the job run to determine if a column was added or removed from the remote table registered as an Immuta data source was successful.
- external catalog link status: indicates whether the external catalog was successfully linked to the data source. If this check fails, the overall health status of the data source will be unhealthy.
- fingerprint generation status: indicates whether the data source fingerprint was successfully generated.
- framework classification status: indicates whether classification was successfully run on the data source to determine the sensitivity of the data source.
- global policy applied status: indicates whether global policies were successfully applied to the data source.
- high cardinality calculation status: indicates whether the data source's high cardinality column was successfully calculated.
- native SQL sync status (for Snowflake data sources): indicates whether Snowflake governance policies have been successfully synced.
- native SQL view creation status (for Snowflake and Redshift data sources): indicates whether native views were properly created for Redshift and Snowflake tables registered in Immuta.
- row count status: indicates whether the number of rows in the data source was successfully calculated.
- schema detection status: indicates whether the job run to determine if a remote table was added or removed from the schema was successful.
- sensitive data discovery status: indicates whether sensitive data discovery was successfully run on the data source.
After these jobs complete, the health status for each is updated to indicate whether the status check passed, was skipped, is unknown, or failed.
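The roll-up from individual checks to an overall status can be sketched as follows. The aggregation rule here is a hedged assumption for illustration (the check names mirror the list above, but Immuta's actual precedence logic may differ):

```python
# Hypothetical sketch of rolling per-check results up into an overall
# health status. The four per-check states come from the text above; the
# precedence (any failure -> unhealthy) is an assumption, not Immuta's
# documented behavior.
PASSED, SKIPPED, UNKNOWN, FAILED = "passed", "skipped", "unknown", "failed"

def overall_health(checks):
    """checks: mapping of check name -> one of the four per-check states."""
    statuses = set(checks.values())
    if FAILED in statuses:
        return "unhealthy"
    if UNKNOWN in statuses:
        return "unknown"
    return "healthy"

checks = {
    "row count": PASSED,
    "schema detection": PASSED,
    "blob crawl": FAILED,  # a failed crawl marks the data source unhealthy
}
print(overall_health(checks))  # -> unhealthy
```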
These background jobs can be disabled during data source creation by tagging the data source with the prevent statistics tag, which a System Administrator configures on the App Settings page. However, with automatic table statistics disabled, the following policies will be unavailable until the Data Source Owner manually generates the fingerprint:
- Masking with format preserving masking
- Masking with K-Anonymization
- Masking using randomized response
Unhealthy Databricks Data Sources
Unhealthy data sources may fail their row count queries if they run against a cluster that has the Databricks query watchdog enabled.
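If row count queries are failing for this reason, the watchdog can be turned off in the cluster's Spark configuration. The `spark.databricks.queryWatchdog.enabled` setting is a real Databricks option; whether disabling it is appropriate depends on your cluster's workload, so treat this as one possible remediation rather than a required step:

```
spark.databricks.queryWatchdog.enabled false
```

Add this line to the cluster's Spark config (Compute > your cluster > Advanced options > Spark) and restart the cluster for it to take effect.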
Health checks are not run for data sources with more than 1600 columns; these data sources will still appear as healthy. The health check cannot be triggered automatically or manually.
Data Source User Roles
There are various roles users and groups can play in relation to each data source. These roles are managed through the Members tab of the data source and include the following:
- Owners: Those who create and manage new data sources and their users, documentation, Data Dictionaries, and queries. They are also capable of ingesting data into their data sources as well as adding ingest users (if their data source is object-backed).
- Subscribers: Those who have access to the data source data. With the appropriate access permissions and attributes, these users and groups can view files, run SQL queries, and generate analytics against the data source data. All users and groups granted access to a data source (except those with the ingest role) have subscriber status.
- Experts: Those who are knowledgeable about the data source data and can elaborate on it. They are responsible for managing the data source's documentation and the data dictionary.
- Ingest: Those who are responsible for ingesting data for the data source. Ingest users cannot access any data once it is inside Immuta, but they are able to verify whether their data was successfully ingested.
See Manage Data Sources for a tutorial on modifying user roles.
Data Dictionary
The Data Dictionary provides information about the columns within the data source, including column names and value types. Users subscribed to the data source can post and reply to discussion threads by commenting on the Data Dictionary.