Schema Monitoring
Schema monitoring allows organizations to keep Immuta in sync with their data environments. When it is enabled, Immuta monitors the organization's servers to detect when tables or columns are created or deleted and automatically registers (or disables) those tables in Immuta. Newly registered data sources then have any global policies and tags set in Immuta applied to them, and the Immuta data dictionary is updated with any column changes. This automated process helps organizations stay compliant without the need to manually keep data sources up to date.
Schema monitoring is enabled while creating or editing a data source, and it only registers new tables and columns within known schemas; it does not register new schemas. After schema monitoring has been enabled, data owners or governors can edit the naming convention for newly detected data sources and the schema detection owner from the schema project page.
See the Register a data source guides for instructions on enabling schema monitoring or Manage schema monitoring for instructions on editing the schema monitoring settings.
Column detection
Column detection is part of schema monitoring, but it can also be enabled on its own to detect column changes for a select group of tables. Column detection monitors tables for added or removed columns and changed column types, and it updates the appropriate Immuta data source's data dictionary with those changes.
See one of the Register a data source guides for instructions on enabling column detection.
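Conceptually, column detection boils down to diffing the columns Immuta already knows about against the columns the remote table currently has. The sketch below is a minimal illustration of that diff, not Immuta's implementation; all names and structures are hypothetical.

```python
# Illustrative sketch of the diffing that column detection performs.
# This is NOT Immuta's implementation; names and structures are hypothetical.

def diff_columns(known: dict[str, str], remote: dict[str, str]) -> dict:
    """Compare a data source's known columns against the remote table.

    Both arguments map column name -> column type.
    """
    added = {name: dtype for name, dtype in remote.items() if name not in known}
    removed = [name for name in known if name not in remote]
    retyped = {
        name: (known[name], remote[name])
        for name in known.keys() & remote.keys()
        if known[name] != remote[name]
    }
    return {"added": added, "removed": removed, "retyped": retyped}

# Example: a new column appears and an existing column changes type.
known = {"id": "NUMBER", "email": "VARCHAR"}
remote = {"id": "VARCHAR", "email": "VARCHAR", "ssn": "VARCHAR"}
print(diff_columns(known, remote))
# {'added': {'ssn': 'VARCHAR'}, 'removed': [], 'retyped': {'id': ('NUMBER', 'VARCHAR')}}
```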
Tracking new data sources and columns
When new data sources and columns are detected and added to Immuta, they are automatically tagged with the "New" tag. This allows governors to use the seeded "New Column Added" global policy to mask the data sources and columns, since they could contain sensitive data. Data owners can then review and approve these changes from the requests tab of their profile page. Approving column changes removes the "New" tags from the data source.
The "New Column Added" global policy is staged (inactive) by default. See the Clone, activate, or stage a global policy guide to activate this seeded global policy if you want new columns to be automatically masked.
Workflow
- Immuta user registers a data source with schema monitoring enabled.
- Every 24 hours, at 12:30 a.m. UTC by default, Immuta checks the servers for any changes to tables and columns.
- If Immuta finds a change, it updates the appropriate Immuta data source or column (see the sketch after this list):
  - If Immuta finds a new table, then Immuta creates an Immuta data source for that table and tags it "New".
  - If Immuta finds a table has been deleted, then Immuta disables that table's data source.
  - If Immuta finds a previously deleted table has been re-created, then Immuta restores that table's data source and tags it "New".
  - If Immuta finds a new column within a table, then Immuta adds that column to the data dictionary and tags it "New".
  - If Immuta finds a column has been deleted, then Immuta deletes that column from the data dictionary.
  - If Immuta finds a column type has changed, then Immuta updates the column type in the data dictionary.
- Data sources and columns tagged "New" are masked by the seeded "New Column Added" global policy until a governor or data owner approves the changes.
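The table-level half of this workflow can be thought of as a reconciliation between Immuta's registered data sources and the tables that actually exist on the server. The sketch below is purely illustrative; the names and structures are hypothetical, not Immuta's internals.

```python
# Illustrative table-level reconciliation; NOT Immuta's implementation.
def reconcile(immuta_sources: dict[str, bool], remote_tables: set[str]) -> None:
    """immuta_sources maps table name -> whether its data source is enabled."""
    for table in remote_tables:
        if table not in immuta_sources:
            print(f"new table: create a data source for {table} and tag it New")
        elif not immuta_sources[table]:
            print(f"re-created table: restore the data source for {table} and tag it New")
    for table, enabled in immuta_sources.items():
        if enabled and table not in remote_tables:
            print(f"deleted table: disable the data source for {table}")

# Example: one new table, one deleted table, one restored table.
reconcile(
    immuta_sources={"orders": True, "legacy": True, "archive": False},
    remote_tables={"orders", "customers", "archive"},
)
```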
To run schema monitoring or column detection manually, see the Run schema monitoring and column detection jobs page.
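As a rough illustration, a manual run might be triggered from a script over Immuta's REST API. The endpoint, payload, and auth header below are assumptions made for illustration only; consult the Run schema monitoring and column detection jobs page for the actual API.

```python
# Hypothetical sketch of triggering a schema monitoring job over Immuta's API.
# The endpoint, payload, and auth header are assumptions for illustration;
# see the "Run schema monitoring and column detection jobs" page for the real API.
import requests

IMMUTA_URL = "https://your-immuta.example.com"  # assumed tenant URL
API_TOKEN = "..."                               # assumed API token

resp = requests.post(
    f"{IMMUTA_URL}/dataSource/detectRemoteChanges",
    headers={"Authorization": API_TOKEN, "Content-Type": "application/json"},
    # Scoping the run to one database keeps the job small (see Best practices).
    json={"hostname": "org.snowflakecomputing.com", "database": "ANALYTICS"},
)
resp.raise_for_status()
print("Schema monitoring job queued:", resp.json())
```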
Performance
By default, schema monitoring runs every 24 hours at 12:30 a.m. UTC. Some organizations may need to schedule it to run more often; however, running it more frequently can negatively impact performance. Native schema monitoring for Snowflake offers better performance and should be used whenever possible.
Schema monitoring for Databricks
In most cases, Immuta's schema monitoring job runs automatically from the Immuta web service. For Databricks, that automatic job is disabled because of the ephemeral nature of Databricks clusters. Instead, Immuta requires users to download a schema detection job template (a Python script) and import it into their Databricks workspace. See the Register a Databricks data source guide for details.
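As one possible pattern (not a step from Immuta's guide), the imported script could be scheduled as a recurring Databricks job. The notebook path, cluster ID, and schedule below are placeholder assumptions; follow the Register a Databricks data source guide for the supported setup.

```python
# Hypothetical: schedule the imported schema detection notebook as a recurring
# Databricks job via the Databricks Python SDK. All paths, IDs, and the schedule
# are placeholders, not values from Immuta's guide.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # reads host/token from the environment

w.jobs.create(
    name="immuta-schema-detection",
    # 00:30 UTC daily, mirroring Immuta's default monitoring time.
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 30 0 * * ?", timezone_id="UTC"
    ),
    tasks=[
        jobs.Task(
            task_key="detect",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Shared/immuta_schema_detection"  # assumed import path
            ),
            existing_cluster_id="<cluster-id>",  # placeholder
        )
    ],
)
```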
Native schema monitoring for Snowflake
Immuta can monitor your data environment, detect when new tables or columns are created or deleted in Snowflake, and automatically register (or disable) those tables in Immuta for you. Those newly updated data sources will then have any global policies and tags that you have set up applied to them. The Immuta data dictionary will be updated with any new columns, and your Immuta environment will be in sync with your Snowflake tables. This automated process helps with scaling and keeping your organization compliant without the need to manually keep your data sources up to date.
Architecture
Once enabled on a data source, Immuta queries Snowflake every 24 hours by default to find when each table within the registered schema was last altered. If that timestamp is after the last time native schema monitoring ran, Immuta updates the tables or columns that have been altered. This process works well when monitoring a large number of data sources because it only updates recently altered tables, cutting down the Snowflake compute required to run column detection (which specifically updates the columns of registered data sources).
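To make the mechanism concrete, the check is conceptually similar to the query below, which asks Snowflake's INFORMATION_SCHEMA for tables altered since the last run. This is an illustrative sketch using the Snowflake Python connector, not Immuta's internal code; the connection details and last-run timestamp are placeholders.

```python
# Illustrative sketch of the check native schema monitoring performs:
# ask Snowflake which tables changed since the last monitoring run.
# Connection parameters are placeholders; this is not Immuta's internal code.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="...",
    database="ANALYTICS",
)

LAST_RUN = "2024-01-01 00:30:00"  # timestamp of the previous run (assumed)

with conn.cursor() as cur:
    # INFORMATION_SCHEMA.TABLES exposes LAST_ALTERED for every table, so
    # only recently changed tables need a full column refresh.
    cur.execute(
        """
        SELECT table_schema, table_name, last_altered
        FROM information_schema.tables
        WHERE last_altered > %(last_run)s::timestamp
        """,
        {"last_run": LAST_RUN},
    )
    for schema, table, altered in cur:
        print(f"{schema}.{table} changed at {altered}; refresh its columns")
```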
If you have an Immuta environment with data sources other than Snowflake, the legacy schema monitoring feature will run on all non-Snowflake data sources. The native schema monitoring feature only works with Snowflake integrations and Snowflake data sources.
Automatic workflow
- Immuta user creates a data source with schema monitoring enabled.
- Every 24 hours, at 12:30 a.m. UTC by default, Immuta queries the Snowflake information_schema view to find when each data source's table was last altered. To adjust these settings, reach out to your Immuta representative.
- If the table was altered after the last time native schema detection ran, Immuta updates the data source, columns, and data dictionary.
- Immuta tags new data sources and columns with the tag “New” so that you can use the templated "New Column Added" global policy to mask all new data until it has been reviewed.
Limitations
- This feature only works with Snowflake data sources. Any non-Snowflake data sources will run with the legacy schema monitoring described above.
- You will not see performance improvements if your organization consistently changes all of its tables. This feature is intended to improve performance for organizations with a large number of tables and comparatively few changes between monitoring runs.
Migration
There is no migration required for this feature. Native schema monitoring will run on all Snowflake data sources with legacy schema monitoring previously enabled and will run on all new Snowflake data sources with schema monitoring enabled.
Configuration
There is no additional configuration required for this feature. You just need to enable schema monitoring when you create your Snowflake data sources.
Schema monitoring best practices
- Manually trigger schema monitoring (filtered down to the database) after your dbt or other transform workflows run (see the sketch after this list). For more information, see the dbt and transform workflow for limited policy downtime guide.
- When manually triggering schema monitoring, specify a table or database for maximum performance efficiency and to reduce data or policy downtime. For more information on triggering schema monitoring, see the Manually run schema monitoring guide.
- If you are manually managing data tags, activate the "New Column Added" global policy to protect newly found and potentially sensitive data. This policy masks all new columns with NULL until a data owner reviews them, protecting your data and avoiding leaks from newly added columns. This recommendation is unnecessary if you are leveraging sensitive data discovery (SDD) or an external data catalog.
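For example, a post-transform step might run dbt and then trigger a schema monitoring run scoped to the affected database. As with the earlier sketch, the Immuta endpoint and payload here are illustrative assumptions, not a documented integration.

```python
# Hypothetical post-transform step: run dbt, then trigger schema monitoring
# scoped to the database the models were built in. The Immuta endpoint and
# payload are the same illustrative assumptions as in the earlier sketch.
import subprocess
import requests

# Run the transform job first so new tables exist before Immuta scans.
subprocess.run(["dbt", "run", "--target", "prod"], check=True)

requests.post(
    "https://your-immuta.example.com/dataSource/detectRemoteChanges",
    headers={"Authorization": "..."},  # assumed auth header
    json={"hostname": "org.snowflakecomputing.com", "database": "ANALYTICS"},
).raise_for_status()
```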