Set up data quality for Azure Databricks Unity Catalog

To use Unity Catalog, your Azure Databricks workspace must be enabled for Unity Catalog, which means that the workspace is attached to a Unity Catalog metastore. All new workspaces are enabled for Unity Catalog automatically upon creation, but older workspaces might require that an account admin enable Unity Catalog manually. Whether or not your workspace was enabled for Unity Catalog automatically, the following steps are also required to get started with Unity Catalog:

  • Create catalogs and schemas to contain database objects like tables and volumes.
  • Create managed storage locations to store the managed tables and volumes in these catalogs and schemas.
  • Grant user access to catalogs, schemas, and database objects.
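The first and third steps above can also be expressed as SQL statements. The following is a minimal sketch, not an official procedure: the helper names `quote_ident` and `bootstrap_statements`, and the example catalog, schema, and group names, are hypothetical. It only builds the statement text, omits managed storage locations, and assumes read-only access is what you want to grant.

```python
def quote_ident(name: str) -> str:
    """Wrap an identifier in backticks, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def bootstrap_statements(catalog: str, schema: str, principal: str) -> list[str]:
    """Build SQL that creates a catalog and schema and grants read access.

    The statements are returned as strings; running them (for example in a
    Databricks SQL editor) is left to the reader.
    """
    cat = quote_ident(catalog)
    sch = f"{cat}.{quote_ident(schema)}"
    who = quote_ident(principal)
    return [
        f"CREATE CATALOG IF NOT EXISTS {cat}",
        f"CREATE SCHEMA IF NOT EXISTS {sch}",
        # Reading a table requires USE CATALOG and USE SCHEMA plus SELECT.
        f"GRANT USE CATALOG ON CATALOG {cat} TO {who}",
        f"GRANT USE SCHEMA ON SCHEMA {sch} TO {who}",
        f"GRANT SELECT ON SCHEMA {sch} TO {who}",
    ]


if __name__ == "__main__":
    # Hypothetical names, used only for illustration.
    for stmt in bootstrap_statements("sales", "curated", "data-stewards"):
        print(stmt + ";")
```

Granting `SELECT` at the schema level applies to the tables and views in that schema; narrow the grant to individual tables if that is too broad for your environment.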

Workspaces that are automatically enabled for Unity Catalog provision a workspace catalog with broad privileges granted to all workspace users. This catalog is a convenient starting point for trying out Unity Catalog.

For detailed setup instructions, see Set up and manage Unity Catalog.

When you're scanning Azure Databricks Unity Catalog, Microsoft Purview supports:

  • Metastore
  • Catalogs
  • Schemas
  • Tables including the columns
  • Views including the columns

When setting up a scan, you can choose to scan the entire Unity Catalog or scope the scan to a subset of catalogs.

Configure Data Map scan to catalog Databricks Unity Catalog data in Microsoft Purview

  • Register an Azure Databricks workspace in Microsoft Purview
  • Scan registered Azure Databricks workspace
    • Enter a name for the scan
    • Select Unity Catalog as the extraction method
    • Connect via an integration runtime (Azure integration runtime, Managed VNet IR, or a Kubernetes-supported self-hosted integration runtime you created)
    • Select Access Token Authentication while creating a credential. For more information, see Credentials for source authentication in Microsoft Purview.
    • Specify the HTTP path of the Databricks SQL warehouse that Microsoft Purview connects to when it runs the scan
    • On the Scope your scan page, select the catalogs you want to scan.
    • Select a scan rule set for classification. You can choose between the system default, existing custom rule sets, or create a new rule set inline. Check the Classification article to learn more.
    • For Scan trigger, choose whether to set up a schedule or run the scan once.
    • Review your scan and select Save and Run.
  • View your scans and scan runs to confirm that your data has been cataloged.
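A mistyped HTTP path is an easy mistake to make in the scan setup above. The following is a small sanity check you could run before pasting the value. It is a sketch under an assumption: the pattern reflects the commonly seen `/sql/1.0/warehouses/<id>` shape (and the older `/sql/1.0/endpoints/<id>` form), so verify it against the connection details shown for your own SQL warehouse. The function name is hypothetical.

```python
import re

# Assumed shape of a SQL warehouse HTTP path. Confirm against the value shown
# in your warehouse's connection details before relying on this pattern.
_HTTP_PATH = re.compile(r"/sql/1\.0/(?:warehouses|endpoints)/[0-9A-Za-z]+")


def is_sql_warehouse_http_path(path: str) -> bool:
    """Return True if the path looks like a SQL warehouse HTTP path."""
    return _HTTP_PATH.fullmatch(path.strip()) is not None


if __name__ == "__main__":
    # Hypothetical warehouse ID, used only for illustration.
    print(is_sql_warehouse_http_path("/sql/1.0/warehouses/abc123def456"))
```

A path that fails this check is not necessarily wrong, but it is worth comparing with the value in the Databricks workspace before saving the scan.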

Once scanned, the data assets in Unity Catalog (UC) are available in Microsoft Purview Unified Catalog search. For more details, see how to connect and manage Azure Databricks Unity Catalog in Microsoft Purview.

Important

  • Select Access Token Authentication while creating a credential.
  • Store the access token as a secret in your Azure Key Vault, and connect the key vault to the connection manager.
  • Make sure to grant the product (service) MSI read access to secrets in the Key Vault.
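When you store the access token in Key Vault, the secret is addressed by a vault name, a secret name, and optionally a version. The sketch below splits a secret identifier URL into those parts. It assumes the Azure public cloud form `https://<vault>.vault.azure.net/secrets/<name>/<version>`; the `SecretRef` type and `parse_secret_id` function are hypothetical helpers, not part of any Azure SDK.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlparse

_VAULT_SUFFIX = ".vault.azure.net"  # assumption: Azure public cloud


class SecretRef(NamedTuple):
    vault_name: str
    secret_name: str
    secret_version: Optional[str]


def parse_secret_id(secret_id: str) -> SecretRef:
    """Split a Key Vault secret identifier into vault, name, and version."""
    parsed = urlparse(secret_id)
    host = parsed.hostname or ""
    vault_name = host[: -len(_VAULT_SUFFIX)] if host.endswith(_VAULT_SUFFIX) else ""
    if parsed.scheme != "https" or not vault_name:
        raise ValueError(f"not a Key Vault URL: {secret_id!r}")
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) not in (2, 3) or parts[0] != "secrets":
        raise ValueError(f"not a secret identifier: {secret_id!r}")
    version = parts[2] if len(parts) == 3 else None
    return SecretRef(vault_name, parts[1], version)


if __name__ == "__main__":
    # Hypothetical identifier, used only for illustration.
    print(parse_secret_id(
        "https://my-vault.vault.azure.net/secrets/databricks-pat/0123456789abcdef"
    ))
```

An identifier without a version segment refers to the latest version of the secret, which is why `secret_version` is optional here.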

Set up connection to Databricks UC for data quality scan

At this point, the scanned asset is ready for cataloging and governance. Associate the scanned asset with a data product in a governance domain. Then, on the Data quality tab, add a new Azure Databricks connection and enter the database name manually.

  1. In the Microsoft Purview portal, open Unified Catalog.

  2. Under Health management, select Data quality.

  3. Select a governance domain from the list, then select Connections from the Manage dropdown list.

  4. Configure connection on the Connections page:

    • Add connection name and description.
    • Select source type Azure Databricks.
    • Select Azure subscription.
    • Select workspace URL.
    • Add Databricks metastore ID.
    • Select Unity Catalog as extraction method.
    • Select HTTP path.
    • Select Unity Catalog name.
    • Select schema name.
    • Select table name.
    • Select authentication method - Access Token
      • Add Azure subscription
      • Key vault connection
      • Secret name
      • Secret version
    • Select the Enable managed V-Net checkbox if your Azure Databricks workspace is running in a VNet.
    • Region is selected automatically.
    • Create a new managed VNet if one hasn't been created yet.
  5. Test connection. If your Databricks storage is in a VNet, you won't be able to test the connection.
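Before opening the Connections page, it can help to collect the values the form asks for and check that none are missing. The following is a minimal sketch of such a checklist. The `DatabricksConnectionSettings` class and its field names are hypothetical and only mirror the steps above; they are not a Microsoft Purview API.

```python
from dataclasses import dataclass, fields


@dataclass
class DatabricksConnectionSettings:
    """Hypothetical checklist of the values entered on the Connections page."""
    connection_name: str = ""
    azure_subscription: str = ""
    workspace_url: str = ""
    metastore_id: str = ""
    http_path: str = ""
    catalog_name: str = ""
    schema_name: str = ""
    table_name: str = ""
    key_vault_connection: str = ""
    secret_name: str = ""
    secret_version: str = ""
    managed_vnet: bool = False  # the Enable managed V-Net checkbox

    def missing(self) -> list[str]:
        """Return the names of text fields that are still empty."""
        return [
            f.name
            for f in fields(self)
            if isinstance(getattr(self, f.name), str)
            and not getattr(self, f.name).strip()
        ]


if __name__ == "__main__":
    # Hypothetical values, used only for illustration.
    settings = DatabricksConnectionSettings(
        connection_name="dq-databricks-uc",
        http_path="/sql/1.0/warehouses/abc123def456",
    )
    print("Still needed:", ", ".join(settings.missing()))
```

The checkbox value is excluded from the check because leaving it cleared is a valid choice.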

Screenshot that shows how to set up databricks UC connection.

Screenshot that shows how to configure databricks connection token.

Important

  • Data quality stewards need read only access to Azure Databricks Unity Catalog to set up a data quality connection.
  • If public access is disabled, you need to select the Allow trusted Microsoft services checkbox for Key Vault. This is required only for Key Vault, not for your Azure Databricks workspace.
  • VNet support is currently in preview and available globally. It's temporarily included in the Data Governance SKUs to maintain flexibility during this phase. VNet pricing isn't yet available and might be announced before the feature's general availability.

Profiling and Data Quality scanning for data in Azure Databricks Unity Catalog databases

After completing the connection setup successfully, you can profile your data, create and apply rules, and run a data quality scan of your data in Azure Databricks Unity Catalog databases. Follow the step-by-step guidance in these resources:

Reference documents