Data quality scans review your data assets based on their applied data quality rules and produce a score. Your data stewards can use that score to assess the data health and address any issues that might be lowering the quality of your data.
Prerequisites
- To run and schedule data quality assessment scans, users need the data quality steward role.
- Currently, you can set the Microsoft Purview account to allow public access or managed virtual network access so that data quality scans can run.
Data quality life cycle
Data quality scanning is the eighth step in the data quality life cycle for a data asset. The previous steps are:
- Assign users data quality steward permissions in Microsoft Purview Unified Catalog so they can use all data quality features.
- Register and scan a data source in Microsoft Purview Data Map.
- Add your data asset to a data product.
- Set up a data source connection to prepare your source for data quality assessment.
- Configure and run data profiling for an asset in your data source.
- When profiling is complete, browse the results for each column in the data asset to understand your data's current structure and state.
- Set up data quality rules based on the profiling results, and apply them to your data asset.
Supported multicloud data sources
Browse the supported data source document to view the list of supported data sources, including file formats for data profiling and data quality scanning, with and without virtual network support.
Important
Data quality for Parquet files is designed to support:
- A directory with Parquet part files. For example: ./Sales/{Parquet Part Files}. The fully qualified name must follow https://(storage account).dfs.core.windows.net/(container)/path/path2/{SparkPartitions}. Make sure there are no {n} patterns in the directory or subdirectory structure; the FQN must lead directly to {SparkPartitions}.
- A directory with Parquet files partitioned by columns within the dataset, such as sales data partitioned by year and month. For example: ./Sales/{Year=2018}/{Month=Dec}/{Parquet Part Files}.
Both scenarios present a consistent Parquet dataset schema and are supported. Limitation: arbitrary hierarchies of directories containing Parquet files aren't supported. We recommend presenting your data in one of the two structures above.
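The two supported layouts can be sketched as simple path checks. The following is an illustrative snippet only, assuming simplified patterns: `layout` is a hypothetical helper, not an official Purview validation routine, and the regexes approximate the rules described above.

```python
import re

# Layout 1: a direct FQN ending at the directory that holds the Parquet part
# files, e.g. https://myaccount.dfs.core.windows.net/mycontainer/path/path2
# (no {n} placeholder patterns anywhere in the path).
DIRECT_FQN = re.compile(
    r"^https://[\w-]+\.dfs\.core\.windows\.net/[\w-]+(/[\w-]+)+/?$"
)

# Layout 2: column-partitioned subdirectories, e.g. ./Sales/Year=2018/Month=Dec
PARTITIONED = re.compile(r"^(\./)?[\w-]+(/[\w-]+=[\w-]+)+/?$")

def layout(path: str) -> str:
    """Classify a dataset path against the two supported layouts (illustrative)."""
    if DIRECT_FQN.match(path):
        return "direct-fqn"
    if PARTITIONED.match(path):
        return "column-partitioned"
    return "unsupported"
```

For example, `layout("./Sales/Year=2018/Month=Dec")` classifies as column-partitioned, while a path with extra arbitrary subdirectory levels falls through to unsupported.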
Supported authentication methods
Currently, Microsoft Purview can run data quality scans only by using Managed Identity as the authentication option. Data quality services run on Apache Spark 3.4 and Delta Lake 2.4. For more information about supported regions, see the data quality overview.
Important
- If you update the schema on the data source, rerun the data map scan before running a data quality scan. You can also use the schema import feature from the data quality overview page.
- Schema import isn't supported for data sources running on managed Virtual Network or private endpoint.
- Virtual network isn't supported for Google BigQuery.
Run a data quality scan
1. Configure a data source connection to the assets you're scanning for data quality, if you haven't already done so.
2. In Unified Catalog, select Health Management, then select Data quality.
3. Select a governance ___domain from the list.
4. Select a data product to assess the data quality of the data assets linked to that product.
5. Select the name of a data asset, which takes you to the data quality Overview page.
6. Browse the existing data quality rules and add new rules by selecting Rules. Browse the schema of the data asset by selecting Schema. Toggle the rules you added on or off.
7. Run the quality scan by selecting Run quality scan on the overview page.
While the scan is running, you can track its progress from the data quality monitoring page in the governance ___domain.
Schedule data quality scans
Although you can run data quality scans on an ad-hoc basis by selecting Run quality scan, in production scenarios the source data is likely to be constantly updated. You should regularly monitor data quality to detect any issues. Automating the scanning process helps you manage regular updates of quality scans.
1. In Unified Catalog, select Health Management, then select Data quality.
2. Select a governance ___domain from the list.
3. Select Manage, then select Scheduled scans.
4. Fill out the form on the Create scheduled scan page. Add a name and description for the schedule you're setting up.
5. Select Continue.
6. On the Scope tab, select individual data products and assets, or all data products and data assets in the governance ___domain.
7. Select Continue.
8. Set a schedule based on your preferences and select Continue.
9. On the Review tab, select Save (or Save and run to test immediately) to finish scheduling the data quality assessment scan.
You can monitor scheduled scans on the data quality job monitoring page under the Scans tab.
Note
You can't add more than 30 assets across all data products in a single schedule. To scan more assets, create multiple schedules of up to 30 assets each. Multiple schedules can run in the same time window.
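The 30-asset limit means a large asset list needs to be split across several schedules. A minimal sketch of that batching, assuming a hypothetical `batch_assets` helper (this isn't a Purview API):

```python
# Illustrative only: chunk a list of asset IDs into batches that respect the
# 30-assets-per-schedule limit, so each batch becomes its own scheduled scan.
MAX_ASSETS_PER_SCHEDULE = 30

def batch_assets(asset_ids, batch_size=MAX_ASSETS_PER_SCHEDULE):
    """Yield successive batches of at most batch_size asset IDs."""
    for start in range(0, len(asset_ids), batch_size):
        yield asset_ids[start:start + batch_size]

# e.g. 75 assets -> three schedules covering 30, 30, and 15 assets
schedules = list(batch_assets([f"asset-{i}" for i in range(75)]))
```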
Delete previous data quality scans and history
When you remove a data asset that has a data quality score from a data product, delete the data quality score first, then remove the asset from the data product.
Deleting data quality history removes the profiling history, the data quality scan history, and the data quality rules. Data quality actions aren't deleted.
Follow the steps below to delete previous data quality scans of a data asset:
- In Unified Catalog, select Health Management, then select Data quality.
- Select a governance ___domain from the list.
- Select the data product from the list.
- Select the data asset from the list to navigate to the Data quality overview page.
- Select the ellipsis (...) at the top right of the Data quality overview page.
- Select Delete data quality data to delete the history of data quality runs.
Note
- Use Delete data quality data for test runs, errored data quality runs, or if you're removing a data asset from a data product.
- The system stores up to 50 snapshots of data quality profiling and data quality assessment history. If you want to delete a specific snapshot, select the desired history run and select the delete icon.
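The 50-snapshot retention behavior above can be pictured as a fixed-size buffer that drops the oldest entry when full. This is an illustrative analogy, not Purview's implementation:

```python
from collections import deque

# Illustrative sketch: keep only the most recent 50 history snapshots,
# discarding the oldest automatically as new scan runs arrive.
MAX_SNAPSHOTS = 50
history = deque(maxlen=MAX_SNAPSHOTS)

for run_id in range(60):  # simulate 60 scan runs
    history.append(f"snapshot-{run_id}")

# Only the latest 50 remain: snapshot-10 through snapshot-59
```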
Schema import
If the data type in a schema is undefined, incorrectly defined, or changed in the source, your data quality job might fail. If it fails, reimport the schema by using the schema import capability. Schema import is supported for data sources on both public networks and behind private endpoints. The supported data sources are listed at Data sources and file formats supported for data quality. To import a schema from your data sources, follow these steps:
- Select Data quality from Health Management.
- Select a governance ___domain, then select a data product, then select a data asset from that data product. You arrive at the data quality overview page.
- Select Schema, then select the Schema management toggle.
- Select Import schema to import the schema.
Related content
- Data quality for Fabric data estate
- Data quality for Fabric Mirrored data sources
- Data quality for Fabric shortcut data sources
- Data quality for Azure Synapse serverless and data warehouses
- Data quality for Azure Databricks Unity Catalog
- Data quality for Snowflake data sources
- Data quality for Google BigQuery
Next steps
- Monitor data quality scan
- Review your scan results to evaluate your data product's current data quality.
- Configure alerts for data quality scan results