Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
Applies to: ✅ SQL analytics endpoint and Warehouse in Microsoft Fabric
Data clustering is a technique used to organize and store data based on similarity. Data clustering improves query performance and reduces compute and storage access costs for queries by grouping similar records together.
How it works
Data clustering works by storing rows with similar values in adjacent locations on storage during ingestion. Data clustering uses a space-filling curve to organize data in a way that preserves locality across multiple dimensions, meaning rows with similar values across clustering columns are stored physically close together. This approach dramatically improves query performance by performing file skipping and reducing the number of files that are scanned.
Unlike conventional lexicographical ordering, data clustering uses a sophisticated algorithm to ingest, keeping rows with similar column values close together, even when a table is clustered by several columns. This makes data clustering ideal for range queries, high-cardinality filters, and large tables with skewed distributions, resulting in faster reads, reduced I/O, and more efficient resource usage.
Here's a simplified conceptual illustration of data clustering:
In this diagram, a table labeled Source data shows rows mixed and highlighted in different colors to represent clustering groupings on destination. An Ordered table is split into three file segments, each grouping rows by similar colors, demonstrating how clustering organizes data into optimized storage segments based on column values.
Data clustering metadata is embedded in the manifest during ingestion, allowing the warehouse engine to make intelligent decisions about which files to access during user queries. This metadata, combined with how rows with similar values are stored together ensures that queries with filter predicates can skip entire files and row groups that fall outside the predicate scope. For example: if a query targets only 10% of a table's data, clustering ensures that only files that contain the data within the filter's range are scanned, reducing I/O and compute consumption. Larger tables benefit more from data clustering, as the benefits of file skipping scale with data volume.
When to use data clustering
When deciding if data clustering could be beneficial, investigate query patterns and table characteristics in the warehouse. Data clustering is most effective when queries repeatedly filter on specific columns and when the underlying tables are large and contain mid-to-high cardinality data. Some common scenarios include:
- Repeated queries with
WHEREfilters: If the workload includes frequent queries filtering specific columns, data clustering ensures that only relevant files are scanned during read queries. This also applies when the filters are used repeatedly in dashboards, reports, or scheduled jobs and pushed down to the warehouse engine as SQL statements. - Larger tables: data clustering is most effective when applied to large tables where scanning the full dataset is costly. By organizing rows with data clustering, the warehouse engine can skip entire files and row groups that don't match the query filter, which can reduce I/O and compute usage.
- Mid-to-high cardinality columns: columns with higher cardinality (for example: columns that have many distinct values, such as an ID, or a date) benefit more from data clustering because they allow the engine to isolate and colocate similar values. This enables efficient file skipping, especially for selective queries. Columns with low cardinality (for example: gender, region) by nature has its values spread across more files, therefore offering limited opportunities for file-skipping.
- Selective queries with narrow scope: when queries typically target a small subset of data and are combined with a WHERE filter, data clustering ensures that only files that contain the relevant rows are read.
Data clustering happens automatically during data ingestion, regardless of how rows were ingested. No user operations are required after data is ingested to apply data clustering.
CLUSTER BY Syntax
Data clustering is defined during table creation, using the CLUSTER BY clause. The syntax is as follows:
CREATE TABLE (Transact-SQL) syntax:
CREATE TABLE { warehouse_name.schema_name.table_name | schema_name.table_name | table_name } (
[ ,... n ] –- Column list
) WITH (CLUSTER BY [ ,... n ]);
CREATE TABLE AS SELECT (Transact-SQL) syntax:
CREATE TABLE { warehouse_name.schema_name.table_name | schema_name.table_name | table_name } (
) WITH (CLUSTER BY[ ,... n ])
AS <select_statement>;
The CLUSTER BY clause requires at least one column to be specified for data clustering, and a maximum of four columns.
Creating a table that uses data clustering with SELECT INTO isn't supported.
Data type support
The following table summarizes column types that can be used in the CLUSTER BY clause:
| Category | Data type | Data clustering supported |
|---|---|---|
| Exact numerics | bit | No |
| Exact numerics | bigint, int, smallint, decimal2, numeric | Yes |
| Approximate numerics | float, real | Yes |
| Date and time | date, datetime2, time | Yes |
| Character strings1 | char | Yes |
| Character strings1 | varchar | Yes |
| LOB types | varchar(max), varbinary(max) | No |
| Binary strings | varbinary, uniqueidentifer | No |
1 For string types (char/varchar), only the first 32 characters are used when column statistics are produced. As a result, columns with values that contain long prefixes might have limited benefits with data clustering.
2 For decimal types with precision greater than 18, predicates aren't pushdown to storage during query execution. If using decimal types with data clustering, favor columns with smaller precision.
Columns with unsupported data types can still exist in a table that uses data clustering, but can't be used with CLUSTER BY.
Best practices with data clustering
Data clustering is more effective when clustering columns are chosen based on actual query patterns, particularly those with mid-to-high cardinality and when range predicates are used during queries.
Consider the following best practices when using data clustering:
- Data clustering is more effective on large tables.
- Whenever possible, batch ingestion and updates to process a larger number of rows at once, rather than using smaller tasks. For optimal performance, DML operations should have at least 1 million rows to benefit from data clustering. After consecutive inserts, updates, and deletes, data compaction can consolidate rows from smaller files into optimally sized ones.
- Choose columns with mid-to-high cardinality for data clustering, as they yield better results due to their distinct value distribution. Columns with low cardinality might offer limited opportunities for file pruning.
- Select columns based on frequent usage of
WHEREpredicates in dashboards, reports, scheduled jobs, or user queries. Equality join conditions don't benefit from data clustering. For an overview of how to use Query Insights to help pick columns for data clustering based on your current workload, refer to Tutorial: Using Data Clustering in Fabric Data Warehouse. - Don't use data clustering by more columns than strictly necessary. Multi-column clustering adds complexity to storage, adds overhead, and might not offer benefits unless all the columns are used together in queries with predicates.
- The column order used in
CLUSTER BYisn't important and doesn't change how rows are stored. - When creating a table with data clustering using
CREATE TABLE AS SELECT(CTAS) or ingesting data withINSERT INTO ... SELECT, keep the select portion of these statements as simple as possible for optimal data clustering quality.
Data clustering can significantly reduce the costs during queries, if well aligned with query predicates. However, data ingestion incurs more time and capacity units (CU) on a table that uses data clustering when compared with an equivalent table with the same data without data clustering. This happens because the warehouse engine needs to order data during ingestion. Since data that is ingested is read multiple times, data clustering can reduce the overall compute consumption of a given workload.
System views
Data clustering metadata can be queried using sys.index_columns. It shows all columns used in data clustering, including the column ordinal used in the CLUSTER BY clause.
The following query lists all columns used in data clustering on the current warehouse, and their tables:
SELECT
t.name AS table_name,
c.name AS column_name,
ic.data_clustering_ordinal AS clustering_ordinal
FROM sys.tables t
JOIN sys.columns c
ON t.object_id = c.object_id
JOIN sys.index_columns ic
ON c.object_id = ic.object_id
AND c.column_id = ic.column_id
WHERE ic.data_clustering_ordinal > 0
ORDER BY
t.name,
ic.data_clustering_ordinal;
Note
The column ordinal is displayed for reference only as the order used in CLUSTER BY when the table was defined. As discussed in Best practices, the column order doesn't affect performance.
Limitations and Remarks
- Data ingestion performance might degrade when tables contain large varchar columns with highly variable data sizes.
- For example, consider a table with a varchar(200) column: if some rows contain only a few characters while others approach the maximum length, the significant variance in data size can negatively impact ingestion speed.
- This issue is known and will be addressed in an upcoming release.
IDENTITYcolumns can't be used withCLUSTER BY. Tables that contain anIDENTITYcolumn can still be used for data clustering, given that it uses different columns withCLUSTER BY.- Data clustering must be defined at table creation. Converting a regular table into one with
CLUSTER BYisn't supported. Similarly, modifying the clustering columns after a table has been created isn't allowed. If different clustering columns are needed, optionally useCREATE TABLE AS SELECT(CTAS) to create a new table with the desired clustering columns. - In some cases, data clustering might be applied asynchronously. In such cases, data is reorganized with a background task, and the table might not be fully optimized when ingestion finishes. This can happen under the following conditions:
- When using
INSERT INTO ... SELECTorCREATE TABLE AS SELECT (CTAS)and the collation of the source and target tables are different. - When ingesting from external data that is of a compressed CSV format.
- When an ingestion statement has fewer than 1 million rows.
- When using
- Data ingestion on data clustering tables incurs an overhead when compared with a table with the same schema that doesn't use data clustering. This happens due to extra computation needed to optimize storage. When the clustering column has a case insensitive collation, more overhead is also expected.
- Data clustering can benefit query response time, capacity unit (CU) consumption, or both.
Examples
A. Create a clustered table for sales data
This example creates a simple Sales table and uses the CustomerID and SaleDate columns for data clustering.
CREATE TABLE Sales (
SaleID INT,
CustomerID INT,
SaleDate DATE,
Amount DECIMAL(10,2)
) WITH (CLUSTER BY (CustomerID, SaleDate))
B. Create a clustered table using CREATE TABLE AS SELECT
This example uses CREATE TABLE AS SELECT to create a copy of the Sales existing table, with CLUSTER BY the SaleDate column.
CREATE TABLE Sales_CTAS
WITH (CLUSTER BY (SaleDate))
AS SELECT * FROM Sales
C. View the columns used for Data Clustering on a given table
This example lists the columns used for data clustering in the Sales table.
SELECT
c.name AS column_name,
ic.data_clustering_ordinal AS clustering_ordinal
FROM sys.tables t
JOIN sys.columns c
ON t.object_id = c.object_id
JOIN sys.index_columns ic
ON c.object_id = ic.object_id
AND c.column_id = ic.column_id
WHERE
ic.data_clustering_ordinal > 0
AND t.name = 'Sales'
ORDER BY
t.name,
ic.data_clustering_ordinal;
Results:
D. Check the effectiveness of the column choices for data clustering
Query Insights can help evaluate the effect of data clustering on your workload by comparing CPU time and data scanned between a given query and its equivalent run on a clustered copy of the original table. The following example illustrates how to retrieve the allocated CPU time and the volume of data scanned across disk, memory, and remote storage for a specific query.
SELECT
allocated_cpu_time_ms,
data_scanned_disk_mb,
data_scanned_memory_mb,
data_scanned_remote_storage_mb
FROM
queryinsights.exec_requests_history
WHERE
distriubted_statement_id = '<Query_Statement_ID>'
Where <Query_Statement_ID> is the distributed statement ID of the query you want to evaluate.