Bloom filter indexes

A Bloom filter index is a space-efficient data structure that enables data skipping on chosen columns. It is especially effective for columns with many distinct values, such as free-form text or IDs.

Limitations of Bloom filter indexes

Important

Azure Databricks doesn't recommend using Bloom filter indexes for most workloads.

Although Bloom filter indexes support data skipping on columns with high cardinality of unique values, they often introduce issues such as the following:

  • They only index files written after the index is created. The indexes don't apply to existing files.
  • They're difficult to tune and often add overhead without significant performance benefit.
  • They work best with highly selective equality filters on a single column. They provide little to no benefit for broader filters.
  • Bloom filter indexes only work on explicitly defined columns. If your queries start filtering on columns that don't overlap with the Bloom filter index columns, you must manually drop and recreate the index, adding operational overhead. You also need to rebuild the new index from scratch with the new data.

While Bloom filter indexes aim to skip file reads, Azure Databricks must still scan the index file for every data file. Z-ordering and liquid clustering are more effective because they let the engine avoid considering most files.
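
To make the selective-equality point above concrete, the following sketch assumes a hypothetical events table with a Bloom filter index on its device_id column. The point lookup can use the index to rule out files; the broader range filter cannot.

-- A highly selective equality filter on the indexed column can benefit from the index.
SELECT * FROM events WHERE device_id = 'a1b2-c3d4';

-- Broader predicates, such as ranges on other columns, get little to no benefit.
SELECT * FROM events WHERE event_time >= '2024-01-01';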

Azure Databricks recommends using the following features instead of Bloom filter indexes:

  • Predictive I/O - On Photon-enabled compute with Databricks Runtime 12.2 and above, predictive I/O performs file skipping on all columns using advanced heuristics. It fully subsumes Bloom filter indexes, which add only write overhead when Photon is enabled.
  • Liquid clustering - In Databricks Runtime 13.3 and above, liquid clustering improves data skipping by organizing data based on frequently filtered columns. It works at a coarser granularity than Bloom filter indexes and allows Azure Databricks to skip entire groups of files during query planning (see the sketch after this list).
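
A minimal sketch of the liquid clustering alternative, assuming a hypothetical events table that is most often filtered on device_id (requires Databricks Runtime 13.3 and above):

-- Hypothetical example: cluster the table on the commonly filtered column
-- instead of maintaining a Bloom filter index on it.
CREATE TABLE events (
  device_id STRING,
  event_time TIMESTAMP,
  payload STRING
)
CLUSTER BY (device_id);

-- An existing Delta table can also be switched to liquid clustering.
ALTER TABLE events CLUSTER BY (device_id);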

Important

Avoid combining Bloom filter indexes with predictive I/O. When both are enabled, the Bloom filter indexes take precedence and predictive I/O is disabled, even though predictive I/O delivers faster and more reliable performance.

How Bloom filter indexes work

Azure Databricks Bloom filter indexes consist of a data skipping index for each data file. The Bloom filter index can be used to determine that a column value is definitively not in the file, or that it is probably in the file. Before reading a file, Azure Databricks checks the index file and reads the data file only if the index indicates that the file might match a data filter.

Bloom filter indexes support columns with the following input data types: byte, short, int, long, float, double, date, timestamp, and string. Nulls are not added to the Bloom filter index, so any null-related filter requires reading the data file. Azure Databricks supports the following data source filters: and, or, in, equals, and equalsnullsafe. Bloom filter indexes are not supported on nested columns.
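
As a brief sketch of how these rules apply, again assuming a hypothetical events table with a Bloom filter index on device_id:

-- Equality and IN filters on the indexed column can be served by the index.
SELECT * FROM events WHERE device_id = 'a1b2-c3d4';
SELECT * FROM events WHERE device_id IN ('a1b2-c3d4', 'e5f6-a7b8');

-- Nulls are not stored in the index, so this filter still reads the data files.
SELECT * FROM events WHERE device_id IS NULL;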

Configuration and reference

Use the following syntax to enable a Bloom filter:

CREATE BLOOMFILTER INDEX
ON TABLE table_name
FOR COLUMNS(column_name OPTIONS (fpp=0.1, numItems=5000))
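
Here fpp sets the desired false positive probability and numItems the expected number of distinct values per data file; together they determine how many bits each per-file filter uses. More than one column can be indexed, each with its own options, as in this sketch with hypothetical column names:

-- Hypothetical example: index two columns; session_id uses the default options.
CREATE BLOOMFILTER INDEX
ON TABLE table_name
FOR COLUMNS(device_id OPTIONS (fpp=0.1, numItems=1000000), session_id)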

For syntax details, see CREATE BLOOM FILTER INDEX and DROP BLOOM FILTER INDEX.
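
If the indexed columns no longer match your query filters, the index can be dropped. A minimal sketch, assuming an index exists on a device_id column:

DROP BLOOMFILTER INDEX
ON TABLE table_name
FOR COLUMNS(device_id)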

To disable Bloom filter operations, set the session-level spark.databricks.io.skipping.bloomFilter.enabled configuration to false.
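
For example, to turn off Bloom filter skipping for the current SQL session:

SET spark.databricks.io.skipping.bloomFilter.enabled = false;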

Display the list of Bloom filter indexes

To display the list of indexes, run:

spark.table("<table-with-indexes>").schema.foreach(field => println(s"${field.name}: metadata=${field.metadata}"))

For example:

[Screenshot: Show indexes]