A Bloom filter index is a space-efficient data structure that enables data skipping on chosen columns. It is especially effective for columns with many distinct values, such as free-form text or IDs.
Limitations of Bloom filter indexes
Important
Azure Databricks doesn't recommend using Bloom filter indexes for most workloads.
Although Bloom filter indexes support data skipping on columns with high cardinality of unique values, they often introduce issues such as the following:
- They only index files written after the index is created. The indexes don't apply to existing files.
- They're difficult to tune and often add overhead without significant performance benefit.
- They work best with highly selective equality filters on a single column. They provide little to no benefit for broader filters.
- Bloom filter indexes only work on explicitly defined columns. If your queries start filtering on columns that don't overlap with the Bloom filter index columns, you must manually drop and recreate the index, adding operational overhead. You also need to rebuild the new index from scratch with the new data.
While Bloom filter indexes aim to skip file reads, Azure Databricks must still scan the index file for every data file. Z-ordering and liquid clustering are more effective because they let the engine avoid considering most files.
Recommended alternatives
Azure Databricks recommends using the following features instead of Bloom filter indexes:
- Predictive I/O - On Photon-enabled compute with Databricks Runtime 12.2 and above, predictive I/O performs file skipping on all columns using advanced heuristics. It fully subsumes Bloom filter indexes; when Photon is enabled, maintaining Bloom filter indexes only adds write overhead.
- Liquid clustering - In Databricks Runtime 13.3 and above, liquid clustering improves data skipping by organizing data based on frequently filtered columns. It works at a coarser granularity than Bloom filter indexes and allows Azure Databricks to skip entire groups of files during query planning.
Important
Avoid combining Bloom filter indexes with predictive I/O. When a table has Bloom filter indexes, they take precedence and predictive I/O is disabled, so you lose its faster and more reliable performance.
How Bloom filter indexes work
Azure Databricks Bloom filter indexes consist of a data skipping index for each data file. The Bloom filter index can be used to determine that a column value is definitively not in the file, or that it is probably in the file. Before reading a file Azure Databricks checks the index file, and the file is read only if the index indicates that the file might match a data filter.
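The probabilistic check described above can be sketched in a few lines of Python. This is a hypothetical, simplified illustration of how a Bloom filter answers membership queries, not the actual index format Azure Databricks writes:

```python
import hashlib

class BloomFilter:
    """Simplified Bloom filter: answers 'definitely not present' or 'probably present'."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, value):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = True

    def might_contain(self, value):
        # False -> value is definitely absent (safe to skip the file).
        # True  -> value is probably present (the file must be read).
        return all(self.bits[pos] for pos in self._positions(value))

index = BloomFilter()
index.add("user_123")
print(index.might_contain("user_123"))  # True: no false negatives
print(index.might_contain("user_999"))  # almost certainly False; rare false positives are possible
```

Because the filter never reports a false negative, a file can be skipped whenever the check returns `False`; a `True` result only means the file might contain the value and must be read.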
Bloom filter indexes support columns with the following input data types: byte, short, int, long, float, double, date, timestamp, and string. Nulls are not added to the Bloom filter index, so any null-related filter requires reading the data file. Azure Databricks supports the following data source filters: and, or, in, equals, and equalsnullsafe. Bloom filter indexes are not supported on nested columns.
Configuration and reference
Use the following syntax to enable a Bloom filter:
CREATE BLOOMFILTER INDEX
ON TABLE table_name
FOR COLUMNS(column_name OPTIONS (fpp=0.1, numItems=5000))
For syntax details, see CREATE BLOOM FILTER INDEX and DROP BLOOM FILTER INDEX.
To disable Bloom filter operations, set the session-level spark.databricks.io.skipping.bloomFilter.enabled configuration to false.
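For example, in a SQL session this configuration can be set with a standard SET statement (shown here as a sketch; the configuration key is taken from the text above):

```sql
SET spark.databricks.io.skipping.bloomFilter.enabled = false;
```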
Display the list of Bloom filter indexes
To display the list of indexes, run:
spark.table("<table-with-indexes>").schema.foreach(field => println(s"${field.name}: metadata=${field.metadata}"))
For example: