Using Parquet’s Bloom Filters
This article explores the use and effectiveness of Bloom filters in Apache Parquet files and measures their impact on written files, particularly for large quantities of high-cardinality data. Results show that moderate Bloom filter parameters (a false positive probability, FPP, of 0.01 and a number of distinct values, NDV, of 1,000) yielded optimal pruning efficiency at a storage cost of 2 KB to 8 KB per column per row group. With Bloom filters, query times dropped to 1/30th of the time taken without them. The chosen FPP should correspond to the amount of pruning expected from the Bloom filter, and an underestimated NDV can save storage space without affecting pruning efficiency. Experiments also demonstrated that DataFusion prunes all non-matching row groups at NDV 1,000, adding only about 2 KB of overhead per row group.
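To make the FPP/NDV trade-off concrete, here is a minimal sizing sketch. It uses the textbook Bloom filter formula, not necessarily the exact calculation any Parquet writer performs, and the power-of-two rounding and 32-byte minimum are assumptions made for illustration. The function names are hypothetical.

```python
import math


def classic_bloom_bits(ndv: int, fpp: float) -> int:
    """Textbook optimal bit count: m = -n * ln(p) / (ln 2)^2."""
    if ndv <= 0 or not (0.0 < fpp < 1.0):
        raise ValueError("ndv must be positive and fpp must be in (0, 1)")
    return math.ceil(-ndv * math.log(fpp) / (math.log(2) ** 2))


def rounded_filter_bytes(ndv: int, fpp: float, min_bytes: int = 32) -> int:
    """Estimated on-disk size in bytes.

    Assumption (not taken from the article): the writer rounds the
    filter up to the next power of two, starting from min_bytes.
    """
    raw_bytes = math.ceil(classic_bloom_bits(ndv, fpp) / 8)
    size = min_bytes
    while size < raw_bytes:
        size *= 2
    return size


if __name__ == "__main__":
    # Compare a few FPP choices at the NDV discussed in the summary.
    for fpp in (0.1, 0.01, 0.001):
        print(f"ndv=1000 fpp={fpp}: {rounded_filter_bytes(1000, fpp)} bytes")
```

Under these assumptions, an NDV of 1,000 with an FPP of 0.01 comes out at 2,048 bytes, which is in line with the roughly 2 KB per row group reported above. A tighter FPP or a larger NDV increases the estimate, which is the storage side of the trade-off the article measures.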
Company
InfluxData
Date published
May 28, 2024
Author(s)
Trevor Hilton
Word count
3188
Language
English
Hacker News points
None found.