Using Parquet’s Bloom Filters

What's this blog post about?

This article explores the use and effectiveness of Bloom filters in Apache Parquet files. It measures the impact of Bloom filters on written Parquet files, particularly for large quantities of high-cardinality data. The results show that moderate Bloom filter parameters (a false positive probability, FPP, of 0.01 and a number of distinct values, NDV, of 1,000) yielded optimal pruning efficiency at a storage cost of 2 KB to 8 KB per column per row group. With Bloom filters, query times dropped to 1/30th of the time taken without them. The chosen FPP should correspond to the amount of pruning expected from the Bloom filter, and an underestimated NDV can save storage space without affecting pruning efficiency. The experiments also demonstrated that DataFusion successfully prunes all non-matching row groups at an NDV of 1,000, adding only ~2 KB of overhead per row group.
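To see how the FPP and NDV settings translate into storage cost, the following minimal sketch estimates the size of a split-block Bloom filter from those two parameters. It is not taken from the article: the function name `sbbf_size_bytes`, the sizing formula, the 32-byte and 128 MiB bounds, and the power-of-two rounding are assumptions modeled on how some Parquet writers size their filters, and a given writer may differ.

```python
import math

# Assumed bounds on the filter size; not stated in the article.
MIN_BYTES = 32
MAX_BYTES = 128 * 1024 * 1024


def sbbf_size_bytes(ndv: int, fpp: float) -> int:
    """Estimate the Bloom filter size in bytes for one column chunk.

    ndv: expected number of distinct values in the row group.
    fpp: target false positive probability, strictly between 0 and 1.
    """
    if ndv <= 0:
        raise ValueError("ndv must be positive")
    if not 0.0 < fpp < 1.0:
        raise ValueError("fpp must be strictly between 0 and 1")

    # Assumed sizing rule for a split-block filter that sets 8 bits per value.
    num_bits = -8.0 * ndv / math.log(1.0 - fpp ** (1.0 / 8.0))
    num_bytes = int(num_bits) // 8

    # Clamp to the assumed bounds, then round up to the next power of two.
    num_bytes = min(max(num_bytes, MIN_BYTES), MAX_BYTES)
    return 1 << (num_bytes - 1).bit_length()


if __name__ == "__main__":
    # FPP 0.01 and NDV 1,000 are the moderate settings from the summary.
    print(sbbf_size_bytes(1_000, 0.01))  # 2048
```

Under these assumptions, an FPP of 0.01 with an NDV of 1,000 gives 2,048 bytes, which lines up with the ~2 KB per row group reported above. Because the size is rounded to a power of two, small changes to either parameter often leave the filter size unchanged.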

Company
InfluxData

Date published
May 28, 2024

Author(s)
Trevor Hilton

Word count
3188

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.