/plushcap/analysis/cloudflare/when-bloom-filters-dont-bloom

When Bloom filters don't bloom

What's this blog post about?

The author describes deduplicating a large dataset of about 1 billion lines. They initially reached for a Bloom filter because it is a compact probabilistic data structure and the task only required removing duplicates, not sorting the lines. However, once the Bloom filter no longer fit into L3 cache, every lookup turned into several random memory accesses to DRAM, which are very costly, and performance suffered. To address this, the author switched to a simple open-addressing hash table with linear probing, which gave better memory access patterns and improved performance. The takeaways: modern CPUs handle sequential memory access well but struggle with random access; sophisticated data structures like Bloom filters work well only as long as they fit into L3 cache; for datasets larger than that, optimizing for fewer memory loads matters more than minimizing memory usage; and profiling tools are invaluable for finding such bottlenecks.
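To make the memory-access argument concrete, here is a minimal sketch (not the author's actual code) of deduplication using an open-addressing hash table of 64-bit line hashes with linear probing. The hash function, table size, and I/O handling are illustrative assumptions; the point is that probing consecutive slots stays within already-fetched cache lines instead of issuing more random DRAM accesses.

#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Power-of-two table, sized comfortably above the expected number of
 * distinct lines (here ~268M slots, i.e. ~2 GiB of uint64_t). */
#define TABLE_SIZE (1ULL << 28)

/* Placeholder 64-bit hash (FNV-1a); a real tool would use a stronger hash.
 * Treating equal hashes as duplicates is itself probabilistic: two distinct
 * lines with the same 64-bit hash would be collapsed. */
static uint64_t hash_line(const char *line, size_t len) {
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)line[i];
        h *= 1099511628211ULL;
    }
    return h ? h : 1;  /* reserve 0 as the "empty slot" marker */
}

int main(void) {
    uint64_t *table = calloc(TABLE_SIZE, sizeof(uint64_t));
    if (!table) return 1;

    char *line = NULL;
    size_t cap = 0;
    ssize_t len;
    while ((len = getline(&line, &cap, stdin)) > 0) {
        uint64_t h = hash_line(line, (size_t)len);
        size_t i = h & (TABLE_SIZE - 1);
        /* Linear probing: on a collision we step to the adjacent slot,
         * which is usually in the same or the next cache line, rather
         * than jumping to another random location in memory. */
        while (table[i] != 0 && table[i] != h) {
            i = (i + 1) & (TABLE_SIZE - 1);
        }
        if (table[i] == 0) {                     /* hash not seen before */
            table[i] = h;
            fwrite(line, 1, (size_t)len, stdout); /* emit first occurrence */
        }
    }
    free(line);
    free(table);
    return 0;
}

Compiled with something like `cc -O2 dedup.c -o dedup`, it reads lines on stdin and writes unique lines to stdout; the single large flat array is what keeps each lookup close to one memory load, which is the post's central optimization.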

Company
Cloudflare

Date published
March 2, 2020

Author(s)
Marek Majkowski

Word count
2386

Hacker News points
3

Language
English


By Matt Makai. 2021-2024.