When Bloom filters don't bloom
The author describes using a Bloom filter to remove duplicate lines from a dataset of roughly 1 billion lines. A Bloom filter seemed like a natural fit: as a compact probabilistic data structure it promised deduplication in very little memory, and the task only required removing duplicates, not sorting the lines. In practice, however, the filter was too large to fit in L3 cache, so every lookup turned into random memory accesses, which are very costly. Switching to a simple hash table with linear probing produced much better memory access patterns and improved performance. The takeaways: modern CPUs excel at sequential memory access but struggle with random access; advanced data structures like Bloom filters work well only while they fit in L3 cache; for datasets larger than cache, optimizing for a reduced number of memory loads matters more than minimizing memory usage; and profiling tools are powerful for identifying such bottlenecks.
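The post's actual code is not reproduced in this summary; the following is a minimal C sketch of the hash-table-with-linear-probing approach described above. The hash function, table size, and the choice to store 64-bit hashes (accepting rare hash collisions) are illustrative assumptions, not details taken from the article.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>

/* FNV-1a 64-bit hash (an illustrative choice, not the article's). */
static uint64_t fnv1a64(const char *s, size_t len) {
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)s[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

int main(void) {
    /* Power-of-two table, sized well above the expected key count to
     * keep probe chains short; 0 marks an empty slot. A real tool
     * would size this from the input or resize on load. */
    const size_t table_size = 1ULL << 24; /* illustrative capacity */
    uint64_t *table = calloc(table_size, sizeof(uint64_t));
    if (!table) return 1;

    char *line = NULL;
    size_t cap = 0;
    ssize_t len;
    while ((len = getline(&line, &cap, stdin)) > 0) {
        uint64_t h = fnv1a64(line, (size_t)len);
        if (h == 0) h = 1;               /* reserve 0 for "empty" */
        size_t i = h & (table_size - 1);
        /* Linear probing: consecutive slots share a cache line, so a
         * collision usually costs a cheap sequential access rather
         * than a fresh random one, unlike a Bloom filter's multiple
         * independent probes. */
        while (table[i] != 0 && table[i] != h)
            i = (i + 1) & (table_size - 1);
        if (table[i] == 0) {             /* first time seeing this hash */
            table[i] = h;
            fwrite(line, 1, (size_t)len, stdout);
        }
    }
    free(line);
    free(table);
    return 0;
}
```

The key property this sketch illustrates is that a miss or collision stays within memory the CPU has likely already fetched, whereas each extra Bloom-filter probe is another random load.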
Company: Cloudflare
Date published: March 2, 2020
Author(s): Marek Majkowski
Word count: 2386
Language: English
Hacker News points: 227