
Improving compaction in Cassandra with cardinality estimation

What's this blog post about?

Compaction is the process in Cassandra whereby log-structured data files (sstables) are merged to evict obsolete or deleted rows. The first component consulted on a read is the Bloom filter, a probabilistic set that saves memory by using just a few bits per key stored. However, Bloom filters are not resizable, so space is wasted when the compacting sstables contain many of the same keys. To address this, Cassandra 2.1 uses HyperLogLog cardinality estimation to accurately measure how much the compacting sstables overlap and size the new Bloom filter accordingly, saving about 40% of the Bloom filter overhead. Experiments also suggest that using cardinality estimation to pick better compaction candidates could be a significant improvement for overwrite-heavy and append-mostly workloads.
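To illustrate the idea, here is a minimal sketch (not Cassandra's actual Java implementation) of a HyperLogLog counter in Python. Two per-sstable sketches are merged to estimate the cardinality of the union of their keys; that estimate, rather than the sum of the individual key counts, can then size the merged sstable's Bloom filter. The class name, register count, and key names are illustrative assumptions.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch (illustrative, not Cassandra's implementation)."""

    def __init__(self, p=12):
        self.p = p                      # 2**p registers; error ~ 1.04 / sqrt(2**p)
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # Hash to 64 bits; first p bits select a register.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Register-wise max estimates the cardinality of the key *union*,
        # which is what compaction needs to size the new Bloom filter.
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


# Two hypothetical sstables whose key ranges overlap by 5,000 keys.
sstable_a = HyperLogLog()
sstable_b = HyperLogLog()
for i in range(10_000):
    sstable_a.add(f"key-{i}")
for i in range(5_000, 15_000):
    sstable_b.add(f"key-{i}")

naive_total = 10_000 + 10_000          # summing per-sstable counts overcounts
union_estimate = sstable_a.merge(sstable_b).estimate()   # ~15,000
```

Sizing the merged Bloom filter from `union_estimate` instead of `naive_total` avoids allocating bits for keys that will be deduplicated during the merge; the more the sstables overlap, the larger the saving.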

Company
DataStax

Date published
Jan. 27, 2014

Author(s)
Jonathan Ellis

Word count
660

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.