Improving compaction in Cassandra with cardinality estimation

Company

DataStax

Date Published

Jan. 27, 2014

Author

Jonathan Ellis

Word count

660

Language

English

Hacker News points

None

URL

www.datastax.com/blog/improving-compaction-cassandra-cardinality-estimation

Summary

Compaction is a process in Cassandra whereby log-structured data files are merged to evict obsolete or deleted rows. The first component consulted on a read is the Bloom filter, which is a probabilistic set that saves memory by taking just a few bits per key stored. However, bloom filters are not re-sizeable and can waste space if sstables overlap. To solve this problem, Cassandra 2.1 uses HyperLogLog for cardinality estimation to get an accurate count of how much the compacting sstables overlap, saving about 40% of the bloom filter overhead. Experimentally, using cardinality estimation to pick better compaction candidates could be a significant improvement for overwrite-heavy and append-mostly workloads.