Improving compaction in Cassandra with cardinality estimation
Compaction is a process in Cassandra whereby log-structured data files are merged to evict obsolete or deleted rows. The first component consulted on a read is the Bloom filter, which is a probabilistic set that saves memory by taking just a few bits per key stored. However, bloom filters are not re-sizeable and can waste space if sstables overlap. To solve this problem, Cassandra 2.1 uses HyperLogLog for cardinality estimation to get an accurate count of how much the compacting sstables overlap, saving about 40% of the bloom filter overhead. Experimentally, using cardinality estimation to pick better compaction candidates could be a significant improvement for overwrite-heavy and append-mostly workloads.
Company
DataStax
Date published
Jan. 27, 2014
Author(s)
Jonathan Ellis
Word count
660
Hacker News points
None found.
Language
English