Streaming in Cassandra 2.0
Streaming is the component of Apache Cassandra that exchanges data among the nodes of a cluster. It is used during operations such as bootstrap, repair, and bulk loading. Before version 2.0, tracing the cause of slow or stuck streaming was difficult. To address this, the streaming protocol and API were redesigned in Cassandra 2.0 to improve reliability, traceability, and speed. The new design, Streaming 2.0, associates all streaming sessions that belong to one operation (e.g., bulk load, move, bootstrap) with a single Stream Plan, so the ongoing streams can be tracked in one place using nodetool netstats. Each Stream Plan has its own unique Stream Plan ID, which can be used to trace log entries and to monitor streaming through the JMX interface or through custom monitoring applications. Streaming 2.0 also pipelines file transfers on the same connection, so a sender no longer has to wait for an ACK before transferring the next file. There is still room for further performance improvement, and the redesign lays the groundwork for future enhancements such as streaming SSTable files written by older versions during Cassandra upgrades. The Streaming API lives in the package org.apache.cassandra.streaming, and more information about its design is available on the Wiki.
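The grouping idea described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class names ToySession and ToyStreamPlan, their fields, and the peer addresses are hypothetical and are not the classes in org.apache.cassandra.streaming. It shows how a single plan ID lets progress be summed over every session of one operation.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ToySession:
    """One peer-to-peer transfer; a hypothetical stand-in, not a Cassandra class."""
    peer: str
    files_total: int
    files_done: int = 0


@dataclass
class ToyStreamPlan:
    """Groups every session of one operation under a single plan ID (toy model)."""
    description: str
    plan_id: uuid.UUID = field(default_factory=uuid.uuid4)
    sessions: list = field(default_factory=list)

    def add_session(self, peer, files_total):
        # Register a new session with this plan and hand it back to the caller.
        session = ToySession(peer, files_total)
        self.sessions.append(session)
        return session

    def progress(self):
        """Return (files_done, files_total) summed over all sessions."""
        done = sum(s.files_done for s in self.sessions)
        total = sum(s.files_total for s in self.sessions)
        return done, total


# One operation (here labelled "Bootstrap") streaming from two made-up peers.
plan = ToyStreamPlan("Bootstrap")
a = plan.add_session("10.0.0.1", files_total=3)
b = plan.add_session("10.0.0.2", files_total=2)
a.files_done = 2
b.files_done = 1
print(plan.plan_id, plan.description, plan.progress())
```

Because every session hangs off one plan object, a monitoring tool only needs the plan ID to find all related transfers, which is the property the redesign is described as providing.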
Company
DataStax
Date published
Sept. 4, 2013
Author(s)
Yuki Morishita
Word count
516
Language
English