Cassandra File System Design

Company

DataStax

Date Published

Feb. 11, 2012

Author

Jake Luciani

Word count

767

Language

English

Hacker News points

None

URL

www.datastax.com/blog/cassandra-file-system-design

Summary

The Cassandra File System (CFS) is an HDFS-compatible filesystem designed to replace the traditional Hadoop NameNode, Secondary NameNode, and DataNode daemons. It reduces operational overhead by removing the NameNode as a single point of failure and offers easy Hadoop integration for Cassandra users. CFS is modeled as a Keyspace with two Column Families in Cassandra: "inode" and "sblocks". The "inode" column family contains meta information about a file, while the "sblocks" column family stores the actual contents of the file. The meta information includes the filename, parent path, user, group, permissions, filetype, and the list of block IDs that make up the file. Because CFS relies on Thrift, which does not support streaming, it splits each block into sub-blocks so that a node is not overloaded with a large amount of data at once. When a read comes in for a file or part of a file, CFS executes a custom Thrift call that returns either the specified sub-block data or, if the call was made on a node that holds the data locally, the file and offset information of the Cassandra SSTable file containing the sub-block. Sub-blocks are compressed and decompressed on the client side, which cuts down network traffic between nodes.
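The two-column-family layout described above can be sketched in a few lines of Python. This is only an in-memory illustration, not the CFS implementation: plain dictionaries stand in for the "inode" and "sblocks" column families, the names `Inode`, `write_file`, and `read_file` are hypothetical, the tiny block and sub-block sizes are chosen purely for readability, and storing each block as one row with one column per sub-block is an assumption the summary does not state.

```python
import uuid
from dataclasses import dataclass, field

# Illustrative sizes in bytes (assumptions; real deployments use far larger values).
BLOCK_SIZE = 8
SUBBLOCK_SIZE = 3


@dataclass
class Inode:
    """Meta information for one file, mirroring the fields named in the summary."""
    filename: str
    parent_path: str
    user: str
    group: str
    permissions: int
    filetype: str
    block_ids: list = field(default_factory=list)


def chunks(data: bytes, size: int) -> list:
    """Split data into consecutive pieces of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def write_file(path, data, inode_cf, sblocks_cf,
               block_size=BLOCK_SIZE, subblock_size=SUBBLOCK_SIZE):
    """Store metadata in inode_cf and contents in sblocks_cf (both plain dicts)."""
    parent, _, name = path.rpartition("/")
    inode = Inode(filename=name, parent_path=parent or "/", user="hadoop",
                  group="hadoop", permissions=0o644, filetype="file")
    for block in chunks(data, block_size):
        block_id = uuid.uuid4()
        # Assumed layout: one row per block, one column per sub-block index.
        sblocks_cf[block_id] = dict(enumerate(chunks(block, subblock_size)))
        inode.block_ids.append(block_id)
    inode_cf[path] = inode
    return inode


def read_file(path, inode_cf, sblocks_cf):
    """Reassemble a file by walking its block IDs and their sub-blocks in order."""
    inode = inode_cf[path]
    out = bytearray()
    for block_id in inode.block_ids:
        row = sblocks_cf[block_id]
        for index in sorted(row):
            out += row[index]
    return bytes(out)
```

Because each sub-block is fetched on its own, a reader never has to hold more than one sub-block of a given block in a single non-streaming response, which is the motivation the summary gives for the split.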