Webinar: Streaming Big Data with Spark, Spark Streaming, Kafka, Cassandra and Akka

Company

DataStax

Date Published

Jan. 21, 2015

Author

Helena Edelson

Word count

1201

Language

English

Hacker News points

None

URL

www.datastax.com/blog/webinar-streaming-big-data-spark-spark-streaming-kafka-cassandra-and-akka

Summary

Helena Edelson presented a webinar on Spark, Spark Streaming, and Cassandra with Kafka and Akka, attended by over 1700 participants from around the world. She explained that these technologies suit a lambda architecture because of their shared features and strategies. Integrated, they form a streaming data platform that delivers meaningful information in real time at high velocity, within a highly distributed, asynchronous, parallel, fault-tolerant system.

The Q&A session covered a wide range of topics. On Spark: whether Hive or Spark is faster when using BYOH (Spark is faster); the memory a Spark cluster requires to query HDFS data; the status of SparkR (alpha); how Spark differs from Logstash and Kibana (Spark is a cluster computing framework for large-scale data processing, whereas Logstash aggregates and processes text logs and Kibana provides a UI); how Spark interacts with JDBC data sources (via JdbcRDD); and whether any column can be used in a WHERE clause in SparkSQL (yes, all columns are now supported).

On Spark Streaming and the solution as a whole: implementing moving/sliding windows (possible, as windowing is built in); using Spark Streaming from HDFS (yes, through several operations for streaming from and to any HDFS-compatible sources and sinks); the solution's real-time data calculation capabilities (good, due to Spark's in-memory computing); and its user-friendliness (the API requires technical knowledge or training, but a UI is planned).

On Cassandra: elasticity of scale (easier with VNodes enabled, though they are not required); availability of the Spark Cassandra Connector (in DSE 4.5 and greater, publicly available from 4.6 onwards); how splits are handled with VNodes on (a single Spark partition queries data residing on a single Cassandra node); how transactions are handled (through Paxos, which implements a quorum-based "lightweight" transactional system); and overcoming flushwriter blocking issues with this methodology (by reducing cores, not using batch size in rows = 1).

On Kafka: common deployment patterns for Kafka -> Spark -> Cassandra (co-located nodes for Spark and Cassandra, separate nodes for Kafka, with a local Kafka cluster recommended in each data center); and how Kafka deals with overwhelmed consumers (consumers are responsible for fetching data from brokers and managing their own state).
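
The sliding-window behavior mentioned in the Q&A can be illustrated without a cluster. The following is a minimal plain-Python sketch of the windowing semantics only, not the Spark Streaming API: the function name `sliding_windows`, its parameters, and the sample batches are all hypothetical, and the window length and slide interval are expressed as counts of micro-batches rather than durations.

```python
from typing import Iterator, List, Sequence


def sliding_windows(
    batches: Sequence[List[int]],
    window_length: int,
    slide_interval: int,
) -> Iterator[List[int]]:
    """Yield the records of the last `window_length` batches,
    emitting one window every `slide_interval` batches.

    Hypothetical helper for illustration; it only mimics the idea of a
    window defined by a length and a slide interval.
    """
    if window_length <= 0 or slide_interval <= 0:
        raise ValueError("window_length and slide_interval must be positive")
    # A window is emitted each time `slide_interval` more batches have arrived.
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        # Early windows may cover fewer than `window_length` batches.
        start = max(0, end - window_length)
        # Flatten the batches that fall inside the current window.
        yield [record for batch in batches[start:end] for record in batch]


# Five made-up micro-batches of integer records.
batches = [[1, 2], [3], [4, 5], [6], [7, 8]]

# Sum over a window of 3 batches, recomputed every 2 batches.
window_sums = [
    sum(window)
    for window in sliding_windows(batches, window_length=3, slide_interval=2)
]
print(window_sums)
```

When the window is longer than the slide interval, as here, consecutive windows overlap, so a record can contribute to more than one result. That overlap is what distinguishes a sliding window from simple per-batch processing.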