Let AI/ML Workloads Take off with Aerospike and Spark 3.0
This post summarizes the integration of AI/ML workloads with Aerospike and Apache Spark 3.0, a topic made especially relevant by the accelerated digitalization brought on by the COVID-19 pandemic. Data science teams are working with larger datasets than ever before, creating demand for efficient model training and deployment; combining Spark 3.0 with Aerospike addresses that need effectively and cost-efficiently.

The post covers the benefits of Spark 3.0, including its support for Kubernetes, which enables a single low-latency, high-throughput pipeline from data ingest to model training on GPU-powered clusters. Spark 3.0 can now schedule AI/ML applications on Spark clusters with GPUs, unifying the CPU-based data platforms used for ETL-style data preparation and the largely GPU-based AI/ML platforms into a single data infrastructure. It also examines new query enhancements in Spark 3.0, such as Adaptive Query Execution (AQE), which helps when statistics on the data source are missing or inaccurate.

It then highlights features added to Aerospike Connect for Spark: data sampling at scale; support for the set indexes and quotas of Aerospike Database 5.6; pushdown of predicates with previously unsupported operators to speed up queries; and support for the highly performant Spark Data Source V2 API.

The post concludes with two case studies of companies that have successfully applied this integration in their AI/ML initiatives, and notes future plans to validate the Spark connector with the RAPIDS Accelerator for Apache Spark, which would allow Aerospike data to be fed directly into deep learning models running on GPUs.
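To make the predicate-pushdown point concrete, here is a minimal pure-Python sketch of why filtering at the data source speeds up queries. This is not the Aerospike Connect for Spark API; the function names and record shape are hypothetical, and the sketch only contrasts client-side filtering with source-side filtering.

```python
# Hypothetical sketch of predicate pushdown (not the actual connector API).
# Without pushdown, every record crosses the wire and is filtered afterward;
# with pushdown, the data source applies the predicate before transfer.

def scan_without_pushdown(records, predicate):
    """Transfer every record, then filter on the client side."""
    transferred = list(records)  # full dataset crosses the wire
    return [r for r in transferred if predicate(r)], len(transferred)

def scan_with_pushdown(records, predicate):
    """The source applies the predicate, so only matches are transferred."""
    transferred = [r for r in records if predicate(r)]  # filtered at source
    return transferred, len(transferred)

if __name__ == "__main__":
    data = [{"id": i, "score": i % 10} for i in range(1000)]
    high_score = lambda r: r["score"] > 7

    result_a, moved_a = scan_without_pushdown(data, high_score)
    result_b, moved_b = scan_with_pushdown(data, high_score)

    assert result_a == result_b  # identical results either way
    print(f"without pushdown: {moved_a} records transferred")
    print(f"with pushdown:    {moved_b} records transferred")
```

Both scans return the same rows; only the number of records moved differs, which is exactly the cost the connector's pushdown support (including previously unsupported operators) is meant to avoid.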
Company
Aerospike
Date published
May 5, 2021
Author(s)
Kiran Matty
Word count
1320
Hacker News points
None found.
Language
English