How (And Why) To Move From Spark on YARN to Kubernetes

Company

Acceldata

Date Published

Nov. 4, 2021

Author

Rohit Choudhary

Word count

1078

Language

English

Hacker News points

None

URL

www.acceldata.io/blog/why-move-from-spark-on-yarn-to-kubernetes

Summary

Apache Spark is a popular open source distributed computing framework that lets data engineers process large amounts of data across multiple machines. It is widely used for batch processing as well as machine learning and AI workloads. Traditionally, companies have managed their Spark clusters with Hadoop YARN, which runs on the Java Virtual Machine (JVM). With the rise of Kubernetes and cloud-native computing, however, many organizations are moving their Spark clusters from YARN to Kubernetes. Kubernetes offers potential benefits such as scalability, open source flexibility, and compatibility with various infrastructure types. For Spark in particular, the move can bring simpler dependency management, better resource management, and access to a rich ecosystem of integrations, along with opportunities for cost savings. Key steps in the migration include determining the complexity of existing jobs, evaluating data connectivity needs, analyzing compute and storage latency, and auditing monitoring and security policies.
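To make the dependency-management point concrete, here is a minimal sketch (not taken from the original post) of how the same job submission might differ between the two cluster managers. It only assembles `spark-submit` argument lists and does not launch anything. The helper `build_spark_submit`, the API server address, the image name, the namespace, and the job paths are all hypothetical placeholders; the `--master`, `--deploy-mode`, and `--conf` flags and the `spark.kubernetes.*` property names are standard Spark options.

```python
from typing import Dict, List


def build_spark_submit(master: str, app: str, conf: Dict[str, str]) -> List[str]:
    """Assemble a spark-submit argument list for cluster deploy mode.

    Hypothetical helper for illustration; it does not run spark-submit.
    """
    args = ["spark-submit", "--master", master, "--deploy-mode", "cluster"]
    # Sort the settings so the generated command is deterministic.
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app)
    return args


# Settings that carry over unchanged between cluster managers.
common = {"spark.executor.instances": "4", "spark.executor.memory": "4g"}

# On YARN, the job relies on whatever is installed on the cluster nodes.
yarn_cmd = build_spark_submit("yarn", "hdfs:///jobs/etl.py", common)

# On Kubernetes, dependencies travel inside a container image.
# The address, image, namespace, and path below are placeholders.
k8s_cmd = build_spark_submit(
    "k8s://https://k8s.example.com:6443",
    "local:///opt/spark/jobs/etl.py",
    {
        **common,
        "spark.kubernetes.container.image": "registry.example.com/spark-etl:1.0",
        "spark.kubernetes.namespace": "spark-jobs",
    },
)

print(" ".join(yarn_cmd))
print(" ".join(k8s_cmd))
```

In this sketch the resource settings are identical on both sides; what changes is the master URL and the container image, which is where a job's libraries would be packaged instead of being installed on every cluster node.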