How (And Why) To Move From Spark on YARN to Kubernetes
Apache Spark is a popular open source distributed computing framework that lets data engineers process large amounts of data across multiple machines. It is widely used for batch processing and for machine learning and AI workloads. Traditionally, companies have managed their Spark clusters with Hadoop YARN, which runs on the Java Virtual Machine (JVM). With the rise of Kubernetes and cloud-native computing, however, many organizations are moving their Spark clusters from YARN to Kubernetes. Kubernetes offers potential benefits such as scalability, open source flexibility, and compatibility with various infrastructure types, and the transition can bring better dependency management, better resource management, and access to a rich ecosystem of integrations. Key steps in the migration include determining the complexity of jobs, evaluating data connectivity needs, analyzing compute and storage latency, and auditing monitoring and security policies. For data engineers, the payoff of switching to Spark on Kubernetes is simpler day-to-day operation, value-added integrations, and opportunities for cost savings.
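To make the difference in job submission concrete, here is a minimal Python sketch that builds a `spark-submit` command line for each cluster manager. It is an illustration, not taken from the article: the helper `build_spark_submit`, the API server address, the container image name, the namespace, and the job paths are all hypothetical placeholders. The `--master`, `--deploy-mode`, and `--conf` flags and the `spark.kubernetes.*` properties are standard Spark options.

```python
from typing import Dict, List


def build_spark_submit(master: str, app: str, conf: Dict[str, str]) -> List[str]:
    """Assemble a spark-submit argument list (hypothetical helper)."""
    cmd = ["spark-submit", "--master", master, "--deploy-mode", "cluster"]
    # Sort the properties so the generated command is deterministic.
    for key, value in sorted(conf.items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd


# On YARN, the job relies on whatever is installed on the cluster nodes.
yarn_cmd = build_spark_submit(
    "yarn",
    "hdfs:///jobs/etl.py",  # placeholder path
    {"spark.executor.instances": "4"},
)

# On Kubernetes, the master URL points at the API server and the job's
# dependencies travel inside a container image.
k8s_cmd = build_spark_submit(
    "k8s://https://k8s.example.com:6443",  # placeholder API server
    "local:///opt/spark/jobs/etl.py",  # path inside the image (placeholder)
    {
        "spark.executor.instances": "4",
        "spark.kubernetes.container.image": "registry.example.com/spark-etl:1.0",
        "spark.kubernetes.namespace": "spark-jobs",
    },
)

print(" ".join(yarn_cmd))
print(" ".join(k8s_cmd))
```

The main change when migrating a job is the master URL plus the container image property, which is where the dependency management benefit mentioned above comes from: each job can ship its own libraries in its image instead of depending on a shared cluster environment.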
Company
Acceldata
Date published
Nov. 4, 2021
Author(s)
Rohit Choudhary
Word count
1078
Language
English