
Building Highly Reliable Data Pipelines at Datadog

What's this blog post about?

Datadog processes a massive volume of data every day, so its data pipelines must be reliable. The post defines reliability as a system's ability to produce correct outputs up to a given time. A highly reliable pipeline is not one that never fails; it is one that consistently delivers data on time even if it occasionally crashes. Designing for reliability involves three things: fault tolerance, good monitoring, and preparedness for failure recovery.

Datadog describes a simplified architecture consisting of an object store for historical data, clusters that run Spark data pipelines, Luigi workers for task and workflow management, and Spark workers that compile code and send it to the appropriate cluster. Rather than running one giant cluster, Datadog uses a separate cluster for each job, which isolates jobs from one another, makes monitoring easier, allows tuning for specific jobs, and makes it easy to scale up or down as needed. Running on AWS spot instances, which AWS can reclaim on short notice, also forces pipelines to be designed for fault tolerance.

Long-running jobs risk losing a significant amount of work when they fail, so pipelines should be broken up in two directions: vertically, by separating transformations into multiple jobs with intermediate data checkpoints, and horizontally, by partitioning the input data and running multiple jobs that together process the whole dataset.

Monitoring enables early detection of failures and relies on system metrics, job metrics, and data latency metrics. When a failure does occur, the goals are quick recovery and minimal customer-facing impact. Techniques include breaking jobs into smaller pieces, increasing cluster size, switching from spot to on-demand clusters, and establishing easy ways to rerun jobs. Backup plans in the query systems keep the service operating when a pipeline is delayed, at the cost of slightly degraded performance.
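The vertical and horizontal breakdown described above, along with a data latency metric, can be illustrated with a minimal Python sketch. It is not Datadog's implementation: the function names (`hourly_partitions`, `run_pipeline`, `data_latency`), the hourly partition size, and the in-memory dictionary standing in for the object store are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def hourly_partitions(start, end):
    """Horizontal break: split [start, end) into one-hour partitions.

    The one-hour granularity is an assumption for this sketch.
    """
    current = start
    while current < end:
        yield current
        current += timedelta(hours=1)


def run_pipeline(partitions, stages, checkpoints):
    """Vertical break: run each stage as its own small unit of work.

    `stages` is a list of (name, function) pairs applied in order.
    `checkpoints` is a dict keyed by (stage name, partition); it is a
    hypothetical stand-in for intermediate data kept in an object store.
    Work that already has a checkpoint is skipped, so a rerun after a
    failure only repeats the pieces that are missing.
    Returns the list of (stage name, partition) keys actually executed.
    """
    executed = []
    for partition in partitions:
        data = partition
        for name, transform in stages:
            key = (name, partition)
            if key not in checkpoints:
                checkpoints[key] = transform(data)
                executed.append(key)
            # Feed the checkpointed output to the next stage.
            data = checkpoints[key]
    return executed


def data_latency(latest_processed, now):
    """Data latency metric: how far the newest processed partition lags behind now."""
    return now - latest_processed
```

Because each (stage, partition) pair is checkpointed separately, losing a cluster partway through costs at most one small unit of work, and calling `run_pipeline` again with the same `checkpoints` store resumes from where the previous run stopped. The value returned by `data_latency` is the kind of number one might alert on when it exceeds an agreed threshold.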

Company
Datadog

Date published
April 2, 2019

Author(s)
Quentin Francois

Word count
2035

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.