
Plaid's journey to a multi-cluster Elasticsearch architecture to improve reliability

What's this blog post about?

Plaid has moved to a multi-cluster Elasticsearch architecture to improve the reliability of its logging system. Its previous single 120-node cluster was struggling to handle more than 50,000 logs per second from thousands of Kubernetes containers, which caused frequent delays and lost logs and affected both engineers and customer support operations. The new architecture pairs multiple Elasticsearch clusters with per-service log streams, which isolates critical services and makes it easier to manage load across the hundreds of internal services that emit logs. By using cross-cluster search through a gateway cluster, Plaid kept a single front end for log queries and dashboards with no impact on query performance. Since the launch, log delay has fallen by over 95%, giving customers reliable and instant access to insights, and the observability team can sleep more peacefully at night. The post covers the decision-making process, the details of the new architecture and migration, and the lessons learned since the migration.
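To make the gateway idea concrete, here is a minimal sketch of how per-service log streams could be mapped to clusters and then queried through one front end. It is an illustration under assumptions, not Plaid's implementation: the class LogRouter, the cluster aliases, the service names, and the index pattern are all hypothetical. The only part taken from Elasticsearch itself is the cross-cluster search convention of addressing a remote index as `cluster_alias:index_pattern`.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LogRouter:
    """Maps each service's log stream to a backing cluster (all names hypothetical)."""

    default_cluster: str
    # Services pinned to a specific cluster, e.g. to isolate critical ones.
    overrides: Dict[str, str] = field(default_factory=dict)

    def cluster_for(self, service: str) -> str:
        """Return the cluster alias that holds this service's logs."""
        return self.overrides.get(service, self.default_cluster)

    def search_target(self, services: List[str],
                      index_pattern: str = "logs-{service}-*") -> str:
        """Build a cross-cluster search index expression.

        Each entry has the form 'cluster_alias:index_pattern'; entries are
        comma-separated so one query can span several clusters.
        """
        parts = [
            f"{self.cluster_for(s)}:{index_pattern.format(service=s)}"
            for s in services
        ]
        return ",".join(parts)


# Assumed layout: one general cluster plus an isolated cluster for a critical service.
router = LogRouter(default_cluster="logs-general",
                   overrides={"payments": "logs-critical"})
target = router.search_target(["payments", "web"])
```

A gateway cluster would receive a search against an expression like `target` and fan it out to the remote clusters, so dashboards and queries keep a single entry point while each service's logs live on whichever cluster it was assigned to.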

Company
Plaid

Date published
May 29, 2024

Author(s)
Ben Masschelein-Rodgers, Santi Santichaivekin, Max Zheng, Will Yu, and Brady Wang

Word count
2153

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.