Clouds, caches and connection conundrums

What's this blog post about?

The company recently moved its infrastructure into Google Cloud and saw a spike in connection timeouts, particularly with Postgres and Memcache. They first tried doubling the maximum connection lifespan and making connection pools static, which improved the situation but did not fully resolve it. They then switched to their own memcached instance running inside Kubernetes, which also did not solve the problem entirely. On further investigation, they discovered that a bad query was causing thousands of duplicate network calls to an external third party, leading to increased Postgres and Memcache connection and request timeouts. The issue seemed to be related to a high volume of TCP connections being opened and closed rapidly on the node due to GKE Dataplane V2 agent Pods (anetd). They mitigated this by implementing keep-alives for HTTP calls and connection pooling for the relevant workloads. The company learned that they should avoid narrowing their hypotheses too quickly and should limit outbound network concurrency in the future to avoid similar issues.
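As a rough illustration of the "limit outbound network concurrency" lesson, the sketch below caps how many outbound calls run at once and collapses duplicate requests before they reach the network. It is not taken from the blog post: the class name `BoundedCaller`, the `fetch` callable, and the limits are all hypothetical, and the post's own implementation may look quite different.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BoundedCaller:
    """Caps concurrent outbound calls and drops duplicate keys (illustrative only)."""

    def __init__(self, fetch, max_concurrency):
        self._fetch = fetch  # hypothetical callable that performs one network call
        self._slots = threading.BoundedSemaphore(max_concurrency)
        self._lock = threading.Lock()
        self._in_flight = 0
        self.peak_in_flight = 0  # highest concurrency actually observed

    def _call_one(self, key):
        # Blocks here while max_concurrency calls are already running.
        with self._slots:
            with self._lock:
                self._in_flight += 1
                self.peak_in_flight = max(self.peak_in_flight, self._in_flight)
            try:
                return self._fetch(key)
            finally:
                with self._lock:
                    self._in_flight -= 1

    def fetch_all(self, keys):
        # Remove duplicates while keeping first-seen order, so each key is
        # requested at most once per batch.
        unique = list(dict.fromkeys(keys))
        # The pool is deliberately larger than the cap (assumed value) so the
        # semaphore, not the pool size, is what limits concurrency.
        with ThreadPoolExecutor(max_workers=16) as pool:
            results = list(pool.map(self._call_one, unique))
        return dict(zip(unique, results))
```

In a real service, `fetch` would typically reuse one long-lived HTTP client (for example a shared `requests.Session`, which keeps connections alive between calls) rather than opening a new TCP connection per request, which is the kind of connection churn the summary describes.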

Company
Incident.io

Date published
Sept. 26, 2023

Author(s)
Ben Wheatley

Word count
2052

Hacker News points
4

Language
English


By Matt Makai. 2021-2024.