Company
Date Published
Sept. 26, 2023
Author
Ben Wheatley
Word count
2052
Language
English
Hacker News points
4

Summary

The company recently moved its infrastructure to Google Cloud and then saw a spike in connection timeouts, particularly with Postgres and Memcache. The team first doubled the maximum connection lifespan and made connection pools static, which improved the situation but did not resolve it. They then switched to their own memcached instance running inside Kubernetes, which also did not fully solve the problem. On further investigation, they found that a bad query was triggering thousands of duplicate network calls to an external third party, which led to the increase in Postgres and Memcache connection and request timeouts. The root cause appeared to lie with the GKE Dataplane V2 agent Pods (anetd) when a high volume of TCP connections was opened and closed rapidly on the node. They mitigated this by enabling keep-alives for HTTP calls and adding connection pooling for the relevant workloads. The company's takeaways were that they should not narrow their hypotheses too quickly and that they should limit outbound network concurrency to avoid similar issues in the future.
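
The last takeaway, limiting outbound network concurrency, can be illustrated with a minimal Python sketch. It is not the company's implementation: the cap `MAX_OUTBOUND`, the function names, and the simulated third-party call (a short sleep instead of a real HTTP request) are all assumptions made for illustration.

```python
import asyncio

# Hypothetical cap on simultaneous outbound calls; the source gives no figure.
MAX_OUTBOUND = 5


async def call_third_party(item, sem, stats):
    """Simulate one outbound call while holding a slot in the semaphore."""
    async with sem:  # waits here if MAX_OUTBOUND calls are already in flight
        stats["in_flight"] += 1
        stats["peak"] = max(stats["peak"], stats["in_flight"])
        try:
            # Stand-in for a real network request (assumption).
            await asyncio.sleep(0.01)
            return item * 2
        finally:
            stats["in_flight"] -= 1


async def fetch_all(items, limit=MAX_OUTBOUND):
    """Run all calls, but never more than `limit` at the same time."""
    sem = asyncio.Semaphore(limit)
    stats = {"in_flight": 0, "peak": 0}
    results = await asyncio.gather(
        *(call_third_party(i, sem, stats) for i in items)
    )
    return list(results), stats["peak"]


def run_demo(n=50, limit=MAX_OUTBOUND):
    """Return the results and the highest concurrency observed."""
    return asyncio.run(fetch_all(range(n), limit))


if __name__ == "__main__":
    results, peak = run_demo()
    print(f"{len(results)} calls completed, peak concurrency {peak}")
```

With a cap like this, a burst of duplicate calls queues inside the process instead of opening many TCP connections at once. In a real service the semaphore would typically sit alongside a pooled, keep-alive HTTP client so that the permitted calls also reuse existing connections.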