Clouds, caches and connection conundrums

Company

Incident.io

Date Published

Sept. 26, 2023

Author

Ben Wheatley

Word count

2052

Language

English

Hacker News points

URL

incident.io/blog/clouds-caches-and-connection-conundrums

Summary

After moving its infrastructure into Google Cloud, the company saw a spike in connection timeouts, particularly with Postgres and Memcache. The team first doubled the maximum connection lifespan and made the connection pools static, which improved the situation but did not resolve it. They then switched to their own memcached instance running inside Kubernetes, which also did not fully solve the problem. Further investigation showed that a bad query was triggering thousands of duplicate network calls to an external third party, and that this was leading to the increased Postgres and Memcache connection and request timeouts. The issue seemed to be related to a high volume of TCP connections being opened and closed rapidly on the node, with the GKE Dataplane V2 agent Pods (anetd) implicated. They mitigated this by implementing keep-alives for HTTP calls and connection pooling for the relevant workloads. The company's takeaways were to avoid narrowing its hypotheses too quickly and to limit outbound network concurrency in future to avoid similar issues.
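The keep-alive mitigation can be illustrated with a small, self-contained sketch. It is not the company's code: it uses only the Python standard library, a throwaway local server, and hypothetical helper names (`CountingHandler`, `fetch_without_keepalive`, `fetch_with_keepalive`, `count_connections`) to show how reusing one HTTP connection reduces the number of TCP connections opened and closed for the same number of requests.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class CountingHandler(BaseHTTPRequestHandler):
    """Hypothetical local handler that counts accepted TCP connections."""

    protocol_version = "HTTP/1.1"  # allows persistent (keep-alive) connections
    connections = 0
    lock = threading.Lock()

    def setup(self):
        # setup() runs once per accepted TCP connection, not once per request.
        with CountingHandler.lock:
            CountingHandler.connections += 1
        super().setup()

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def fetch_without_keepalive(host, port, n):
    """Open a fresh TCP connection for every request."""
    for _ in range(n):
        conn = HTTPConnection(host, port, timeout=5)
        conn.request("GET", "/", headers={"Connection": "close"})
        conn.getresponse().read()
        conn.close()


def fetch_with_keepalive(host, port, n):
    """Send every request over one persistent connection."""
    conn = HTTPConnection(host, port, timeout=5)
    try:
        for _ in range(n):
            conn.request("GET", "/")
            conn.getresponse().read()  # drain the body so the socket can be reused
    finally:
        conn.close()


def count_connections(fetch, n=5):
    """Run fetch against a local server and return the TCP connections it saw."""
    CountingHandler.connections = 0
    server = ThreadingHTTPServer(("127.0.0.1", 0), CountingHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        host, port = server.server_address
        fetch(host, port, n)
        return CountingHandler.connections
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print("without keep-alive:", count_connections(fetch_without_keepalive))
    print("with keep-alive:   ", count_connections(fetch_with_keepalive))
```

With five requests, the first helper opens five TCP connections and the second opens one. In a real service the same effect usually comes from a shared, pooled HTTP client rather than hand-managed connections.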