The story of one latency spike

What's this blog post about?

A customer reported slow HTTP responses from Cloudflare CDN servers. The problem was hard to reproduce and went unnoticed by the usual monitoring systems. Investigation revealed latency spikes between the router and the server inside the datacenter. SystemTap, a tracing and debugging tool for Linux, helped identify the kernel function tcp_collapse as the cause of the spikes. The issue was resolved by adjusting the tcp_rmem sysctl to limit the receive buffer size on TCP sockets, which reduced the time the kernel spends on the garbage-collection-like cleanup of the receive queue and improved performance.
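As a rough illustration of the setting involved, the sketch below parses the three fields (min, default, max, in bytes) that Linux exposes for TCP receive buffers in /proc/sys/net/ipv4/tcp_rmem and checks whether the maximum exceeds a chosen limit. The helper names and the 2 MiB threshold are hypothetical, picked only for the example; they are not the values or tooling described in the blog post.

```python
from pathlib import Path
from typing import NamedTuple, Optional

# Standard Linux location of the TCP receive-buffer sysctl.
TCP_RMEM_PATH = Path("/proc/sys/net/ipv4/tcp_rmem")

# Hypothetical threshold used only for this illustration.
EXAMPLE_LIMIT_BYTES = 2 * 1024 * 1024


class TcpRmem(NamedTuple):
    min_bytes: int
    default_bytes: int
    max_bytes: int


def parse_tcp_rmem(text: str) -> TcpRmem:
    """Parse the 'min default max' triple as it appears in the proc file."""
    fields = text.split()
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}: {text!r}")
    return TcpRmem(*(int(field) for field in fields))


def read_tcp_rmem(path: Path = TCP_RMEM_PATH) -> Optional[TcpRmem]:
    """Return the current setting, or None if the file is unavailable."""
    try:
        return parse_tcp_rmem(path.read_text())
    except OSError:
        return None


def exceeds_limit(rmem: TcpRmem, limit_bytes: int) -> bool:
    """True when the maximum receive buffer is larger than limit_bytes."""
    return rmem.max_bytes > limit_bytes


if __name__ == "__main__":
    current = read_tcp_rmem()
    if current is None:
        print("tcp_rmem not available on this system")
    else:
        print(current, "exceeds limit:", exceeds_limit(current, EXAMPLE_LIMIT_BYTES))
```

A larger maximum lets a socket queue more data before the kernel has to compact it, so a check like this only flags candidates for tuning; the right limit depends on the workload.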

Company
Cloudflare

Date published
Nov. 19, 2015

Author(s)
Marek Majkowski

Word count
1462

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.