Why does one NGINX worker take all the load?

Company

Cloudflare

Date Published

Oct. 23, 2017

Author

Marek Majkowski

Word count

1663

Language

English

Hacker News points

164

URL

blog.cloudflare.com/the-sad-state-of-linux-socket-balancing

Summary

The post compares three designs for a high-performance TCP server: (a) a single listen socket with a single worker process; (b) a single listen socket shared by multiple worker processes; and (c) multiple worker processes, each with its own listen socket. Adding worker processes removes the single-CPU-core bottleneck, but it introduces a new problem: spreading the accept() load evenly across those processes. With a shared listen socket, Linux behaves differently depending on how the workers wait for connections: workers blocked in accept() are served in FIFO order, whereas workers waiting in epoll are woken in LIFO order, so the busiest worker tends to take most of the new connections. SO_REUSEPORT works around this imbalance by giving each worker its own listen socket and splitting incoming connections across separate accept queues, which distributes load more evenly. That comes at a cost: in the post's measurements under high load, the average latency stays comparable, but the maximum rises significantly and, most importantly, the deviation becomes very large, leaving the server in a degraded-latency state. The post concludes by suggesting that changing the standard epoll behavior from LIFO to FIFO could be a better solution.
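As a rough illustration of model (c), the sketch below opens several listen sockets on the same port with SO_REUSEPORT and counts how many loopback connections land in each accept queue. It is a minimal sketch, not code from the post: the helper names `make_listener` and `demo`, the use of a single process instead of separate workers, and the listener and client counts are all assumptions made for illustration. It also assumes a Linux kernel with SO_REUSEPORT support (3.9 or later).

```python
import select
import socket


def make_listener(port: int, backlog: int = 64) -> socket.socket:
    """Create a non-blocking listen socket with SO_REUSEPORT set (hypothetical helper)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_REUSEPORT must be set before bind() on every socket sharing the port.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen(backlog)
    s.setblocking(False)
    return s


def demo(num_listeners: int = 4, num_clients: int = 32) -> list[int]:
    """Return the number of connections accepted on each listen socket."""
    # Let the kernel pick a free port, then bind the remaining sockets to it.
    first = make_listener(0)
    port = first.getsockname()[1]
    listeners = [first] + [make_listener(port) for _ in range(num_listeners - 1)]

    # Each client uses a different source port, so the kernel's hash of the
    # connection tuple decides which accept queue receives it.
    clients = [socket.create_connection(("127.0.0.1", port)) for _ in range(num_clients)]

    counts = [0] * num_listeners
    accepted = 0
    while accepted < num_clients:
        readable, _, _ = select.select(listeners, [], [], 2.0)
        if not readable:
            break  # timed out; give up rather than spin forever
        for s in readable:
            try:
                conn, _addr = s.accept()
            except BlockingIOError:
                continue
            conn.close()
            counts[listeners.index(s)] += 1
            accepted += 1

    for sock in clients + listeners:
        sock.close()
    return counts


if __name__ == "__main__":
    print(demo())
```

The split across queues depends on the kernel's hashing of each connection, so the per-socket counts vary from run to run; only the total is fixed. In a real deployment each listen socket would belong to its own worker process, which is what lets a slow worker's queue back up independently of the others.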