
However improbable: The story of a processor bug

What's this blog post about?

In February 2017, Cloudflare experienced a security incident known as Cloudbleed, caused by a bug in its HTML parsing code. In response, the company began investigating every crash more thoroughly and found that some NGINX process crashes were caused by invalid memory accesses. Engineers used core dumps, which record the state of a process at the moment it terminates, to diagnose these issues. After fixing several crash-causing bugs, they were left with a residue of "mystery core dumps" that seemed impossible given the code, occurring at a rate of about one per day across the entire fleet of servers. The investigation eventually narrowed to servers with Intel Xeon E5-2650 v4 (Broadwell) processors and to an Intel erratum, BDF76, under which affected processors may exhibit internal parity errors or unpredictable system behavior. Applying a microcode update to the Broadwell servers resolved the problem and significantly reduced the rate of core dumps, allowing the team to focus on the remaining issues in its own software.

Company
Cloudflare

Date published
Jan. 18, 2018

Author(s)
David Wragg

Word count
2392

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.