Branch predictor: How many "if"s are too many? Including x86 and M1 benchmarks!
The text discusses the cost of branch instructions in computer code, specifically unconditional jumps (jmp), conditional jumps (je), function calls (call) and returns (ret). It explains how modern CPUs use a Branch Prediction Unit (BPU) to predict the target of a branching instruction early in the CPU pipeline. The BPU maintains a Branch Target Buffer (BTB) that stores the history of previously taken branches, allowing it to predict future branches with high confidence. The text presents an experiment where a sequence of unconditional jmp instructions is run on different CPUs to measure their performance under varying conditions. It finds that the cost of each branch increases as more branches are added, and there's a significant drop in performance after a certain number of branches (4096 for most CPUs). This suggests that having too many branches in the hot loop can lead to poor performance due to BTB overflows. The text also notes that conditional branches never-taken are essentially free, and function calls (call/ret) need a BTB entry for best performance. It recommends keeping the number of function calls under 2048 in the hot code for optimal performance on some CPUs. Finally, it compares the results on different CPUs including AMD EPYC, Intel Xeon and Apple Silicon M1, finding that while there are differences in details, the general trend is consistent across all CPUs: having too many branches can lead to poor performance due to BTB overflows.
Company
Cloudflare
Date published
May 6, 2021
Author(s)
Marek Majkowski
Word count
3630
Hacker News points
324
Language
English