Computing Euclidean distance on 144 dimensions
The text discusses the optimization journey of an image processing pipeline at Cloudflare. It delves into various algorithms and techniques used to improve the speed of the matching algorithm in their CSAM image scanning tool. Initially, a naive quadratic algorithm was employed, but it proved to be slow. SIMD instructions were then utilized for improvement, but the performance still fell short of expectations. The Vantage Point Tree algorithm was attempted, but due to the large dimensionality of the problem, it failed to narrow down the search effectively. The authors discovered that they didn't need to fetch all data from memory and could compute only a subset of dimensions for each hash. This led to the development of a smarter brute force algorithm, which was further optimized by transposing data for better AVX usage. The final algorithm achieved significant speed improvements, reducing query times from 55ms to just 0.73ms. The text also highlights the challenges faced in making technologies work at Cloudflare scale and emphasizes the need for continuous optimization efforts.
Company
Cloudflare
Date published
Dec. 18, 2020
Author(s)
Marek Majkowski
Word count
2437
Hacker News points
6
Language
English