/plushcap/analysis/cloudflare/computing-euclidean-distance-on-144-dimensions

Computing Euclidean distance on 144 dimensions

What's this blog post about?

The text discusses the optimization journey of an image processing pipeline at Cloudflare. It delves into various algorithms and techniques used to improve the speed of the matching algorithm in their CSAM image scanning tool. Initially, a naive quadratic algorithm was employed, but it proved to be slow. SIMD instructions were then utilized for improvement, but the performance still fell short of expectations. The Vantage Point Tree algorithm was attempted, but due to the large dimensionality of the problem, it failed to narrow down the search effectively. The authors discovered that they didn't need to fetch all data from memory and could compute only a subset of dimensions for each hash. This led to the development of a smarter brute force algorithm, which was further optimized by transposing data for better AVX usage. The final algorithm achieved significant speed improvements, reducing query times from 55ms to just 0.73ms. The text also highlights the challenges faced in making technologies work at Cloudflare scale and emphasizes the need for continuous optimization efforts.

Company
Cloudflare

Date published
Dec. 18, 2020

Author(s)
Marek Majkowski

Word count
2437

Hacker News points
6

Language
English


By Matt Makai. 2021-2024.