Scaling Hierarchical Clustering

What's this blog post about?

This blog post discusses why hierarchical clustering is hard to scale and what can be done about it, covering computational cost, memory use, and cluster quality. It explains how hierarchical clustering works, its two types (agglomerative and divisive), and the distance metrics it uses. The scalability discussion focuses on the cubic time complexity, O(n^3), of naive agglomerative clustering. The strategies presented for addressing these challenges include working with representative subsets, approximation algorithms such as Minimum Spanning Tree (MST)-based methods, divide-and-conquer strategies, and dimensionality reduction, along with tools and frameworks such as fastcluster, Apache Spark MLlib, HDBSCAN, Dask, and RAPIDS cuML. The post concludes that hierarchical clustering requires a careful trade-off between efficiency and cluster quality, and that analysts need to make informed choices when using these algorithms.
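
As an illustration of the "representative subset" strategy mentioned above, here is a minimal sketch that is not taken from the post itself. It assumes NumPy and SciPy are available; the function name cluster_via_subset, the choice of Ward linkage, and the nearest-centroid assignment step are all assumptions made for this example. The idea is to run the expensive agglomerative step only on a random sample and then label every point by its closest sample-cluster centroid.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import cdist


def cluster_via_subset(X, n_sample, n_clusters, seed=0):
    """Hypothetical helper: cluster a random sample, then label all of X."""
    rng = np.random.default_rng(seed)
    n_sample = min(n_sample, len(X))

    # Draw the representative subset without replacement.
    idx = rng.choice(len(X), size=n_sample, replace=False)
    sample = X[idx]

    # Agglomerative clustering on the sample only (Ward linkage is an
    # assumption here; other linkage methods would work the same way).
    Z = linkage(sample, method="ward")
    sample_labels = fcluster(Z, t=n_clusters, criterion="maxclust")

    # One centroid per cluster found in the sample.
    found = np.unique(sample_labels)
    centroids = np.vstack(
        [sample[sample_labels == k].mean(axis=0) for k in found]
    )

    # Assign every point in the full dataset to its nearest centroid.
    nearest = cdist(X, centroids).argmin(axis=1)
    return found[nearest]
```

The linkage step costs time and memory that depend on the sample size rather than the full dataset size, while the final assignment is linear in the number of points. The trade-off is that the full dendrogram exists only for the sample, and clusters that the sample misses cannot be recovered.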

Company
Hex

Date published
Oct. 24, 2023

Author(s)
Andrew Tate

Word count
2328

Language
English

Hacker News points
None found.