
Optimizing Multi-Modal Analysis by Lazy Loading Dataframes

What's this blog post about?

Hex reports execution speedups of up to 10x after migrating from pandas DataFrames to a DuckDB-based architecture that directly queries Arrow data stored remotely in S3, instead of materializing dataframes into local memory. With dataframes loaded lazily, project runtimes improved by roughly 5-10x, and some internal projects dropped from 30+ seconds to just a few seconds. The gains are most pronounced in projects that primarily use SQL and no-code cells; projects with many Python references see less dramatic improvements.
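To make the lazy-loading idea concrete, here is a minimal, standard-library-only sketch. It is not Hex's implementation and does not use DuckDB or Arrow: `LazyFrame`, `fetch_remote`, and the sample rows are hypothetical names invented for illustration. It assumes that Python references are slower because they force the data to be materialized, which the summary implies but does not state. The sketch only shows that steps can be chained without fetching anything until something actually needs the rows.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, int]


class LazyFrame:
    """Hypothetical stand-in for a lazily loaded dataframe (not Hex's API)."""

    def __init__(self, loader: Callable[[], List[Row]]) -> None:
        self._loader = loader                     # how to get the rows, when needed
        self._rows: Optional[List[Row]] = None    # cache, filled on first use

    def filter(self, predicate: Callable[[Row], bool]) -> "LazyFrame":
        # Describe a new step; nothing is fetched or computed here.
        return LazyFrame(lambda: [r for r in self.materialize() if predicate(r)])

    def materialize(self) -> List[Row]:
        # Fetch and cache the rows the first time they are actually needed.
        if self._rows is None:
            self._rows = self._loader()
        return self._rows


fetches: List[str] = []  # records each simulated remote read


def fetch_remote() -> List[Row]:
    # Assumption: stands in for reading data from remote storage.
    fetches.append("fetch")
    return [{"id": i, "value": i * 10} for i in range(5)]


source = LazyFrame(fetch_remote)
filtered = source.filter(lambda r: r["value"] >= 20)  # still lazy
fetches_before = len(fetches)                         # no remote read so far

rows = filtered.materialize()   # first real use triggers the single fetch
rows_again = filtered.materialize()  # served from the cache, no second fetch
```

In this toy model, a chain of filter steps behaves like the SQL and no-code cells described above: it costs nothing until a result is requested. A call to `materialize()` plays the role of a Python cell that needs the data in local memory.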

Company
Hex

Date published
Sept. 26, 2024

Author(s)
Dylan Scott

Word count
1661

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.