Optimizing Multi-Modal Analysis by Lazy Loading Dataframes
Hex has sped up project execution by as much as 10x by migrating from pandas DataFrames to a DuckDB-based architecture that queries Arrow data stored remotely in S3 directly, instead of materializing dataframes in local memory. Because dataframes are now loaded lazily, project runtimes typically improve by roughly 5-10x, and some internal projects have dropped from 30+ seconds to just a few seconds. The gains are most pronounced in projects that rely mainly on SQL and no-code cells; projects with many Python references see less dramatic improvements.
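The lazy-loading idea in the summary above can be illustrated with a minimal pure-Python sketch. This is not Hex's implementation, which per the summary relies on DuckDB querying Arrow data in S3; the `LazyFrame` class, the `fetch_remote_rows` function, and the sample rows are all hypothetical names and data invented for illustration. The sketch only shows the general pattern: hold a cheap handle, and pay the loading cost once, at the moment something actually needs the rows.

```python
from typing import Any, Callable, Dict, List, Optional

Rows = List[Dict[str, Any]]


class LazyFrame:
    """Hypothetical handle that defers loading rows until they are needed."""

    def __init__(self, loader: Callable[[], Rows]) -> None:
        self._loader = loader              # how to fetch the data, not the data itself
        self._rows: Optional[Rows] = None  # nothing is in local memory yet
        self.load_count = 0                # counts how often the loader actually ran

    @property
    def is_materialized(self) -> bool:
        return self._rows is not None

    def materialize(self) -> Rows:
        # Load on first access only; later calls reuse the cached rows.
        if self._rows is None:
            self._rows = self._loader()
            self.load_count += 1
        return self._rows


def fetch_remote_rows() -> Rows:
    # Stand-in for an expensive read from remote storage (assumed, illustrative data).
    return [{"region": "east", "sales": 10}, {"region": "west", "sales": 25}]


frame = LazyFrame(fetch_remote_rows)

# Creating the handle does not trigger a load.
created_without_loading = not frame.is_materialized

# The first real use of the rows triggers exactly one load.
total_sales = sum(row["sales"] for row in frame.materialize())

# A second access reuses the cached rows instead of loading again.
frame.materialize()
```

In this sketch, a cell that never touches the rows never pays for the load, which loosely mirrors why the summary reports smaller gains for projects with many Python references: those references force the data into local memory.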
Company
Hex
Date published
Sept. 26, 2024
Author(s)
Dylan Scott
Word count
1661
Language
English
Hacker News points
None found.