Company
Date Published
Author
Ben Epstein
Word count
4324
Language
English
Hacker News points
None

Summary

At Galileo, the team aimed to enable machine learning engineers to surface critical issues in their datasets efficiently and quickly. They faced a unique challenge due to the large size of their data, which was not addressed by other model-centric ML tools. To address this, they explored various options, including using out-of-core data handling tools like pandas with chunking and pyarrow. However, these approaches proved insufficient or too slow for their needs. The team then turned to Vaex, a little-known project that offers an out-of-core hybrid Apache Arrow/NumPy DataFrame solution for big tabular data. Vaex provided the necessary scalability, performance, and flexibility to handle massive datasets with low latency. By leveraging its memory-mappable file format, chunking, and caching capabilities, the team was able to efficiently process large amounts of data, including single columns of high-dimensional vectors, and apply custom NumPy expressions using the `register_function` decorator. Vaex's performance and scalability were significantly better than other tools in handling high-dimensional data, making it an ideal choice for their platform.