How We Scaled Data Quality at Galileo

Company

Galileo

Date Published

Dec. 8, 2022

Author

Ben Epstein

Word count

4324

Language

English

Hacker News points

None

URL

galileo.ai/blog/scaled-data-quality

Summary

At Galileo, the team aimed to enable machine learning engineers to surface critical issues in their datasets efficiently and quickly. They faced a unique challenge due to the large size of their data, which was not addressed by other model-centric ML tools. To address this, they explored various options, including using out-of-core data handling tools like pandas with chunking and pyarrow. However, these approaches proved insufficient or too slow for their needs. The team then turned to Vaex, a little-known project that offers an out-of-core hybrid Apache Arrow/NumPy DataFrame solution for big tabular data. Vaex provided the necessary scalability, performance, and flexibility to handle massive datasets with low latency. By leveraging its memory-mappable file format, chunking, and caching capabilities, the team was able to efficiently process large amounts of data, including single columns of high-dimensional vectors, and apply custom NumPy expressions using the `register_function` decorator. Vaex's performance and scalability were significantly better than other tools in handling high-dimensional data, making it an ideal choice for their platform.