Large Scale Data Profiling with whylogs and Fugue on Spark, Ray or Dask
The article covers large-scale data profiling with whylogs, an open-source data logging framework. It explains how whylogs profiles data with minimal overhead and highlights its mergeability property: profiles computed on smaller pieces of a DataFrame can be merged into a single profile of the whole dataset. The article also describes the integration of whylogs with Fugue, an open-source project that runs Python, Pandas, or SQL code on Spark, Dask, or Ray. With this integration, users keep the same simple interface for generating profiles while scaling data logging to big data frameworks such as Spark. The article concludes with use cases for data profiling, including anomaly detection, drift detection, and detecting data quality problems.
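The mergeability property can be illustrated with a minimal sketch in plain Python. This is not the whylogs API: `ColumnProfile`, `from_values`, and `merge` are hypothetical names chosen for illustration, and the toy profile tracks only count, sum, minimum, and maximum. The point is that statistics of this kind can be computed independently on each partition and then combined, which is what allows profiling to be distributed across Spark, Dask, or Ray workers.

```python
import functools
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnProfile:
    """Hypothetical toy profile holding statistics that can be merged."""

    count: int = 0
    total: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    @classmethod
    def from_values(cls, values):
        """Profile one chunk of a numeric column."""
        values = list(values)
        if not values:
            return cls()
        return cls(
            len(values),
            float(sum(values)),
            float(min(values)),
            float(max(values)),
        )

    def merge(self, other):
        """Combine two partial profiles into a profile of their union."""
        return ColumnProfile(
            self.count + other.count,
            self.total + other.total,
            min(self.minimum, other.minimum),
            max(self.maximum, other.maximum),
        )

    @property
    def mean(self):
        return self.total / self.count if self.count else math.nan


# Split one column into chunks, as a distributed engine would partition it.
data = [3, 1, 4, 1, 5, 9, 2, 6]
chunks = [data[:3], data[3:6], data[6:]]

# Profile each chunk independently, then merge the partial profiles.
partials = [ColumnProfile.from_values(chunk) for chunk in chunks]
merged = functools.reduce(ColumnProfile.merge, partials, ColumnProfile())

# Profiling the whole column at once gives the same result.
whole = ColumnProfile.from_values(data)
print(merged == whole)
```

Because the merge is associative and the empty profile acts as an identity, the partial profiles can be combined in any grouping, so no single worker ever needs to see the full dataset.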
Company
WhyLabs
Date published
Oct. 13, 2022
Author(s)
WhyLabs Team
Word count
1295
Language
English
Hacker News points
None found.