Data Logging: Sampling versus Profiling

Company

WhyLabs

Date Published

Oct. 29, 2020

Author

Bernease Herman

Word count

1433

Language

English

Hacker News points

URL

whylabs.ai/blog/posts/data-logging-sampling-versus-profiling

Summary

The article discusses the importance of data logging for robust ML/AI applications and compares two approaches: sampling and profiling. Sampling selects a subset of records from a larger data stream, either randomly or programmatically, while profiling collects statistical measurements of the data. The author argues that profiling is superior to sampling because it offers a lightweight, robust way to characterize distributions for all types of data encountered in ML. Profiling also captures rare events and outliers accurately, and these are often correlated with data issues. The article presents whylogs, an open-source library developed by the team at WhyLabs that enables scalable, statistical data logging and profiling in only a few lines of code. It also highlights how profiles support automated monitoring of ML/AI applications and pipelines because they are lightweight, controlled, simple, human-centered, and statistical.
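The contrast between the two approaches can be sketched in a few lines of standard-library Python. This is an illustration only, not the whylogs API or the article's own code: the names `reservoir_sample` and `RunningProfile` are hypothetical, the stream is synthetic, and the profile tracks only a count, minimum, maximum, mean, and variance, which is far less than a real profiling library would collect.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of k items from a stream (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = x
    return sample


class RunningProfile:
    """Hypothetical minimal profile: summary statistics updated in one pass."""

    def __init__(self):
        self.count = 0
        self.minimum = float("inf")
        self.maximum = float("-inf")
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations (Welford's method)

    def update(self, x):
        self.count += 1
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance; 0.0 for an empty profile.
        return self._m2 / self.count if self.count else 0.0


# Synthetic stream: 9,999 ordinary values and a single outlier.
stream = [1.0] * 9999 + [1000.0]
random.Random(1).shuffle(stream)

sample = reservoir_sample(stream, k=100)

profile = RunningProfile()
for x in stream:
    profile.update(x)

print("outlier in sample:", 1000.0 in sample)
print("profile max:", profile.maximum, "count:", profile.count)
```

With a sample of 100 out of 10,000 records, the single outlier has only a 1% chance of being retained, whereas the profile sees every record and always reports it as the maximum. Both structures use constant memory with respect to stream length, which is the property that makes either approach practical for logging.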