Understanding Kolmogorov-Smirnov (KS) Tests for Data Drift on Profiled Data
This blog post discusses the use of statistical tests, specifically the Kolmogorov-Smirnov (KS) test, for detecting data drift in profiled datasets. Data drift is a common issue in machine learning applications and can degrade model performance if left unaddressed. The KS test is a nonparametric method that compares two one-dimensional samples and assesses whether they are likely drawn from the same distribution.

The post runs three experiments to understand how different variables affect the results of the KS test when it is applied to data profiles rather than raw data: data volume, number of buckets, and profile size.

The first experiment shows that larger sample sizes make the KS test more sensitive, so it becomes more likely to detect small differences between distributions. This sensitivity is not always desirable: with enough data, the test can reject the null hypothesis and flag drift for differences too small to matter in practice, which shows up as false alarms.

The second experiment investigates how the number of buckets affects the accuracy of the KS test on data profiles. Increasing the number of buckets generally reduces the error relative to the standard implementation, but it also increases the variance for larger sample sizes because of estimation errors in the profiling process.

The third experiment examines the effect of profile size on the error. Increasing the profile size decreases the error, so results closer to the standard implementation can be obtained at the cost of additional storage.

In conclusion, performing the KS test on data profiles is feasible and produces results close to the standard implementation, with two limitations to keep in mind. First, the sensitivity of the test grows with sample size, which can lead to drift alerts for negligible differences. Second, tuning internal parameters such as profile size improves accuracy but increases storage requirements.
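The sample-size effect and the idea of testing on a compact summary instead of raw data can be illustrated with a minimal sketch. It is not the implementation used in the post: `ks_statistic`, `ks_pvalue`, and `summarize` are hypothetical helpers written with the Python standard library only, the p-value uses the asymptotic Kolmogorov series, and `summarize` is a simple equal-probability quantile summary standing in for a real profile, with `size` playing the role of profile size.

```python
import bisect
import math
import random


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for x in a + b:  # the largest gap is attained at one of the observed points
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_pvalue(d, n, m):
    """Asymptotic p-value (large-sample approximation, not an exact test)."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:  # the series converges slowly here; the p-value is ~1
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))


def summarize(sample, size):
    """Hypothetical stand-in for a profile: `size` equal-probability quantiles."""
    s = sorted(sample)
    n = len(s)
    return [s[int((i + 0.5) * n / size)] for i in range(size)]


random.seed(0)

# Same small shift in the mean (0.1), two very different sample sizes.
small_a = [random.gauss(0.0, 1.0) for _ in range(100)]
small_b = [random.gauss(0.1, 1.0) for _ in range(100)]
large_a = [random.gauss(0.0, 1.0) for _ in range(20000)]
large_b = [random.gauss(0.1, 1.0) for _ in range(20000)]

d_small = ks_statistic(small_a, small_b)
p_small = ks_pvalue(d_small, len(small_a), len(small_b))
d_large = ks_statistic(large_a, large_b)
p_large = ks_pvalue(d_large, len(large_a), len(large_b))

# KS statistic computed from the compact summaries instead of the raw data.
d_profile = ks_statistic(summarize(large_a, 200), summarize(large_b, 200))

print(f"n=100:    D={d_small:.4f}  p={p_small:.4g}")
print(f"n=20000:  D={d_large:.4f}  p={p_large:.4g}")
print(f"summary:  D={d_profile:.4f}  (exact D={d_large:.4f})")
```

With the larger samples the same 0.1 shift yields a far smaller p-value, which is the sensitivity effect described in the first experiment. Raising the `size` argument of `summarize` moves the summary-based statistic closer to the exact one while storing more points, which mirrors the accuracy versus storage trade-off described for profile size.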
Company
WhyLabs
Date published
Dec. 21, 2022
Author(s)
Felipe Adachi
Word count
2693
Language
English
Hacker News points
None found.