On Leaky Datasets and a Clever Horse
There was a horse named Hans, trained by Wilhelm von Osten to perform arithmetic. It turned out that Hans was not actually doing math; he was responding to unconscious body-language cues from his questioner. Machine learning models can behave the same way: they latch onto the simplest explanation of the data and predict from correlations, so a model that looks accurate may fail to generalize to truly unseen data. Data leakage occurs when the test set contains samples that are identical or very similar to samples seen during training, yielding an overly optimistic estimate of model performance. It can be mitigated by removing superfluous or duplicated information from the training and testing datasets, avoiding leaky splits, and verifying the integrity of the dataset before training. Data leakage is common in machine learning and has been found even in well-established, widely used datasets such as ImageNet, where it can lead to incorrect conclusions about model performance.
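As a minimal sketch of the duplicate check described above (hypothetical data and helper names; real workflows typically use perceptual or embedding-based similarity rather than exact hashes), one can flag test samples whose raw content also appears in the training split:

```python
import hashlib


def sample_hash(sample: bytes) -> str:
    """Content hash used to detect exact duplicates across splits."""
    return hashlib.sha256(sample).hexdigest()


def find_leaks(train_samples, test_samples):
    """Return indices of test samples whose content also appears in train."""
    train_hashes = {sample_hash(s) for s in train_samples}
    return [i for i, s in enumerate(test_samples)
            if sample_hash(s) in train_hashes]


# Toy byte "images": one test sample duplicates a training sample.
train = [b"cat_a", b"dog_b", b"bird_c"]
test = [b"fish_d", b"dog_b"]

print(find_leaks(test_samples=test, train_samples=train))  # → [1]
```

Exact hashing only catches byte-identical leaks; near-duplicates (resized or re-encoded images, frames from the same video) require fuzzier similarity measures.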
Company
Voxel51
Date published
Dec. 10, 2024
Author(s)
Jacob Sela
Word count
1747
Language
English
Hacker News points
None found.