On Leaky Datasets and a Clever Horse
There was a horse named Hans, trained by Wilhelm von Osten to perform arithmetic. It turned out that Hans was not actually doing math; he was responding to unconscious body-language cues from his questioner. Machine learning models can behave the same way: they latch onto the simplest explanation of the data and predict from correlations, so a model that looks accurate may fail to generalize to truly unseen data. Data leakage occurs when the test set contains samples that are identical or very similar to samples seen during training, yielding an overly optimistic estimate of model performance. It can be mitigated by removing superfluous or duplicated information from the training and testing datasets, avoiding leaky splits, and verifying the integrity of the dataset before training. Data leakage is common in machine learning and has been found even in well-established, widely used datasets such as ImageNet, where it can lead to incorrect conclusions about model performance.
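As a minimal sketch of the duplicate check described above (hypothetical data and helper names; real workflows typically use perceptual or embedding-based similarity rather than exact hashes), one can flag test samples whose raw content also appears in the training split:

```python
import hashlib


def sample_hash(sample: bytes) -> str:
    """Content hash used to detect exact duplicates across splits."""
    return hashlib.sha256(sample).hexdigest()


def find_leaks(train_samples, test_samples):
    """Return indices of test samples whose content also appears in train."""
    train_hashes = {sample_hash(s) for s in train_samples}
    return [i for i, s in enumerate(test_samples)
            if sample_hash(s) in train_hashes]


# Toy byte "images": one test sample duplicates a training sample.
train = [b"cat_a", b"dog_b", b"bird_c"]
test = [b"fish_d", b"dog_b"]

print(find_leaks(test_samples=test, train_samples=train))  # → [1]
```

Exact hashing only catches byte-identical leaks; near-duplicates (resized or re-encoded images, frames from the same video) require fuzzier similarity measures.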
Company
Voxel51
Date published
Dec. 10, 2024
Author(s)
Jacob Sela
Word count
1747
Language
English
Hacker News points
None found.