Company
Date Published
Author
Ben Epstein
Word count
1356
Language
English
Hacker News points
None

Summary

The blog post takes a data-centric approach, using Galileo to improve a Named Entity Recognition (NER) system trained on the MIT Movies dataset. By inspecting errors in the training data, the authors uncovered mislabeled spans, incorrect span boundaries, and semantic overlap between classes. They filtered out spans with high DEP scores, relabeled the samples that needed correction, and applied targeted filters to difficult classes such as Genre and Actor. These corrections produced a 3.3-point F1 improvement on the test data, with most of the gain coming from fixing just 4% of the training data. The post presents this as evidence that Galileo's workflow can reduce model iterations, GPU costs, and training time while improving model performance in production.
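
The filter-then-relabel loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Galileo's API: the `Span` class, the `dep` field, the 0.8 threshold, and the toy spans and scores are all hypothetical, and the per-span score is assumed to be already computed by some external tool.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Span:
    sample_id: int
    text: str
    label: str
    dep: float  # hypothetical per-span error score in [0, 1]; higher = more suspicious


def flag_for_review(spans, threshold=0.8):
    """Return spans scoring at or above the threshold, most suspicious first."""
    flagged = [s for s in spans if s.dep >= threshold]
    return sorted(flagged, key=lambda s: s.dep, reverse=True)


def apply_relabels(spans, corrections):
    """Return a new list of spans with corrected labels.

    `corrections` maps (sample_id, text) to the corrected label;
    spans without an entry keep their original label.
    """
    return [
        replace(s, label=corrections.get((s.sample_id, s.text), s.label))
        for s in spans
    ]


def fraction_changed(before, after):
    """Fraction of spans whose label differs between the two lists."""
    if not before:
        return 0.0
    changed = sum(1 for b, a in zip(before, after) if b.label != a.label)
    return changed / len(before)


# Toy data, invented for illustration only.
spans = [
    Span(0, "tom hanks", "ACTOR", 0.05),
    Span(1, "thriller", "PLOT", 0.93),
    Span(2, "spielberg", "ACTOR", 0.88),
    Span(3, "1994", "YEAR", 0.02),
]

flagged = flag_for_review(spans)

# A human reviewer decides the corrections after inspecting the flagged spans.
corrections = {(1, "thriller"): "GENRE", (2, "spielberg"): "DIRECTOR"}
cleaned = apply_relabels(spans, corrections)

print([s.text for s in flagged])
print(fraction_changed(spans, cleaned))
```

The threshold only decides which spans a person looks at; the relabeling itself stays a manual decision, which matches the inspect-and-correct workflow the summary describes. The model would then be retrained on the cleaned spans and re-evaluated on the held-out test set.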