🔭 Improving Your ML Datasets, Part 2: NER

Company

Galileo

Date Published

June 7, 2022

Author

Ben Epstein

Word count

1356

Language

English

Hacker News points

None

URL

www.galileo.ai/blog/improving-your-ml-datasets-part-2-ner

Summary

The blog post takes a data-centric approach, using Galileo to improve a Named Entity Recognition (NER) model trained on the MIT Movies dataset. By inspecting errors in the training data, the authors uncovered mislabeled spans, incorrect span boundaries, and semantic overlap between classes. They filtered out spans with high DEP scores, relabeled the affected samples, and applied targeted filters to challenging classes such as Genre and Actor. After these corrections, the F1 score on test data improved by 3.3 points, with most of the gain coming from correcting just 4% of the training data. The authors present this as evidence that Galileo's workflow can save model iterations, GPU costs, and training time while improving model performance in production.
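
The filtering step in the summary can be pictured with a small sketch. This is not Galileo's API: the `Span` record, the `dep` field, the 0.8 threshold, and the four toy spans are all assumptions made up for illustration, and the sketch reads "filtering" as pulling high-score spans into a review queue for relabeling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical span record; not a Galileo data structure."""
    sample_id: int
    text: str
    label: str
    dep: float  # assumed per-span error score in [0, 1], higher = more suspect


def high_dep_spans(spans, threshold=0.8, label=None):
    """Return spans at or above the threshold, most suspect first.

    If `label` is given, only spans of that class are considered.
    """
    picked = [
        s for s in spans
        if s.dep >= threshold and (label is None or s.label == label)
    ]
    return sorted(picked, key=lambda s: s.dep, reverse=True)


def fraction_of_samples(selected, total_samples):
    """Share of training samples containing at least one selected span."""
    if total_samples <= 0:
        raise ValueError("total_samples must be positive")
    return len({s.sample_id for s in selected}) / total_samples


# Invented toy data; scores and labels are not from the blog post.
spans = [
    Span(0, "tom hanks", "ACTOR", 0.12),
    Span(1, "sci fi", "GENRE", 0.91),
    Span(2, "spielberg", "ACTOR", 0.87),
    Span(3, "comedy", "GENRE", 0.30),
]

review_queue = high_dep_spans(spans, threshold=0.8)
genre_queue = high_dep_spans(spans, threshold=0.8, label="GENRE")
```

Sorting by score puts the most suspect spans at the top of the review queue, and the optional `label` argument mirrors the class-specific filters the summary mentions for Genre and Actor. `fraction_of_samples` gives a quick check on how much of the training set a given threshold would send to review.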