Bias in Data: What Embeddings Reveal About Real vs Synthetic Data Distribution

Company

Voxel51

Date Published

Jan. 7, 2025

Author

Manushree Gangwar

Word count

2132

Language

English

Hacker News points

None

URL

voxel51.com/blog/bias-in-data-what-embeddings-reveal-about-real-vs-synthetic-data-distribution

Summary

The text discusses biases in human vision and how comparable biases affect machine learning model performance. It highlights how visual perception is shaped by assumptions about the source of illumination and by the Thatcher effect, in which distortions of facial features are hard to detect when a face is upside down. The text also explores the use of synthetic data to offset biases in real-world datasets and discusses the challenges of generating high-quality synthetic data. It uses FiftyOne, Voxel51's open-source toolkit for dataset curation and model evaluation, to compare complex features in the embedding space for datasets that combine real and synthetic images. The results show that synthetic data can itself introduce bias in certain cases, even though it can also improve model performance by reducing biases in real-world datasets. The text concludes that managing the complexities of data distribution is crucial when using synthetic data to reduce bias in machine learning models.