Q&A Series: Solving Privacy Problems with Synthetic Data
This post discusses synthetic data and its role in privacy preservation. Synthetic data is created by algorithms that learn the distribution of a sensitive dataset and then generate new records from that learned distribution. This differs from differential privacy, which adds calibrated noise to an aggregate statistic to introduce uncertainty about whether any particular individual is present in the dataset. The two approaches can be combined: applying differential privacy while training the generative model yields synthetic datasets that do not compromise the privacy of individuals in the original data, and it also helps protect machine learning (ML) based synthesis against adversarial attacks.

However, when the goal of an analysis is to study outliers or small, rare populations, privacy-preserving approaches like differential privacy may not be useful, since the added noise obscures exactly the signal of interest. In such cases, controlling who has access to the data and mandating training on acceptable use are more suitable safeguards.

When creating synthetic data, it is important to define its intended use, since that choice affects the algorithm used for generation, the validators used to ensure proper record logic, and the pre-processing applied to the original data. The quality of synthetic data should likewise be measured against its intended use, and various methods are available for doing so. Gretel provides a synthetic data quality report for every model built on its platform, which helps determine whether a synthetic dataset maintains statistical fidelity to the original.

Finally, communicating the error or accuracy of synthetic data is crucial and again depends on the intended use. A differentially private querying system can surface relevant properties of the original dataset in a privacy-preserving manner, after which those properties can be compared against the synthetic dataset.
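To make the differential-privacy idea above concrete, here is a minimal sketch of the Laplace mechanism applied to a count query. This is a generic illustration, not Gretel's implementation; the function name, the salary data, and the parameter choices are all hypothetical.

```python
import numpy as np

def dp_count(values, threshold, epsilon=1.0):
    """Epsilon-differentially private count of values above a threshold.

    A count query has sensitivity 1 (adding or removing one person's
    record changes the answer by at most 1), so Laplace noise with
    scale 1/epsilon suffices for epsilon-differential privacy.
    """
    true_count = sum(1 for v in values if v > threshold)
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical sensitive data: the released count carries noise, so an
# observer cannot tell whether any one individual was included.
salaries = [42_000, 55_000, 61_000, 78_000, 90_000, 120_000]
noisy = dp_count(salaries, threshold=60_000, epsilon=1.0)
```

Smaller `epsilon` means more noise and stronger privacy; for rare populations, as noted above, the noise can swamp the true count entirely.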
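The generate-then-validate workflow described above can be sketched as follows. This deliberately naive synthesizer models each column as an independent Gaussian, which ignores cross-column correlations that real synthesizers (including Gretel's) capture; the fidelity check compares simple per-column statistics, a crude stand-in for a full quality report. All names and data here are illustrative assumptions.

```python
import numpy as np

def fit_and_sample(real: np.ndarray, n: int, seed: int = 0) -> np.ndarray:
    """Naive synthesizer: fit an independent Gaussian to each column
    of the real data, then sample n synthetic rows from it."""
    rng = np.random.default_rng(seed)
    mu, sigma = real.mean(axis=0), real.std(axis=0)
    return rng.normal(mu, sigma, size=(n, real.shape[1]))

def fidelity_report(real: np.ndarray, synth: np.ndarray) -> dict:
    """Crude statistical-fidelity check: per-column gaps between the
    real and synthetic means and standard deviations."""
    return {
        "mean_gap": np.abs(real.mean(axis=0) - synth.mean(axis=0)),
        "std_gap": np.abs(real.std(axis=0) - synth.std(axis=0)),
    }

# Hypothetical original dataset: 1,000 rows, two numeric columns.
rng = np.random.default_rng(42)
real = rng.normal(loc=[50.0, 10.0], scale=[5.0, 2.0], size=(1000, 2))

synth = fit_and_sample(real, n=1000)
report = fidelity_report(real, synth)
```

Small gaps indicate the synthetic data preserved the marginal distributions; a production check would also compare correlations and joint distributions, per the intended use.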
Company
Gretel.ai
Date published
March 11, 2022
Author(s)
Lipika Ramaswamy
Word count
922
Language
English