Data quality issues can be challenging for applications that deal with large amounts of data. Schema validation is a good start, but it doesn't cover all aspects of data quality: a record can pass a schema check and still be "weird data." Monitoring distribution shifts, unique-value ratios, and data-type counts in production can surface those issues. Tools like whylogs can be used to set up data quality monitoring on Kafka streams; whylogs builds lightweight statistical representations of data called profiles, which can be compared, visualized, and monitored for changes, helping identify potential data quality problems early on.
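To make the idea of a profile concrete, here is a minimal pure-Python sketch of the kind of per-field summary such a tool maintains (counts, null and unique-value ratios, data-type counts) and a simple comparison between two batches. The `profile` and `drift_alerts` helpers are hypothetical illustrations of the concept, not the whylogs API.

```python
from collections import Counter

def profile(records):
    """Build a lightweight per-field summary of a batch of records
    (list of dicts): count, null ratio, unique-value ratio, and
    observed data-type counts. Hypothetical helper, not whylogs."""
    stats = {}
    for rec in records:
        for field, value in rec.items():
            s = stats.setdefault(field, {"count": 0, "nulls": 0,
                                         "types": Counter(), "values": set()})
            s["count"] += 1
            if value is None:
                s["nulls"] += 1
            else:
                s["types"][type(value).__name__] += 1
                s["values"].add(value)
    # Reduce the raw accumulators to the ratios we want to monitor.
    return {
        field: {
            "count": s["count"],
            "null_ratio": s["nulls"] / s["count"],
            "unique_ratio": len(s["values"]) / s["count"],
            "type_counts": dict(s["types"]),
        }
        for field, s in stats.items()
    }

def drift_alerts(baseline, current, tolerance=0.2):
    """Compare two profiles and flag fields whose unique-value ratio
    moved by more than `tolerance`, or whose observed types changed."""
    alerts = []
    for field in baseline:
        if field not in current:
            alerts.append((field, "missing field"))
            continue
        if abs(baseline[field]["unique_ratio"]
               - current[field]["unique_ratio"]) > tolerance:
            alerts.append((field, "unique-ratio shift"))
        if set(baseline[field]["type_counts"]) != set(current[field]["type_counts"]):
            alerts.append((field, "type change"))
    return alerts

# Yesterday's batch vs. today's: user_id suddenly arrives as str, not int.
day1 = [{"user_id": i, "country": "US"} for i in range(100)]
day2 = [{"user_id": str(i), "country": "US"} for i in range(100)]
print(drift_alerts(profile(day1), profile(day2)))
# → [('user_id', 'type change')]
```

In a streaming setup, the same pattern applies: each Kafka consumer batch is profiled, and the new profile is compared against a baseline before alerting. The real whylogs profiles are mergeable across workers and time windows, which is what makes them practical at scale.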