Usage Deduplication is Challenging: How to avoid double counting usage data in distributed systems

Company

OpenMeter

Date Published

July 24, 2023

Author

Peter Marton

Word count

1550

Language

English

Hacker News points

None

URL

openmeter.io/blog/usage-deduplication

Summary

Usage data is crucial for businesses to bill customers accurately, identify sales opportunities, and predict revenue. Its accuracy is essential, because undercharging or overcharging customers can lead to financial losses and damaged reputations. Engineers must balance consistency, latency, and cost trade-offs to deliver real-time, accurate data that powers these business use cases.
In distributed systems with unreliable networks, keeping usage metering accurate is challenging. The standard answer to a failed delivery is to retry, but retries make deduplication necessary to avoid double-counting events. Deduplicating large volumes of data while serving accurate aggregated meters in real time requires some preparation.
To achieve idempotency in usage metering, we first need to establish what makes an event unique by assigning it an idempotency key. This key typically contains random and time components and can often reuse idempotency keys that already exist in your business logic. Uniqueness also has a time component: a window within which an event is considered unique based on its idempotency key.
Deduplication can happen at three points:
1. At collection time, at the usage source. Deduplication here would need to be stateful and look for idempotency across multiple processes. It can be efficient for handling network retries in distributed systems but doesn't guarantee consistency.
2. At ingestion time, in the metering system. This is the most powerful option, as it can filter out duplicates across multiple sources and use state to store historical idempotency keys over a longer time window. It can be done in the processing pipeline with Bloom filters or with stream processing on Kafka.
3. At query time, before serving usage to consumers. This is usually not feasible on large data sets, as it puts a heavy load on your data store and results in slow queries.
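To make the ingestion-time option more concrete, here is a minimal sketch of how a Bloom filter might sit in front of an exact key store. It is an illustration under stated assumptions, not the approach of any particular metering system: the names `BloomFilter` and `Deduplicator`, the filter size, and the in-memory `set` standing in for a persistent key store are all hypothetical. Because a Bloom filter can return false positives but never false negatives, the sketch only uses it to skip the exact lookup for keys that are definitely new.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1 << 16, num_hashes=4):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        # Derive several bit positions by salting one hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )


class Deduplicator:
    """Hypothetical ingestion-time filter keyed on an idempotency key."""

    def __init__(self):
        self.bloom = BloomFilter()
        self.seen = set()  # assumption: stands in for a persistent key store
        self.exact_lookups = 0

    def accept(self, idempotency_key):
        """Return True the first time a key is seen, False for duplicates."""
        if self.bloom.might_contain(idempotency_key):
            # Possibly a duplicate: confirm against the exact store.
            self.exact_lookups += 1
            if idempotency_key in self.seen:
                return False
        # Definitely new, or a Bloom false positive: record and accept.
        self.bloom.add(idempotency_key)
        self.seen.add(idempotency_key)
        return True
```

A retried event carries the same idempotency key, so `accept` returns `False` for it, while the exact lookup is skipped whenever the filter reports a key as definitely unseen. This sketch ignores the time window entirely; a real pipeline would also need to expire old keys.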
OpenMeter, an open-source solution for accurate usage metering, ingests usage data as events that follow the CloudEvents specification and leverages Kafka for stream processing. It deduplicates events on the combination of id and source: each event's occurrences are counted within a deduplication window, set to 32 days by default. Only events that occur for the first time within this window are processed further and incorporated into the metering.
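The windowed rule described above can be sketched in a few lines. This is not OpenMeter's implementation, which the summary says is built on Kafka stream processing; it is an in-memory illustration with a hypothetical `WindowedDeduplicator` class. Two details are assumptions of this sketch rather than facts from the source: the window is measured from the first time a key was seen, and the caller supplies the current time.

```python
from datetime import datetime, timedelta, timezone

DEDUP_WINDOW = timedelta(days=32)  # default window stated in the summary


class WindowedDeduplicator:
    """Hypothetical in-memory deduplicator keyed on (source, id)."""

    def __init__(self, window=DEDUP_WINDOW):
        self.window = window
        self.first_seen = {}  # (source, id) -> time the key was first seen

    def is_first_occurrence(self, event, now):
        # CloudEvents requires `id` and `source`; together they form the key.
        key = (event["source"], event["id"])
        seen_at = self.first_seen.get(key)
        if seen_at is not None and now - seen_at < self.window:
            return False  # duplicate inside the window: drop it
        # New key, or the earlier occurrence has aged out of the window.
        self.first_seen[key] = now
        return True


# Example event; the type, source, and data values are made up.
event = {
    "specversion": "1.0",
    "type": "api-calls",
    "id": "00001",
    "source": "service-a",
    "subject": "customer-1",
    "data": {"duration_ms": 12},
}

dedup = WindowedDeduplicator()
t0 = datetime(2023, 7, 24, tzinfo=timezone.utc)
first = dedup.is_first_occurrence(event, now=t0)
retry = dedup.is_first_occurrence(event, now=t0 + timedelta(days=1))
```

Here `first` is `True` and `retry` is `False`, so only the first delivery would be metered. An event with the same `id` but a different `source` counts as a separate event, because the key is the pair.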