Building a Real-Time Streaming ETL Pipeline in 20 Minutes

Company

Confluent

Date Published

June 23, 2017

Author

Lucia Cerchie, Yeva Byzek, Josep Prat

Word count

1966

Language

English

Hacker News points

None

URL

www.confluent.io/blog/building-real-time-streaming-etl-pipeline-20-minutes

Summary

In modern enterprises, the traditional ETL (Extract, Transform, Load) paradigm is giving way to distributed systems and event-driven applications. Businesses now process data in real time and at scale, treating data as a first-class citizen. Apache Kafka® has emerged as the core of these architectures: it provides source connectors for extracting data from different systems, a rich API for complex transformations and analysis, and sink connectors for loading the transformed data into another system. The end-to-end reference architecture also includes Confluent Schema Registry, which manages schemas, validates compatibility, and ensures data conformity. This blog post demonstrates how to implement a streaming ETL pipeline in Apache Kafka using the JDBC connector, Single Message Transform (SMT) functions, and the Kafka Streams API. The workflow extracts data from a SQLite3 database, transforms each record into a key/value pair, and loads it into a Kafka topic for real-time stream processing. Finally, Kafka sink connectors can write the transformed data to another system.
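
The extract-and-transform steps of the workflow above can be sketched as follows. This is a minimal illustration, not the post's own code: the connector name, database path (`test.db`), topic prefix, and the `id` column are assumptions chosen for the example, while the property keys and class names are standard Kafka Connect and JDBC source connector settings. The two helper functions only mimic, in plain Python, what the `ValueToKey` and `ExtractField$Key` SMTs do to a record; they are hypothetical stand-ins, not part of any Kafka library.

```python
import json

# Assumed values: connector name, SQLite path, topic prefix, and "id" column.
# The property keys and transform class names are standard Kafka Connect settings.
jdbc_source_config = {
    "name": "sqlite-source-sketch",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:sqlite:test.db",
        "mode": "incrementing",
        "incrementing.column.name": "id",
        "topic.prefix": "test-sqlite-jdbc-",
        # SMT chain: copy "id" from the value into the key, then unwrap it.
        "transforms": "createKey,extractKey",
        "transforms.createKey.type":
            "org.apache.kafka.connect.transforms.ValueToKey",
        "transforms.createKey.fields": "id",
        "transforms.extractKey.type":
            "org.apache.kafka.connect.transforms.ExtractField$Key",
        "transforms.extractKey.field": "id",
    },
}


def value_to_key(record, fields):
    """Mimic ValueToKey: build a struct-like key from the named value fields."""
    key = {name: record["value"][name] for name in fields}
    return {"key": key, "value": record["value"]}


def extract_key_field(record, field):
    """Mimic ExtractField$Key: replace the struct-like key with a single field."""
    return {"key": record["key"][field], "value": record["value"]}


def apply_transforms(row):
    """Turn a keyless database row into a key/value record, as the SMT chain would."""
    record = {"key": None, "value": row}
    record = value_to_key(record, ["id"])
    return extract_key_field(record, "id")


if __name__ == "__main__":
    # The JSON printed here has the shape of a Kafka Connect connector definition.
    print(json.dumps(jdbc_source_config, indent=2))
    print(apply_transforms({"id": 1, "name": "alice"}))
```

Keying each record by a primitive `id` rather than leaving the key empty is what lets downstream stream processing group and join records by that identifier.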