The author of this text is an Open Source Software Engineer at dlt, a Python library that lets you build data pipelines as code. The library uses Apache Arrow to speed up pipelines by representing tabular data in memory more efficiently. The Arrow format beats native Python objects (lists of dictionaries) because it offloads computation to Arrow's fast C++ library and avoids processing rows one by one in Python. The author explains how dlt works at a high level, covering its three main steps: extract, normalize, and load. They also describe two pipeline "routes", traditional and Arrow, which differ in how they represent tabular data in memory and how they persist it to disk. The Arrow route is faster because it uses schema-aware pyarrow objects that can be processed concurrently in C++. The author concludes that using Arrow improves performance significantly, especially in the normalize step, where batches of values can be processed concurrently.
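
To make the two routes concrete, here is a minimal sketch contrasting them. `dlt.resource`, `dlt.pipeline`, `pipeline.run`, and `pyarrow.Table` are real APIs, but the resource names, the sample data, and the `duckdb` destination are illustrative assumptions, not taken from the source.

```python
import dlt
import pyarrow as pa

rows = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]

@dlt.resource(name="users_rows")
def users_as_rows():
    # Traditional route: yield native Python objects (a list of dicts);
    # normalize then inspects and rewrites values row by row in Python.
    yield rows

@dlt.resource(name="users_arrow")
def users_as_arrow():
    # Arrow route: yield a schema-aware pyarrow.Table; normalize can
    # offload columnar work to Arrow's C++ kernels instead of looping
    # over individual rows in Python.
    yield pa.Table.from_pylist(rows)

# Hypothetical pipeline setup for this sketch.
pipeline = dlt.pipeline(pipeline_name="arrow_demo", destination="duckdb")
pipeline.run(users_as_rows())
pipeline.run(users_as_arrow())
```

The point of the contrast: both resources deliver the same logical table, but the Arrow route carries an explicit schema and a columnar memory layout from extract onward, which is what lets the normalize step hand whole batches of values to compiled code rather than touching each row in the interpreter.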