In this article, Raphael Taylor-Davies and Andrew Lamb explain several advanced techniques for quickly querying data stored in Apache Parquet files, as implemented in the Apache Arrow Rust Parquet reader, one of the fastest implementations for querying Parquet files on local disk or in remote object storage. The techniques covered include vectorized decoding, streaming decode, dictionary preservation, projection pushdown, predicate pushdown, row group pruning, page pruning, and late materialization, along with strategies for optimizing I/O and CPU usage when reading Parquet files from commodity blob storage systems such as AWS S3. The authors conclude that while implementing these optimizations requires significant engineering effort, the benefits can be substantial, especially for large-scale data analytics workloads.
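To make one of the listed techniques concrete, the sketch below illustrates the idea behind row group pruning: Parquet stores min/max statistics per row group, so a reader can skip entire row groups whose value range cannot match a predicate, avoiding both I/O and decoding. The type and function names here (`RowGroupStats`, `prune_row_groups`) are illustrative only and are not the actual `parquet` crate API.

```rust
/// Minimal stand-in for the min/max statistics a Parquet footer
/// stores for one column of one row group (assumed names, not the
/// real parquet crate types).
#[derive(Debug)]
struct RowGroupStats {
    min: i64,
    max: i64,
}

/// Return the indices of row groups that could contain rows where
/// `column == target`. Row groups whose [min, max] range excludes
/// `target` are pruned: they are never fetched or decoded.
fn prune_row_groups(stats: &[RowGroupStats], target: i64) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| s.min <= target && target <= s.max)
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let stats = vec![
        RowGroupStats { min: 0, max: 99 },
        RowGroupStats { min: 100, max: 199 },
        RowGroupStats { min: 200, max: 299 },
    ];
    // Only the second row group's range covers the value 150,
    // so the other two are skipped entirely.
    let candidates = prune_row_groups(&stats, 150);
    println!("{:?}", candidates);
}
```

Page pruning applies the same idea at a finer granularity, using per-page statistics from the page index rather than per-row-group statistics.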