
How Good is Parquet for Wide Tables (Machine Learning Workloads) Really?

What's this blog post about?

The blog post examines how well Apache Parquet handles wide tables with thousands of columns, a shape common in machine learning workloads. It argues that while concerns about Parquet metadata overhead are valid, that overhead is smaller than generally recognized: tuning writer settings and applying simple implementation tweaks reduces it by 30-40%, and deeper implementation work could improve metadata decode speeds by up to 4x. The post concludes that engineering effort spent making Thrift decoding and the Thrift-to-parquet-rs struct transformation more efficient translates directly into faster overall metadata decoding.

Company
InfluxData

Date published
June 18, 2024

Author(s)
Xiangpeng Hao

Word count
1939

Language
English

Hacker News points
None found.


By Matt Makai. 2021-2024.