How Good is Parquet for Wide Tables (Machine Learning Workloads) Really?
This blog post examines how well Apache Parquet handles wide tables with thousands of columns, a shape common in machine learning workloads. It finds that while concerns about Parquet metadata overhead are valid, the actual cost is smaller than commonly believed: tuning writer settings and applying simple implementation tweaks can cut metadata overhead by 30-40%, and further implementation optimization could improve decode speed by up to 4x. The post concludes that engineering effort spent on faster Thrift decoding and on the Thrift-to-parquet-rs struct transformation translates directly into faster overall metadata decoding.
Company
InfluxData
Date published
June 18, 2024
Author(s)
Xiangpeng Hao
Word count
1939
Language
English
Hacker News points
None found.