Company
Date Published
Author
Charles Mahler
Word count
1191
Language
English
Hacker News points
None

Summary

Apache Parquet is an open-source column-oriented storage format designed to improve performance and reduce costs in data storage and processing. It was developed by Twitter and Cloudera and has become a common interchange format for various projects, making it easier for users to import and export data with minimal disruption to their workflow. Parquet's innovative techniques, such as run-length and dictionary encoding, record shredding and assembly, and rich metadata, provide great performance by reducing the size of data on disk through compression and making reads faster for analytics queries. The format has been adopted by several companies and projects, including Hadoop, Apache Iceberg, Delta Lake, Apache Spark, and InfluxDB's IOx, which rely on Parquet to optimize their architecture and reduce storage costs, especially in the age of cloud computing where data processing and scanning are charged based on usage.