Company
Date Published
Author
Paul Dix
Word count
2528
Language
English
Hacker News points
3

Summary

Apache Arrow, Parquet, Flight, and their surrounding ecosystem are a game-changer for OLAP (Online Analytical Processing) systems: they provide interoperability between projects, large performance gains when moving data into and out of big data systems, and a common data API. Arrow is an in-memory columnar data format designed to take advantage of modern CPU architectures, while Parquet is a compressed on-disk columnar format that has become a widely accepted input and output format across big data systems. Flight is a framework for fast data transport that defines a gRPC API for shipping Arrow array data, with FlatBuffers messages describing metadata such as the schema, dictionaries, and the boundaries between record batches. Together, the ecosystem offers tools for querying in-memory data quickly, in-process and in the language of choice, along with potential integrations with other systems for tasks such as data science and distributed query execution. Although several of these projects are still early in their maturity, they form a set of tools for building interoperable components for data science, analytics, and working with data at scale. They could disrupt the OLAP ecosystem through increased interoperability and through new design requirements for systems that run in environments like Kubernetes.
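To make the columnar idea in the summary concrete, here is a minimal sketch in plain Python. It does not use Arrow or Parquet; it only uses the standard-library `array` module to stand in for a contiguous typed buffer. The sample records, the `to_columns` helper, and the column names are all hypothetical and exist only for illustration.

```python
from array import array

# Row-oriented layout: one record per entry (hypothetical sample data).
rows = [
    {"host": "a", "cpu": 0.50, "mem": 1024},
    {"host": "b", "cpu": 0.75, "mem": 2048},
    {"host": "c", "cpu": 0.25, "mem": 4096},
]


def to_columns(records):
    """Pivot row records into one sequence per column.

    Numeric columns become contiguous typed buffers, which is the
    property a columnar format relies on; this is an illustration,
    not Arrow's actual memory layout.
    """
    return {
        "host": [r["host"] for r in records],
        "cpu": array("d", (r["cpu"] for r in records)),  # 64-bit floats
        "mem": array("q", (r["mem"] for r in records)),  # 64-bit signed ints
    }


columns = to_columns(rows)

# An analytical query such as "average cpu" scans a single buffer
# and never touches the host or mem values.
mean_cpu = sum(columns["cpu"]) / len(columns["cpu"])
```

An aggregate over one field reads only that field's buffer, which is why column-oriented layouts suit analytical workloads that scan a few columns across many rows.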