How we made querying Pandas DataFrames with chDB 87x faster
The author has spent almost two years working on chDB, an embedded version of ClickHouse that runs in-process. In this blog post, the author shares the performance improvements made to chDB over the last few months. chDB was initially designed with simplicity in mind, so querying in-memory data required serialization, deserialization, and memory copying, which led to unsatisfactory performance. To improve efficiency, a Python table engine was introduced that lets users run SQL directly on DataFrame variables as if they were tables. The chDB Python Table Engine is relatively simple, but it faced challenges related to the Global Interpreter Lock (GIL) and object reference counting in CPython. Performance was optimized by reducing overhead, minimizing CPython API function calls, batching data copying, and rewriting the Python string encoding and decoding logic in C++. These changes cut the time for ClickBench query Q23 from 8.6 seconds to 0.56 seconds, a 15x improvement. In comparisons with DuckDB, chDB was faster when querying DataFrames containing 10 million rows of ClickBench data. The latest version of chDB, v2.0.2, introduces a mechanism for users to define their own table-returning logic in Python and supports querying APIs that return JSON arrays as ClickHouse tables.
Company
ClickHouse
Date published
Aug. 29, 2024
Author(s)
Auxten Wang
Word count
1707
Language
English
Hacker News points
3