ClickHouse and the MTA Data Challenge
The Metropolitan Transportation Authority (MTA) has launched an Open Data Challenge that invites developers and data enthusiasts to build projects on MTA datasets. One of the largest datasets available is the turnstile dataset, which records entry and exit values for turnstiles in New York City over several years. ClickHouse, an OLAP database designed for scale, has made this dataset available in its new playground, where users can query the data for free. The post is a detailed guide to loading and cleaning the MTA transit dataset with ClickHouse. It covers schema improvements, handling cumulative values and outliers, and dealing with missing or inconsistent station names.
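To illustrate what handling cumulative values and outliers can involve, here is a minimal Python sketch. It is not the article's method: the post works in ClickHouse, and its queries are not reproduced in this summary. The function name `counter_deltas`, the `max_delta` threshold, and the sample readings are all hypothetical. The sketch assumes each turnstile reports a running total, so the count for an interval is the difference between consecutive readings, and that a negative or implausibly large difference signals a counter reset or a bad reading.

```python
from typing import List, Optional


def counter_deltas(readings: List[int], max_delta: int = 10_000) -> List[Optional[int]]:
    """Convert cumulative counter readings into per-interval counts.

    A difference that is negative (for example after a counter reset) or
    larger than max_delta (a hypothetical plausibility threshold) is
    reported as None instead of a count.
    """
    deltas: List[Optional[int]] = []
    # Pair each reading with the one that follows it.
    for prev, curr in zip(readings, readings[1:]):
        diff = curr - prev
        if diff < 0 or diff > max_delta:
            deltas.append(None)  # treat as an outlier, not a real count
        else:
            deltas.append(diff)
    return deltas


# Made-up readings: a reset after the third value, then an implausible jump.
sample = [1000, 1012, 1030, 5, 20, 900020]
print(counter_deltas(sample))  # [12, 18, None, 15, None]
```

Marking outliers as `None` rather than dropping them keeps the output aligned with the input intervals, so a later step can decide whether to discard or estimate those values.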
Company
ClickHouse
Date published
Oct. 24, 2024
Author(s)
The PME Team
Word count
3433
Language
English
Hacker News points
2