Company
Date Published
Author
Makbule Gulcin Ozsoy
Word count
732
Language
English
Hacker News points
None

Summary

The Neo4j Text2Cypher (2024) Dataset is a machine learning dataset for training and benchmarking Text2Cypher models, which turn plain-text questions into Cypher queries. It was created by combining publicly available datasets, then cleaning and organizing them for easier use. The dataset contains 44,387 instances: 39,554 in the training split and 4,833 in the test split. Data preparation involved identifying and gathering the source datasets, combining and cleaning the data, creating training and test splits, and splitting the remaining datasets. The result is meant to support machine learning models that translate natural language into programming or domain-specific languages, in this case the Cypher query language.
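
The preparation steps named in the summary (combine, clean, split) can be illustrated with a small sketch. This is not the authors' pipeline: the source names, the sample question/Cypher pairs, the deduplication rule, and the random split with a fixed seed are all assumptions chosen for illustration.

```python
import random


def combine_and_clean(sources):
    """Merge (question, cypher) pairs from several sources.

    Hypothetical cleaning rule: strip whitespace, drop pairs with an
    empty field, and drop duplicates (case-insensitive on the question).
    """
    seen = set()
    cleaned = []
    for source_name, records in sources.items():
        for question, cypher in records:
            question, cypher = question.strip(), cypher.strip()
            if not question or not cypher:
                continue  # incomplete pair, skip it
            key = (question.lower(), cypher)
            if key in seen:
                continue  # duplicate across or within sources
            seen.add(key)
            cleaned.append(
                {"question": question, "cypher": cypher, "source": source_name}
            )
    return cleaned


def train_test_split(records, test_fraction, seed=42):
    """Shuffle deterministically, then cut off a test portion (assumed strategy)."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Made-up sample data standing in for the publicly available source datasets.
sources = {
    "source_a": [
        ("How many movies are there?", "MATCH (m:Movie) RETURN count(m)"),
        ("Who directed The Matrix?",
         "MATCH (p:Person)-[:DIRECTED]->(m:Movie {title: 'The Matrix'}) RETURN p.name"),
    ],
    "source_b": [
        ("how many movies are there? ", "MATCH (m:Movie) RETURN count(m)"),  # duplicate
        ("List all actors", ""),  # missing Cypher
        ("List all people", "MATCH (p:Person) RETURN p.name"),
    ],
}

cleaned = combine_and_clean(sources)
train, test = train_test_split(cleaned, test_fraction=0.34)
print(len(cleaned), len(train), len(test))
```

Each cleaned record pairs a plain-text question with its Cypher query, which is the input/output shape a Text2Cypher model is trained on. The reported split sizes are also consistent with the total: 39,554 + 4,833 = 44,387.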