Introducing the Neo4j Text2Cypher (2024) Dataset

Company

Neo4j

Date Published

Nov. 7, 2024

Author

Makbule Gulcin Ozsoy

Word count

732

Language

English

Hacker News points

None

URL

neo4j.com/blog/developer/introducing-neo4j-text2cypher-dataset

Summary

The Neo4j Text2Cypher (2024) Dataset is a machine learning dataset for training and benchmarking Text2Cypher models, which translate natural-language questions into Cypher queries. It was created by combining publicly available datasets, then cleaning and organizing them so they are easier to use. The dataset contains 44,387 instances: 39,554 in the training split and 4,833 in the test split. Data preparation involved identifying and gathering the source datasets, combining and cleaning the data, creating training and test splits, and splitting the remaining datasets. The resulting dataset is intended to support models that translate natural language into programming or domain-specific languages, in this case plain text into the Cypher query language.
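
The combine, clean, and split steps named in the summary can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the field names (`question`, `cypher`, `source`), the deduplication rule, the helper names, and the `test_fraction` default are all assumptions chosen for illustration, and the summary does not state how the original splits were drawn.

```python
import random


def combine_and_clean(sources):
    """Merge {source_name: [records]} into one list, dropping empty and duplicate pairs.

    Hypothetical cleaning rule: a record is a duplicate if the lowercased
    question and the Cypher text both match an earlier record.
    """
    seen = set()
    combined = []
    for source_name, records in sources.items():
        for record in records:
            question = record["question"].strip()
            cypher = record["cypher"].strip()
            if not question or not cypher:
                continue  # skip incomplete pairs
            key = (question.lower(), cypher)
            if key in seen:
                continue  # skip duplicates across or within sources
            seen.add(key)
            combined.append(
                {"question": question, "cypher": cypher, "source": source_name}
            )
    return combined


def train_test_split(records, test_fraction=0.1, seed=42):
    """Shuffle deterministically and return (train, test) lists.

    test_fraction is an assumed value, not the ratio used for the real dataset.
    """
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    # Toy input standing in for the publicly available source datasets.
    sources = {
        "dataset_a": [
            {"question": "Who acted in The Matrix?",
             "cypher": "MATCH (p:Person)-[:ACTED_IN]->(m:Movie {title: 'The Matrix'}) RETURN p.name"},
            {"question": "", "cypher": "MATCH (n) RETURN n"},
        ],
        "dataset_b": [
            {"question": "who acted in the matrix?",
             "cypher": "MATCH (p:Person)-[:ACTED_IN]->(m:Movie {title: 'The Matrix'}) RETURN p.name"},
            {"question": "How many movies are there?",
             "cypher": "MATCH (m:Movie) RETURN count(m)"},
        ],
    }
    cleaned = combine_and_clean(sources)
    train, test = train_test_split(cleaned, test_fraction=0.5)
    print(len(cleaned), len(train), len(test))
```

For reference, the split sizes reported in the summary are consistent with each other: 39,554 + 4,833 = 44,387, which puts roughly 89% of instances in training and 11% in test.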