ML Infrastructure Tools for Data Preparation
The text discusses the importance of machine learning (ML) infrastructure platforms for businesses across industries. It breaks the ML workflow into three stages: data preparation, model building, and production. Data preparation is the crucial stage where raw data is transformed into inputs for training models. This involves sourcing data from different stores, ensuring completeness, adding labels, and transforming data to generate features. Tools and platforms that assist with these tasks include Elasticsearch, Hive, Qubole, Scale AI, Figure Eight, Labelbox, Amazon SageMaker, Trifacta, Paxata, Alteryx, Spark, Databricks, Domino, and Cloudera Workbench. The text also highlights the challenges of data preparation, such as sourcing data from multiple locations, ensuring completeness, and keeping data clean. It emphasizes tracking versioned data transformations and using feature stores to reduce duplicative work and compute costs.
Company
Arize
Date published
May 14, 2020
Author(s)
Aparna Dhinakaran
Word count
1278
Hacker News points
None found.
Language
English