Cleanlab

Founded in 2021. Privately Held.

Data quality issue identification and remediation.

Blog posts published by month since the start of

54 total blog posts published.

Blog content

post title | author | published | words | HN
CleanVision: Audit your Image Data for better Computer Vision | Sanjana Garg, Ulyana Tkachenko, Yiming Chen, Elías Snorrason, Jonas Mueller | Mar. 22, 2023 | 1729 | 4
Assessing the Quality of Synthetic Data with Cleanlab Studio | Elías Snorrason | Jul. 12, 2023 | 2176 | 2
Overcoming Hallucinations with the Trustworthy Language Model | Anish Athalye, Jonas Mueller, Curtis Northcutt, Hui Wen Goh, Ulyana Tkachenko | Apr. 25, 2024 | 4782 | 2
Letter from the CEO: Announcing our Series A and Cleanlab's Trustworthy Language Model | Curtis Northcutt | Oct. 10, 2023 | 742 | -
Detecting Dataset Drift and Non-IID Sampling: A k-Nearest Neighbors approach that works for Image/Text/Audio/Numeric Data | Jesse Cummings, Elías Snorrason, Jonas Mueller | May 30, 2023 | 2203 | 4
Detecting Label Errors in Entity Recognition Data | Wei-Chen (Eric) Wang, Elías Snorrason, Jonas Mueller | Oct. 12, 2022 | 1066 | -
Effectively Annotate Text Data for Transformers via Active Learning + Re-labeling | Chris Mauck | May 22, 2023 | 1802 | -
Training Transformer Networks in Scikit-Learn?! | Hui Wen Goh | Mar. 08, 2023 | 1677 | 4
Improving any OpenAI Language Model by Systematically Improving its Data | Chris Mauck, Jonas Mueller | Jun. 01, 2023 | 1898 | -
Ensuring Reliable Few-Shot Prompt Selection for LLMs | Chris Mauck, Jonas Mueller | Aug. 15, 2023 | 1678 | 3
How To Train and Deploy Reliable Models on Messy Real-World Data With a Few Clicks | Hui Wen Goh, Jonas Mueller, Anish Athalye | Jul. 24, 2023 | 1518 | 5
Detecting Annotation Errors in Semantic Segmentation Data | Vedang Lad, Jonas Mueller | Nov. 02, 2023 | 845 | 1
cleanlab 2.1 adds Multi-Annotator Analysis and Outlier Detection: toward a broad framework for Data-Centric AI | Curtis Northcutt, Jonas Mueller | Sep. 21, 2022 | 974 | -
Comparing tools for Data Science, Data Quality, Data Annotation, and AI/ML | Jonas Mueller | Feb. 09, 2024 | 1916 | -
Automatically Detect Problematic Content in any Text Dataset | Hui Wen Goh | Dec. 19, 2023 | 1220 | -
Announcing Auto-Labeling Agent: Your Assistant for Rapid and High Quality Labeling | Emily Barry | Jul. 17, 2024 | 776 | -
Finding Label Issues in Image Classification Datasets | Wei Jing Lok, Jonas Mueller | Apr. 21, 2022 | 1696 | -
The Stanford Cars Dataset aka Cars196 (cited in 1000+ papers) contains many Fine-Grained Errors | Chris Mauck | May 24, 2023 | 592 | -
Reduce Legal Discovery Work by 10x with AI that Curates Documents and Fixes Errors | Chris Mauck | Aug. 03, 2023 | 1356 | 2
Whisking Away Errors: How Cleanlab Studio Served Up Fixes for the Food-101N Computer Vision Dataset | Chris Mauck | Sep. 11, 2023 | 546 | -
cleanlab 2.3 adds support for Active Learning, Tensorflow/Keras models made sklearn-compatible, and highly scalable Label Error Detection | Jonas Mueller | Mar. 01, 2023 | 1045 | -
How to detect bad data in your instruction tuning dataset (for better LLM fine-tuning) | Jimming He, Sanjana Garg, Jonas Mueller | Feb. 07, 2024 | 2278 | -
Use Cleanlab to Improve LLMs: Find Errors in Human Feedback in the Anthropic RLHF Dataset | Chris Mauck, Jonas Mueller | Apr. 11, 2023 | 351 | -
An open-source platform to catch all sorts of issues in all sorts of datasets | Elías Snorrason, Jonas Mueller | Feb. 21, 2024 | 1082 | -
ActiveLab: Active Learning with Data Re-Labeling | Hui Wen Goh, Jonas Mueller | Mar. 02, 2023 | 1720 | 4
Enhancing Product Analytics and E-commerce with Data-Centric AI | Sanjana Garg | Jul. 06, 2023 | 1484 | 2
The Fashion MNIST Dataset (cited in 2,200+ papers) contains Hundreds of Miscategorized Items | Ganesh Tata, Chris Mauck | Jun. 09, 2023 | 446 | -
Don’t Let Your Messy Documents Run You RAG-Ged. Announcing Document Curation in Cleanlab Studio | Emily Barry | Jun. 07, 2024 | 311 | -
Automated Correction of Satellite Imagery Data | Chris Mauck, Aditya Thyagarajan | Sep. 20, 2023 | 673 | 2
Ensure high-quality data quickly via AI validation of which data is Well Labeled | Ulyana Tkachenko, Jonas Mueller | Aug. 28, 2023 | 1544 | -
Letter from the CEO: Announcing Our Seed Funding and the Launch of Cleanlab Studio for Enterprise | Curtis Northcutt | Jul. 20, 2023 | 1074 | -
Detecting Errors in Numerical Data via any Regression Model | Jonas Mueller, Mayank Kumar, Hui Wen Goh, Hang Zhou | Sep. 18, 2023 | 1108 | 2
Accelerate Time Series Modeling with Cleanlab Studio AutoML: Train and Deploy in Minutes | Matt Turk | Jul. 11, 2024 | 2053 | -
The Office-Home Dataset (cited by 600+ papers) contains hundreds of incorrect labels and outliers. | Chris Mauck, Jonas Mueller | Apr. 21, 2023 | 478 | -
Datalab: A Linter for ML Datasets | Elías Snorrason, Sanjana Garg, Hui Wen Goh, Jesse Cummings, Jonas Mueller | May 16, 2023 | 1879 | 2
Finding Label Issues in Audio Classification Datasets | Johnson Kuan, Jonas Mueller, Anish Athalye | Apr. 27, 2022 | 2173 | -
Automatically Find and Fix Issues in Image/Document Tags and other Multi-Label Datasets | Chris Mauck, Ulyana Tkachenko | Oct. 17, 2023 | 990 | 2
Most AI & Analytics are impaired by data issues. Now AI can help you fix them. | Jonas Mueller, Curtis Northcutt, Anish Athalye | Jul. 31, 2023 | 1948 | 1
How we built Cleanlab Vizzy | Caleb Chiam, Luke Mainwaring, Yiming Chen | Aug. 17, 2022 | 2388 | -
cleanlab now supports all major ML tasks — including Regression, Object Detection, and Image Segmentation | Chris Mauck, Curtis Northcutt, Jonas Mueller | Sep. 14, 2023 | 1200 | -
Automated Quality Assurance for Object Detection Datasets | Ulyana Tkachenko, Aditya Thyagarajan, Jonas Mueller | Sep. 26, 2023 | 1370 | 1
Handling Label Errors in Text Classification Datasets | Wei Jing Lok, Jonas Mueller, Hui Wen Goh | May 10, 2022 | 3490 | -
How to Filter Unsafe and Low-Quality Images from any Dataset: A Product Catalog Case Study | Sanjana Garg, Jonas Mueller | Jan. 22, 2024 | 1505 | -
How to Generate Better Synthetic Image Datasets with Stable Diffusion | Elías Snorrason, Jonas Mueller | Oct. 05, 2023 | 2071 | 1
CROWDLAB: Simple and effective algorithms to handle data labeled by multiple annotators | Hui Wen Goh, Ulyana Tkachenko, Jonas Mueller | Oct. 05, 2022 | 1320 | 2
Cleanlab: The History, Present, and Future | Curtis Northcutt (Co-Founder & CEO) | Apr. 01, 2022 | 1849 | -
cleanlab 2.0: Automatically Find Errors in ML Datasets | Curtis Northcutt, Jonas Mueller, Anish Athalye | Apr. 21, 2022 | 841 | 2
Automated Data Quality at Scale | Anish Athalye, Angela Liu | Jul. 27, 2023 | 1155 | 1
Automatic Error Detection for Image/Text Tagging and Multi-label Datasets | Aditya Thyagarajan, Elías Snorrason, Curtis Northcutt, Jonas Mueller | Nov. 29, 2022 | 1434 | 1
Out-of-Distribution Detection via Embeddings or Predictions | Ulyana Tkachenko, Jonas Mueller | Oct. 19, 2022 | 1264 | -
Improving Legal Judgement Prediction with Data-Centric AI | Hui Wen Goh | Jun. 27, 2023 | 1658 | -
A Simple Adjustment Improves Out-of-Distribution Detection for Any Classifier | Ulyana Tkachenko, Jonas Mueller, Curtis Northcutt | Oct. 19, 2022 | 1523 | -
Handling Mislabeled Tabular Data to Improve Your XGBoost Model | Chris Mauck | Feb. 06, 2023 | 1877 | 2
Beware of Unreliable Data in Model Evaluation: A LLM Prompt Selection case study with Flan-T5 | Chris Mauck, Jonas Mueller | Jun. 29, 2023 | 1366 | 66

By Matt Makai. 2021-2024.