
How to create collaborative Machine Learning datasets for projects gathering 50+ collaborators

What's this blog post about?

The article describes how Omdena used collaborative machine learning (ML) datasets in a project with the Global Partnership for Sustainable Development Data (GPSDD) that aimed to improve food security in Senegal. The project covered crop yield prediction, climate risk, crop diseases, deforestation, and food storage and transport. The team's main challenges were handling heavy raw satellite images, preparing preprocessed data for deep learning models, and making the datasets accessible to all collaborators. They resolved most of these issues with Activeloop, which the post describes as a fast and simple framework for building and scaling data pipelines for ML, and used it to store the datasets for training their deep learning models. The datasets comprised 32-bin histograms of satellite images, normalized difference vegetation index (NDVI) values, and ground-truth yield values. Each dataset was loaded from the Activeloop Hub by its unique path, which made it easy to combine datasets when needed. The team used TensorFlow functions to split the data into training, validation, and test sets, then trained their CNN model with a shared set of commands. Because the data lived in Activeloop, every developer worked from the same dataset without storing it locally, and updating a dataset only required re-uploading the new version to the same tag. The authors note that they could have stored the satellite images in Activeloop as well, but chose to keep them in an S3 bucket for easier access and processing. They also note that the ground-truth yield values could probably be stored more efficiently using the available schemas. Overall, collaborative ML datasets in Activeloop benefited the project by providing efficient storage, easy access, and consistent dataset paths for all collaborators.
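To make the preprocessing steps mentioned above more concrete, here is a minimal pure-Python sketch of computing NDVI, binning values into a 32-bin histogram, and splitting samples into training, validation, and test sets. It is an illustration only, not the team's code: the original post used TensorFlow functions and the Activeloop Hub, and its histograms were computed on the satellite images themselves. The function names (`ndvi`, `histogram`, `split`), the histogram range of [-1, 1], and the 80/10/10 split ratio are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def ndvi(nir: Sequence[float], red: Sequence[float]) -> List[float]:
    """Per-pixel NDVI = (NIR - Red) / (NIR + Red); 0.0 where both bands are 0."""
    if len(nir) != len(red):
        raise ValueError("bands must have the same number of pixels")
    out = []
    for n, r in zip(nir, red):
        total = n + r
        out.append((n - r) / total if total != 0 else 0.0)
    return out


def histogram(values: Sequence[float], bins: int = 32,
              lo: float = -1.0, hi: float = 1.0) -> List[int]:
    """Count values into `bins` equal-width bins over [lo, hi] (range is assumed)."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if v < lo or v > hi:
            continue  # skip out-of-range pixels
        idx = min(int((v - lo) / width), bins - 1)  # v == hi goes in the last bin
        counts[idx] += 1
    return counts


def split(samples: Sequence, train: float = 0.8,
          val: float = 0.1) -> Tuple[list, list, list]:
    """Sequential train/validation/test split; the remainder goes to test."""
    items = list(samples)
    n_train = int(len(items) * train)
    n_val = int(len(items) * val)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])


# Hypothetical toy data: four pixels from a near-infrared and a red band.
nir_band = [0.8, 0.7, 0.3, 0.0]
red_band = [0.2, 0.1, 0.3, 0.0]
features = histogram(ndvi(nir_band, red_band))  # one 32-value feature vector
train_set, val_set, test_set = split(range(10))
```

A sequential split like this keeps the example deterministic; in practice the samples would usually be shuffled first, or split by region or season, depending on how the model is meant to generalize.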

Company
Activeloop

Date published
June 16, 2021

Author(s)
Margaux Masson-...

Word count
1163

Hacker News points
None found.

Language
English


By Matt Makai. 2021-2024.