Company: Voxel51
Author: Harpreet Sahota
Word count: 2518
Language: English
Hacker News points: None

Summary

The AIMv2 model outperforms CLIP on ImageNet-D, a benchmark of synthetically generated images that pushes image classification models to their limits and reveals critical failures in model robustness. The dataset consists of 4,835 "hard images" spanning 113 categories shared between ImageNet and ObjectNet, varied across 547 nuisance factors such as backgrounds, textures, and materials.

AIMv2 is pre-trained with a novel multimodal autoregressive objective that predicts both image patches and text tokens, so every input patch and token contributes a training signal, making pre-training efficient. The model excels at image recognition, grounding, and multimodal understanding tasks, consistently matching or outperforming existing self-supervised and vision-language pre-trained models.

On ImageNet-D, AIMv2's autoregressive approach proves more resilient to synthetic variations than CLIP's contrastive learning, achieving a top-line accuracy of 41.92% compared to CLIP's 25.07%. The FiftyOne framework provides tools for exploring, analyzing, and visualizing the performance of AIMv2 and CLIP on ImageNet-D, including feature extraction, zero-shot classification, model evaluation, and hardness analysis.
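To make the autoregressive objective concrete, here is a minimal toy sketch of a multimodal autoregressive loss in PyTorch. Everything in it (the dimensions, the ToyMultimodalAR class, the simple MSE-plus-cross-entropy loss) is a hypothetical illustration of the general idea, not Apple's AIMv2 implementation.

```python
# Toy sketch: a causal transformer over [image patches; text tokens], where
# every position predicts the next element of the sequence. NOT the real
# AIMv2 architecture or loss; dimensions are arbitrary toy values.
import torch
import torch.nn as nn

PATCH_DIM, VOCAB, D_MODEL = 768, 32000, 512
N_PATCHES, N_TOKENS = 16, 8

class ToyMultimodalAR(nn.Module):
    def __init__(self):
        super().__init__()
        self.patch_in = nn.Linear(PATCH_DIM, D_MODEL)   # embed image patches
        self.tok_in = nn.Embedding(VOCAB, D_MODEL)      # embed text tokens
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.patch_out = nn.Linear(D_MODEL, PATCH_DIM)  # regress next patch
        self.tok_out = nn.Linear(D_MODEL, VOCAB)        # predict next token

    def forward(self, patches, tokens):
        # Concatenate patches, then tokens, into one causal sequence.
        x = torch.cat([self.patch_in(patches), self.tok_in(tokens)], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(x.shape[1])
        h = self.decoder(x, mask=causal)
        return self.patch_out(h), self.tok_out(h)

patches = torch.randn(2, N_PATCHES, PATCH_DIM)          # fake image patches
tokens = torch.randint(0, VOCAB, (2, N_TOKENS))         # fake caption tokens
patch_pred, tok_logits = ToyMultimodalAR()(patches, tokens)

# Every position supplies a training signal: early positions regress the
# next patch (MSE), and the remaining positions predict the next text
# token (cross-entropy).
patch_loss = nn.functional.mse_loss(
    patch_pred[:, : N_PATCHES - 1], patches[:, 1:])
tok_loss = nn.functional.cross_entropy(
    tok_logits[:, N_PATCHES - 1 : -1].reshape(-1, VOCAB), tokens.reshape(-1))
loss = patch_loss + tok_loss
print(loss.item())
```

The point of the sketch is the contrast with contrastive learning: where CLIP gets one loss term per image-text pair, an autoregressive objective extracts a prediction target from every patch and token position.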
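The workflow below is a hedged sketch of how that FiftyOne analysis can be wired up. It assumes the ImageNet-D dataset published on the Hugging Face Hub as "Voxel51/ImageNet-D", a ground_truth label field, and the CLIP model from the FiftyOne model zoo; those names, and the sample cap, are assumptions to adapt to your own setup.

```python
# Hedged sketch of the FiftyOne workflow: load ImageNet-D, run zero-shot
# CLIP classification, extract embeddings, evaluate, and compute hardness.
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.brain as fob
from fiftyone.utils.huggingface import load_from_hub

# Load ImageNet-D (assumed Hub repo name) into a FiftyOne dataset.
dataset = load_from_hub("Voxel51/ImageNet-D", max_samples=100)

# Zero-shot classification with CLIP, prompting with the dataset's classes.
# store_logits=True is needed later for hardness analysis.
clip = foz.load_zoo_model(
    "clip-vit-base32-torch",
    text_prompt="A photo of a",
    classes=dataset.distinct("ground_truth.label"),  # assumed field name
)
dataset.apply_model(clip, label_field="clip_predictions", store_logits=True)

# Feature extraction: compute embeddings and index them for visualization.
embeddings = dataset.compute_embeddings(clip)
fob.compute_visualization(dataset, embeddings=embeddings, brain_key="clip_viz")

# Evaluate predictions against ground truth and print a report.
results = dataset.evaluate_classifications(
    "clip_predictions",
    gt_field="ground_truth",
    eval_key="clip_eval",
)
results.print_report()

# Hardness analysis: rank samples by how difficult they are for the model.
fob.compute_hardness(dataset, label_field="clip_predictions")
hardest = dataset.sort_by("hardness", reverse=True)

# Explore the hardest samples interactively in the FiftyOne App.
session = fo.launch_app(hardest)
```

Running the same steps with an AIMv2 checkpoint in place of the CLIP model yields the side-by-side comparison the article describes.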