Company: Voxel51
Author: Harpreet Sahota
Word count: 2518
Language: English
Hacker News points: None

Summary

The AIMv2 model outperforms CLIP on ImageNet-D, a benchmark of synthetically generated images that pushes image classification models to their limits and reveals critical failures in model robustness. The dataset consists of 4,835 "hard images" spanning 113 categories shared between ImageNet and ObjectNet, varied across 547 nuisance factors such as backgrounds, textures, and materials.

AIMv2 is pre-trained with a novel multimodal autoregressive objective that predicts both image patches and text tokens, so every input patch and token contributes a training signal, making pre-training efficient. The model excels at image recognition, grounding, and multimodal understanding tasks, consistently matching or outperforming existing self-supervised and vision-language pre-trained models.

On ImageNet-D, AIMv2's autoregressive approach proves more resilient to synthetic variations than CLIP's contrastive learning, achieving a top-line accuracy of 41.92% compared to CLIP's 25.07%. The FiftyOne framework provides tools for exploring, analyzing, and visualizing the performance of AIMv2 and CLIP on ImageNet-D, including feature extraction, zero-shot classification, model evaluation, and hardness analysis.
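To make the autoregressive objective concrete, here is a minimal toy sketch of a multimodal autoregressive loss in PyTorch. Everything in it (the dimensions, the ToyMultimodalAR class, the simple MSE-plus-cross-entropy loss) is a hypothetical illustration of the general idea, not Apple's AIMv2 implementation.

```python
# Toy sketch: a causal transformer over [image patches; text tokens], where
# every position predicts the next element of the sequence. NOT the real
# AIMv2 architecture or loss; dimensions are arbitrary toy values.
import torch
import torch.nn as nn

PATCH_DIM, VOCAB, D_MODEL = 768, 32000, 512
N_PATCHES, N_TOKENS = 16, 8

class ToyMultimodalAR(nn.Module):
    def __init__(self):
        super().__init__()
        self.patch_in = nn.Linear(PATCH_DIM, D_MODEL)   # embed image patches
        self.tok_in = nn.Embedding(VOCAB, D_MODEL)      # embed text tokens
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)
        self.patch_out = nn.Linear(D_MODEL, PATCH_DIM)  # regress next patch
        self.tok_out = nn.Linear(D_MODEL, VOCAB)        # predict next token

    def forward(self, patches, tokens):
        # Concatenate patches, then tokens, into one causal sequence.
        x = torch.cat([self.patch_in(patches), self.tok_in(tokens)], dim=1)
        causal = nn.Transformer.generate_square_subsequent_mask(x.shape[1])
        h = self.decoder(x, mask=causal)
        return self.patch_out(h), self.tok_out(h)

patches = torch.randn(2, N_PATCHES, PATCH_DIM)          # fake image patches
tokens = torch.randint(0, VOCAB, (2, N_TOKENS))         # fake caption tokens
patch_pred, tok_logits = ToyMultimodalAR()(patches, tokens)

# Every position supplies a training signal: early positions regress the
# next patch (MSE), and the remaining positions predict the next text
# token (cross-entropy).
patch_loss = nn.functional.mse_loss(
    patch_pred[:, : N_PATCHES - 1], patches[:, 1:])
tok_loss = nn.functional.cross_entropy(
    tok_logits[:, N_PATCHES - 1 : -1].reshape(-1, VOCAB), tokens.reshape(-1))
loss = patch_loss + tok_loss
print(loss.item())
```

The point of the sketch is the contrast with contrastive learning: where CLIP gets one loss term per image-text pair, an autoregressive objective extracts a prediction target from every patch and token position.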
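The workflow below is a hedged sketch of how that FiftyOne analysis can be wired up. It assumes the ImageNet-D dataset published on the Hugging Face Hub as "Voxel51/ImageNet-D", a ground_truth label field, and the CLIP model from the FiftyOne model zoo; those names, and the sample cap, are assumptions to adapt to your own setup.

```python
# Hedged sketch of the FiftyOne workflow: load ImageNet-D, run zero-shot
# CLIP classification, extract embeddings, evaluate, and compute hardness.
import fiftyone as fo
import fiftyone.zoo as foz
import fiftyone.brain as fob
from fiftyone.utils.huggingface import load_from_hub

# Load ImageNet-D (assumed Hub repo name) into a FiftyOne dataset.
dataset = load_from_hub("Voxel51/ImageNet-D", max_samples=100)

# Zero-shot classification with CLIP, prompting with the dataset's classes.
# store_logits=True is needed later for hardness analysis.
clip = foz.load_zoo_model(
    "clip-vit-base32-torch",
    text_prompt="A photo of a",
    classes=dataset.distinct("ground_truth.label"),  # assumed field name
)
dataset.apply_model(clip, label_field="clip_predictions", store_logits=True)

# Feature extraction: compute embeddings and index them for visualization.
embeddings = dataset.compute_embeddings(clip)
fob.compute_visualization(dataset, embeddings=embeddings, brain_key="clip_viz")

# Evaluate predictions against ground truth and print a report.
results = dataset.evaluate_classifications(
    "clip_predictions",
    gt_field="ground_truth",
    eval_key="clip_eval",
)
results.print_report()

# Hardness analysis: rank samples by how difficult they are for the model.
fob.compute_hardness(dataset, label_field="clip_predictions")
hardest = dataset.sort_by("hardness", reverse=True)

# Explore the hardest samples interactively in the FiftyOne App.
session = fo.launch_app(hardest)
```

Running the same steps with an AIMv2 checkpoint in place of the CLIP model yields the side-by-side comparison the article describes.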