Company
Date Published
Author
Harpreet Sahota
Word count
3075
Language
English
Hacker News points
None

Summary

The article introduces Visual Spectrogram Classification (VSC), a task in which vision language models (VLMs) classify audio by analyzing spectrogram images. The ESC-10 dataset is used to test the hypothesis that VLMs can effectively bridge the visual and audio domains.

To explore the intersection of vision and audio understanding, the author compares embeddings from three models: CLAP, Music2Latent, and AIMv2. CLAP embeddings show clear clustering by sound category, Music2Latent embeddings show moderate clustering with some overlap between categories, and AIMv2 embeddings show significant mixing between categories with no clear clustering pattern.

The author then hypothesizes that the specialized audio model (CLAP) will significantly outperform the VLM approach on zero-shot classification, implements both approaches, and evaluates them using model evaluation panels in FiftyOne. Janus-Pro, the VLM used to classify the spectrogram images, performs poorly on zero-shot classification, but its failures offer valuable insight into the limitations of treating audio classification as a purely visual task. The experiment underscores the importance of domain-specific architectures, while suggesting that with few-shot learning, larger models, and better prompt engineering, VLMs may still have untapped potential in audio understanding.
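The core preprocessing step behind VSC is turning each audio clip into a spectrogram image that a vision model can treat as a picture. As a rough illustration of that step (not the post's exact code), a mel-spectrogram for a single clip could be rendered like this; the file paths are placeholders:

```python
# Minimal sketch (assumed, not the author's exact pipeline): render a
# mel-spectrogram image for one audio clip so a VLM can "look at" it.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

audio_path = "esc10/dog_bark.wav"  # placeholder path, not from the post

# Load the clip and compute a mel-scaled spectrogram in decibels
y, sr = librosa.load(audio_path, sr=None)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

# Save it as a plain image with no axes, ticks, or colorbar
fig, ax = plt.subplots(figsize=(4, 4))
librosa.display.specshow(mel_db, sr=sr, ax=ax)
ax.axis("off")
fig.savefig("dog_bark_spectrogram.png", bbox_inches="tight", pad_inches=0)
plt.close(fig)
```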
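The clustering comparison between CLAP, Music2Latent, and AIMv2 is the kind of analysis FiftyOne's embeddings visualization supports. A minimal sketch, assuming the embeddings have already been computed and stored on the samples (the dataset and field names below are hypothetical):

```python
import fiftyone as fo
import fiftyone.brain as fob

# Assumed setup: a dataset of spectrogram images with embeddings stored in
# the fields "clap_embeddings", "music2latent_embeddings", "aimv2_embeddings"
dataset = fo.load_dataset("esc10-spectrograms")  # placeholder dataset name

for field in ["clap_embeddings", "music2latent_embeddings", "aimv2_embeddings"]:
    # Project each embedding space to 2D (UMAP) for side-by-side comparison
    fob.compute_visualization(
        dataset,
        embeddings=field,
        method="umap",
        brain_key=f"{field}_umap",
    )

# Inspect the clusters in the App's Embeddings panel, colored by label
session = fo.launch_app(dataset)
```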
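For the specialized-audio baseline, zero-shot classification with CLAP can be run directly against the ESC-10 category names. A hedged sketch using the Hugging Face zero-shot audio classification pipeline; the checkpoint and label phrasing are assumptions, not necessarily what the author used:

```python
from transformers import pipeline

# Assumed CLAP checkpoint; the post may use a different one
classifier = pipeline(
    task="zero-shot-audio-classification",
    model="laion/clap-htsat-unfused",
)

# ESC-10 categories phrased as short text prompts (wording is an assumption)
esc10_labels = [
    "dog bark", "rain", "sea waves", "crying baby", "clock tick",
    "sneezing", "helicopter", "chainsaw", "rooster", "crackling fire",
]

# Score one clip against all candidate labels; the path is a placeholder
result = classifier("esc10/dog_bark.wav", candidate_labels=esc10_labels)
print(result[0])  # highest-scoring label and its score
```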
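Comparing the CLAP and Janus-Pro predictions with FiftyOne's evaluation API, and then in the App's Model Evaluation panel, could look roughly like this; the prediction and ground-truth field names are placeholders:

```python
import fiftyone as fo

dataset = fo.load_dataset("esc10-spectrograms")  # placeholder dataset name

# Hypothetical field names for the two sets of predictions
runs = [("clap_predictions", "clap_eval"), ("janus_predictions", "janus_eval")]

for pred_field, eval_key in runs:
    # Compare predictions against the ground-truth labels
    results = dataset.evaluate_classifications(
        pred_field,
        gt_field="ground_truth",
        eval_key=eval_key,
    )
    print(f"--- {pred_field} ---")
    results.print_report()  # per-class precision/recall/F1

# Open the App and compare the two runs in the Model Evaluation panel
session = fo.launch_app(dataset)
```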