Company
Date Published
Author
Harpreet Sahota
Word count
3075
Language
English
Hacker News points
None

Summary

The article introduces Visual Spectrogram Classification (VSC), a task in which vision language models (VLMs) classify audio by analyzing spectrogram images. The ESC-10 dataset is used to test the hypothesis that VLMs can effectively bridge the visual and audio domains.

To explore the intersection of vision and audio understanding, the author compares embeddings from three models: CLAP, Music2Latent, and AIMv2. CLAP embeddings show clear clustering by sound category, Music2Latent embeddings show moderate clustering with some overlap between categories, and AIMv2 embeddings show significant mixing between categories with no clear clustering pattern.

The author then hypothesizes that the specialized audio model (CLAP) will significantly outperform the VLM approach on zero-shot classification, implements both approaches, and evaluates them using model evaluation panels in FiftyOne. Janus-Pro, the VLM used to classify the spectrogram images, performs poorly on zero-shot classification, but its failures offer valuable insight into the limitations of treating audio classification as a purely visual task. The experiment underscores the importance of domain-specific architectures, while suggesting that with few-shot learning, larger models, and better prompt engineering, VLMs may still have untapped potential in audio understanding.
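The core preprocessing step behind VSC is turning each audio clip into a spectrogram image that a vision model can treat as a picture. As a rough illustration of that step (not the post's exact code), a mel-spectrogram for a single clip could be rendered like this; the file paths are placeholders:

```python
# Minimal sketch (assumed, not the author's exact pipeline): render a
# mel-spectrogram image for one audio clip so a VLM can "look at" it.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

audio_path = "esc10/dog_bark.wav"  # placeholder path, not from the post

# Load the clip and compute a mel-scaled spectrogram in decibels
y, sr = librosa.load(audio_path, sr=None)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

# Save it as a plain image with no axes, ticks, or colorbar
fig, ax = plt.subplots(figsize=(4, 4))
librosa.display.specshow(mel_db, sr=sr, ax=ax)
ax.axis("off")
fig.savefig("dog_bark_spectrogram.png", bbox_inches="tight", pad_inches=0)
plt.close(fig)
```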
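The clustering comparison between CLAP, Music2Latent, and AIMv2 is the kind of analysis FiftyOne's embeddings visualization supports. A minimal sketch, assuming the embeddings have already been computed and stored on the samples (the dataset and field names below are hypothetical):

```python
import fiftyone as fo
import fiftyone.brain as fob

# Assumed setup: a dataset of spectrogram images with embeddings stored in
# the fields "clap_embeddings", "music2latent_embeddings", "aimv2_embeddings"
dataset = fo.load_dataset("esc10-spectrograms")  # placeholder dataset name

for field in ["clap_embeddings", "music2latent_embeddings", "aimv2_embeddings"]:
    # Project each embedding space to 2D (UMAP) for side-by-side comparison
    fob.compute_visualization(
        dataset,
        embeddings=field,
        method="umap",
        brain_key=f"{field}_umap",
    )

# Inspect the clusters in the App's Embeddings panel, colored by label
session = fo.launch_app(dataset)
```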
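For the specialized-audio baseline, zero-shot classification with CLAP can be run directly against the ESC-10 category names. A hedged sketch using the Hugging Face zero-shot audio classification pipeline; the checkpoint and label phrasing are assumptions, not necessarily what the author used:

```python
from transformers import pipeline

# Assumed CLAP checkpoint; the post may use a different one
classifier = pipeline(
    task="zero-shot-audio-classification",
    model="laion/clap-htsat-unfused",
)

# ESC-10 categories phrased as short text prompts (wording is an assumption)
esc10_labels = [
    "dog bark", "rain", "sea waves", "crying baby", "clock tick",
    "sneezing", "helicopter", "chainsaw", "rooster", "crackling fire",
]

# Score one clip against all candidate labels; the path is a placeholder
result = classifier("esc10/dog_bark.wav", candidate_labels=esc10_labels)
print(result[0])  # highest-scoring label and its score
```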
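Comparing the CLAP and Janus-Pro predictions with FiftyOne's evaluation API, and then in the App's Model Evaluation panel, could look roughly like this; the prediction and ground-truth field names are placeholders:

```python
import fiftyone as fo

dataset = fo.load_dataset("esc10-spectrograms")  # placeholder dataset name

# Hypothetical field names for the two sets of predictions
runs = [("clap_predictions", "clap_eval"), ("janus_predictions", "janus_eval")]

for pred_field, eval_key in runs:
    # Compare predictions against the ground-truth labels
    results = dataset.evaluate_classifications(
        pred_field,
        gt_field="ground_truth",
        eval_key=eval_key,
    )
    print(f"--- {pred_field} ---")
    results.print_report()  # per-class precision/recall/F1

# Open the App and compare the two runs in the Model Evaluation panel
session = fo.launch_app(dataset)
```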