Vision-Language Models (VLMs) have advanced rapidly in recent years, but evaluating them remains difficult. Many current benchmarks can be solved without genuinely understanding the image, for instance by exploiting language priors, which raises doubts about whether reported scores reflect true visual capability. To address this, the paper "NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples" introduces a vision-centric benchmark built from natural adversarial samples: questions about everyday images that humans answer easily but that force models to rely on the visual input rather than textual shortcuts. The results show that even state-of-the-art VLMs struggle on these tasks, underscoring the need for more robust models. The work argues for critically re-evaluating existing VQA benchmarks and adopting approaches like NaturalBench so that progress in VLM development is measured accurately.
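To make the idea of "forcing models to depend on visual input" concrete, the sketch below scores a model on paired samples, where each group combines two images and two questions whose gold answers differ across the images, so a model that ignores the image cannot answer the whole group correctly. This is a minimal illustrative sketch, not the NaturalBench implementation: the `PairedSample` layout, field names, and `group_accuracy` helper are assumptions introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PairedSample:
    """One paired group: two images x two questions, with gold answers
    arranged so that ignoring the image cannot yield a perfect group score."""
    images: List[str]          # paths or IDs of the two images
    questions: List[str]       # the two questions
    answers: List[List[str]]   # answers[i][j] = gold answer for (image i, question j)


def group_accuracy(samples: List[PairedSample],
                   model: Callable[[str, str], str]) -> float:
    """Fraction of groups where the model answers all four (image, question)
    combinations correctly. A 'blind' model that gives the same answer
    regardless of the image fails any group whose answers flip across images."""
    passed = 0
    for s in samples:
        all_correct = all(
            model(img, q).strip().lower() == s.answers[i][j].strip().lower()
            for i, img in enumerate(s.images)
            for j, q in enumerate(s.questions)
        )
        passed += all_correct
    return passed / len(samples) if samples else 0.0
```

A group-level metric like this is stricter than per-question accuracy: random or image-blind guessing that scores near 50% per question can still score near 0% per group, which is what makes the paired design diagnostic of genuine visual grounding.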