Abstract
Large vision-language models (LVLMs) like GPT-4V, Gemini 1.5, Claude, and Qwen 2.5-VL demonstrate impressive performance across various tasks, but can they truly generalize beyond spurious correlations learned during training?
We introduce SpuriVerse, a benchmark of 124 distinct types of spurious correlations, each identified from GPT-4o errors on real-world visual question-answering datasets and paired with 1 realistic and 10 synthetic VQA samples, for 1,364 multiple-choice questions in total. When we evaluated 15 open and closed-source models, even the best achieved only 37.1% accuracy. However, fine-tuning on the synthetic examples improved accuracy to 78.4%, indicating that models can learn to avoid shortcuts and attend to broader image context when given the right training signal.
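The benchmark's composition can be verified with a quick calculation (the figures come from the abstract; the variable names are illustrative, not the released data format):

```python
# SpuriVerse composition (figures from the abstract)
n_groups = 124            # distinct spurious-correlation types
realistic_per_group = 1   # one realistic VQA sample per group
synthetic_per_group = 10  # ten synthetic VQA samples per group

total_questions = n_groups * (realistic_per_group + synthetic_per_group)
print(total_questions)  # 1364, matching the 1,364 multiple-choice questions
```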
Key Contributions
- SpuriVerse benchmark: 124 spurious correlation types extracted from real-world datasets, each with realistic and synthetic VQA samples covering a broad range of visual shortcuts.
- Systematic evaluation of 15 LVLMs: Comprehensive assessment of open and closed-source models, revealing that even state-of-the-art systems fall far short on spurious pattern generalization.
- Fine-tuning intervention: Demonstrated that fine-tuning on diverse synthetic spurious patterns improves generalization accuracy from 37.1% to 78.4%, pointing toward a tractable solution.
- Failure mode analysis: Identified when and how models rely on dataset artifacts rather than genuine visual reasoning, providing actionable insights for building more robust VLMs.
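Accuracy on a multiple-choice benchmark like this reduces to exact-match scoring of the chosen option letter. A minimal sketch, assuming hypothetical lists of predicted and gold answer letters (SpuriVerse's actual evaluation harness is not specified here):

```python
def accuracy(preds, golds):
    """Fraction of multiple-choice answers matching the gold option letter."""
    assert len(preds) == len(golds), "one prediction per question"
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)

# Toy example: 3 of 4 hypothetical answers match the gold letters.
preds = ["A", "B", "C", "D"]
golds = ["A", "B", "D", "D"]
print(f"{accuracy(preds, golds):.1%}")  # 75.0%
```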
Citation
@article{yang2025spuriverse,
title = {Escaping the SpuriVerse: Can Large Vision-Language Models
Generalize Beyond Seen Spurious Correlations?},
author = {Yang, Yiwei and Lee, Chung Peng and Feng, Shangbin and
Zhao, Dora and Wen, Bingbing and Liu, Anthony Z. and
Tsvetkov, Yulia and Howe, Bill},
journal = {arXiv preprint arXiv:2506.18322},
year = {2025}
}