Abstract
Large vision-language models (LVLMs) like GPT-4V, Gemini 1.5, Claude, and Qwen 2.5-VL demonstrate impressive performance across various tasks, but can they truly generalize beyond spurious correlations learned during training?
We introduce SpuriVerse, a benchmark of 124 distinct types of spurious correlations, each identified from GPT-4o errors on real-world visual question-answering datasets and paired with 1 realistic and 10 synthetic VQA samples, for 1,364 multiple-choice questions in total. When we evaluated 15 open and closed-source models, even the best achieved only 37.1% accuracy. However, fine-tuning on the synthetic examples improved accuracy to 78.4%, indicating that models can learn to avoid shortcuts and attend to broader image context when given the right training signal.
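The benchmark's composition can be verified with a quick calculation (the figures come from the abstract; the variable names are illustrative, not the released data format):

```python
# SpuriVerse composition (figures from the abstract)
n_groups = 124            # distinct spurious-correlation types
realistic_per_group = 1   # one realistic VQA sample per group
synthetic_per_group = 10  # ten synthetic VQA samples per group

total_questions = n_groups * (realistic_per_group + synthetic_per_group)
print(total_questions)  # 1364, matching the 1,364 multiple-choice questions
```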
Key Contributions
- SpuriVerse benchmark: 124 spurious correlation types extracted from real-world datasets, each with realistic and synthetic VQA samples covering a broad range of visual shortcuts.
- Systematic evaluation of 15 LVLMs: Comprehensive assessment of open and closed-source models, revealing that even state-of-the-art systems fall far short on spurious pattern generalization.
- Fine-tuning intervention: Demonstrated that fine-tuning on diverse synthetic spurious patterns improves generalization accuracy from 37.1% to 78.4%, pointing toward a tractable solution.
- Failure mode analysis: Identified when and how models rely on dataset artifacts rather than genuine visual reasoning, providing actionable insights for building more robust VLMs.
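Accuracy on a multiple-choice benchmark like this reduces to exact-match scoring of the chosen option letter. A minimal sketch, assuming hypothetical lists of predicted and gold answer letters (SpuriVerse's actual evaluation harness is not specified here):

```python
def accuracy(preds, golds):
    """Fraction of multiple-choice answers matching the gold option letter."""
    assert len(preds) == len(golds), "one prediction per question"
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)

# Toy example: 3 of 4 hypothetical answers match the gold letters.
preds = ["A", "B", "C", "D"]
golds = ["A", "B", "D", "D"]
print(f"{accuracy(preds, golds):.1%}")  # 75.0%
```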
Citation
@article{yang2025spuriverse,
title = {Escaping the SpuriVerse: Can Large Vision-Language Models
Generalize Beyond Seen Spurious Correlations?},
author = {Yang, Yiwei and Lee, Chung Peng and Feng, Shangbin and
Zhao, Dora and Wen, Bingbing and Liu, Anthony Z. and
Tsvetkov, Yulia and Howe, Bill},
journal = {arXiv preprint arXiv:2506.18322},
year = {2025}
}