NeurIPS 2025  ·  ICML 2025 R2-FM

Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations?

Yiwei Yang*, Chung Peng Lee*, Shangbin Feng, Dora Zhao, Bingbing Wen, Anthony Z. Liu, Yulia Tsvetkov, Bill Howe

University of Washington

arXiv · PDF

TL;DR

We build SpuriVerse, a benchmark of 124 real-world spurious correlation types, and find that state-of-the-art vision-language models achieve only 37.1% accuracy, revealing a fundamental gap in their ability to escape shortcuts seen during training.

Abstract

Large vision-language models (LVLMs) like GPT-4V, Gemini 1.5, Claude, and Qwen 2.5-VL demonstrate impressive performance across various tasks, but can they truly generalize beyond spurious correlations learned during training?

We introduce SpuriVerse, a novel benchmark comprising 124 distinct types of spurious correlations mined from GPT-4o errors on real-world visual question-answering datasets. Each type pairs 1 realistic sample with 10 synthetic VQA samples, yielding 1,364 multiple-choice questions in total. Across 15 open- and closed-source models, the top performer achieves only 37.1% accuracy. Fine-tuning on the synthetic examples, however, raises accuracy to 78.4%, indicating that models can learn to avoid shortcuts and attend to broader image context when given the right training signal.
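The benchmark arithmetic above can be sanity-checked directly: 124 types, each with 1 realistic and 10 synthetic samples, gives the reported 1,364 questions. The sketch below, using only the numbers from the abstract plus a generic multiple-choice accuracy helper (the toy predictions are hypothetical, not benchmark data), illustrates how the headline accuracy figures are computed:

```python
# Sanity-check the benchmark size described in the abstract.
n_types = 124
samples_per_type = 1 + 10  # 1 realistic + 10 synthetic VQA samples per type
total_questions = n_types * samples_per_type
print(total_questions)  # → 1364

def accuracy(predictions, answers):
    """Fraction of multiple-choice predictions matching the gold answers."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)

# Hypothetical toy example (illustrative only, not real benchmark data):
print(accuracy(["A", "C", "B", "D"], ["A", "B", "B", "D"]))  # → 0.75
```

Reported scores like 37.1% and 78.4% are this fraction computed over all 1,364 questions.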

Benchmark at a Glance

124 Spurious correlation types
1,364 VQA questions
15 Models evaluated
37.1% Best model accuracy

Key Contributions

- SpuriVerse: a benchmark of 124 real-world spurious correlation types (1,364 multiple-choice VQA questions) curated from GPT-4o errors.
- A large-scale evaluation of 15 open- and closed-source LVLMs, with the best model reaching only 37.1% accuracy.
- Evidence that fine-tuning on synthetic examples raises accuracy to 78.4%, showing models can learn to attend to broader image context.

Citation

@article{yang2025spuriverse,
  title   = {Escaping the SpuriVerse: Can Large Vision-Language Models
             Generalize Beyond Seen Spurious Correlations?},
  author  = {Yang, Yiwei and Lee, Chung Peng and Feng, Shangbin and
             Zhao, Dora and Wen, Bingbing and Liu, Anthony Z. and
             Tsvetkov, Yulia and Howe, Bill},
  journal = {arXiv preprint arXiv:2506.18322},
  year    = {2025}
}