SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs

Carnegie Mellon University, Mitsubishi Electric Research Labs
geometric reasoning

Overview of SPINBENCH task design across seven task groups. Representative subtasks are illustrated for each group with simplified question wording for clarity. In the released benchmark, all queries include explicit frame-of-reference definitions to avoid ambiguity. Human face data are sourced from the Stereo Face Database and are licensed for research use only.

Abstract

We present SpinBench, a cognitively grounded diagnostic benchmark for evaluating spatial reasoning in vision-language models (VLMs). SpinBench is designed around the core challenge of spatial reasoning: perspective taking, the ability to reason about how scenes and object relations change under viewpoint transformation. Since perspective taking requires multiple cognitive capabilities, such as recognizing objects across views, grounding relative positions, and mentally simulating transformations, SpinBench introduces a set of fine-grained diagnostic categories. Our categories target translation, rotation, object relative pose, and viewpoint change, and are progressively structured so that simpler single-object tasks scaffold toward the most demanding multi-object perspective-taking setting. We evaluate 37 state-of-the-art VLMs, both proprietary and open source. Results reveal systematic weaknesses: strong egocentric bias, poor rotational understanding, and inconsistencies under symmetrical and syntactic reformulations. Scaling analysis shows both smooth improvements and emergent capabilities. While human subjects achieve high accuracy (91.2%), task difficulty as measured by human response time correlates strongly with VLM accuracy, indicating that SpinBench captures spatial reasoning challenges shared across humans and VLMs. Together, our findings highlight the need for structured, cognitively inspired diagnostic tools to advance spatial reasoning in multimodal foundation models.

Evaluations on 37 VLMs

eval-heatmap

In the figure below, Cohen's kappa values (κ) measure chance-adjusted performance, where κ = 0 indicates chance-level performance and κ = 1 perfect accuracy. Results are shown across 23 grouped task variants under 7 spatial reasoning categories.
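As a concrete illustration of the chance adjustment, the minimal sketch below computes a kappa-style score from raw accuracy under the assumption that each question is multiple choice with a uniform chance rate of 1/(number of options); the benchmark's exact kappa computation may differ (e.g., it may use per-option marginals).

def chance_adjusted_kappa(num_correct: int, num_total: int, num_options: int) -> float:
    """kappa = (observed accuracy - chance accuracy) / (1 - chance accuracy)."""
    p_observed = num_correct / num_total
    p_chance = 1.0 / num_options  # uniform guessing over answer options (assumption)
    return (p_observed - p_chance) / (1.0 - p_chance)

# Example: 70/100 correct on binary (yes/no) questions -> kappa = 0.4
print(chance_adjusted_kappa(70, 100, 2))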

Inconsistencies in logically equivalent spatial queries

consistency_analysis

Many models are inconsistent on logically equivalent spatial queries. While InternVL3-38B achieves 95.7% consistency, many models fall below 30%. Accuracy and consistency are strongly correlated (r = 0.874, p < 0.05).
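For intuition, the sketch below scores consistency as the fraction of logically equivalent query pairs (e.g., "Is A left of B?" versus "Is B to the right of A?") answered in agreement; the pairing scheme and answer normalization here are illustrative assumptions, not the paper's exact protocol.

from typing import List, Tuple

def consistency_score(answer_pairs: List[Tuple[str, str]]) -> float:
    """Fraction of equivalent query pairs answered consistently.

    Each tuple holds a model's answers to two reformulations phrased so
    that a consistent model gives the same yes/no answer to both."""
    agree = sum(a.strip().lower() == b.strip().lower() for a, b in answer_pairs)
    return agree / len(answer_pairs)

# Example: inconsistent on one of three pairs -> ~0.667
print(consistency_score([("yes", "yes"), ("no", "yes"), ("yes", "yes")]))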

Biased perspective

Models show a strong egocentric bias in dynamic rotation tasks, even when asked to adopt other viewpoints. Top performers on egocentric tasks do poorly on allocentric ones, likely due to training data favoring first-person views.

Scaling laws and emergent capability

human_analysis

Performance improves with model size but varies by task. Object relation grounding improves gradually, while identity matching shows a sharp jump once models exceed 7B parameters. These trends reflect emergent abilities and reveal gaps between small and large models.

Visual failures or linguistic failures

linguistic_analysis

Perspective-taking tasks test spatial reasoning across viewpoints. Even when the spatial relations are described explicitly (the relevant information is abstracted into the premise, so no visual input is needed), many models fail, showing that reasoning errors persist even in purely linguistic tasks. Top-performing models also perform well on these linguistic tasks.

Human response time and VLM accuracy

human_analysis

We conducted a human evaluation with twelve subjects to establish performance baselines and validate task difficulty. Tasks that take humans longer also yield lower VLM accuracy, showing a significant negative correlation (r = –0.54, p < 0.05). As a diagnostic tool, SpinBench captures spatial reasoning challenges shared across humans and VLMs.
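A minimal sketch of the reported correlation analysis, assuming Pearson's r between mean human response time per task and mean VLM accuracy per task (computed here with SciPy); the arrays hold placeholder values, not SpinBench data.

from scipy.stats import pearsonr

human_rt_sec = [3.1, 4.8, 6.2, 7.5, 9.0]       # hypothetical per-task mean response times (s)
vlm_accuracy = [0.82, 0.74, 0.66, 0.58, 0.49]  # hypothetical per-task mean VLM accuracies

r, p = pearsonr(human_rt_sec, vlm_accuracy)
print(f"r = {r:.2f}, p = {p:.3g}")  # a negative r means tasks harder for humans are also harder for VLMs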

BibTeX

@misc{zhang2025spinbenchperspectiverotationlens,
      title={SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs}, 
      author={Yuyou Zhang and Radu Corcodel and Chiori Hori and Anoop Cherian and Ding Zhao},
      year={2025},
      eprint={2509.25390},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.25390}, 
}