SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs

Carnegie Mellon University, Mitsubishi Electric Research Labs
geometric reasoning

Overview of SPINBENCH task design across seven task groups. Representative subtasks are illustrated for each group with simplified question wording for clarity. In the released benchmark, all queries include explicit frame-of-reference definitions to avoid ambiguity. Human face data are sourced from the Stereo Face Database and are licensed for research use only.

Abstract

We present SpinBench, a cognitively grounded diagnostic benchmark for evaluating spatial reasoning in vision-language models (VLMs). SpinBench is designed around the core challenge of spatial reasoning: perspective taking, the ability to reason about how scenes and object relations change under viewpoint transformation. Since perspective taking requires multiple cognitive capabilities, such as recognizing objects across views, grounding relative positions, and mentally simulating transformations, SpinBench introduces a set of fine-grained diagnostic categories. Our categories target translation, rotation, object relative pose, and viewpoint change, and are progressively structured so that simpler single-object tasks scaffold toward the most demanding multi-object perspective-taking setting. We evaluate 37 state-of-the-art VLMs, both proprietary and open source. Results reveal systematic weaknesses: strong egocentric bias, poor rotational understanding, and inconsistencies under symmetrical and syntactic reformulations. Scaling analysis shows both smooth improvements and emergent capabilities. While human subjects achieve high accuracy (91.2%), task difficulty as measured by human response time correlates strongly with VLM accuracy, indicating that SpinBench captures spatial reasoning challenges shared across humans and VLMs. Together, our findings highlight the need for structured, cognitively inspired diagnostic tools to advance spatial reasoning in multimodal foundation models.

Evaluations on 37 VLMs

eval-heatmap

In the figure below, Cohen's kappa values (κ) measure chance-adjusted performance, where κ = 0 indicates chance-level performance and κ = 1 perfect accuracy. Results are shown across 23 grouped task variants under 7 spatial reasoning categories.
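As a concrete illustration of the chance adjustment, the minimal sketch below computes a kappa-style score from raw accuracy under the assumption that each question is multiple choice with a uniform chance rate of 1/(number of options); the benchmark's exact kappa computation may differ (e.g., it may use per-option marginals).

def chance_adjusted_kappa(num_correct: int, num_total: int, num_options: int) -> float:
    """kappa = (observed accuracy - chance accuracy) / (1 - chance accuracy)."""
    p_observed = num_correct / num_total
    p_chance = 1.0 / num_options  # uniform guessing over answer options (assumption)
    return (p_observed - p_chance) / (1.0 - p_chance)

# Example: 70/100 correct on binary (yes/no) questions -> kappa = 0.4
print(chance_adjusted_kappa(70, 100, 2))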

Inconsistencies in logically equivalent spatial queries

consistency_analysis

Many models are inconsistent on logically equivalent spatial queries. While InternVL3-38B achieves 95.7% consistency, many models fall below 30%. Accuracy and consistency are strongly correlated (r = 0.874, p < 0.05).
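For intuition, the sketch below scores consistency as the fraction of logically equivalent query pairs (e.g., "Is A left of B?" versus "Is B to the right of A?") answered in agreement; the pairing scheme and answer normalization here are illustrative assumptions, not the paper's exact protocol.

from typing import List, Tuple

def consistency_score(answer_pairs: List[Tuple[str, str]]) -> float:
    """Fraction of equivalent query pairs answered consistently.

    Each tuple holds a model's answers to two reformulations phrased so
    that a consistent model gives the same yes/no answer to both."""
    agree = sum(a.strip().lower() == b.strip().lower() for a, b in answer_pairs)
    return agree / len(answer_pairs)

# Example: inconsistent on one of three pairs -> ~0.667
print(consistency_score([("yes", "yes"), ("no", "yes"), ("yes", "yes")]))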

Biased perspective

Models show a strong egocentric bias in dynamic rotation tasks, even when asked to adopt other viewpoints. Top performers on egocentric tasks do poorly on allocentric ones, likely due to training data favoring first-person views.

Scaling laws and emergent capability

human_analysis

Performance improves with model size but varies by task. Object relation grounding improves gradually, while identity matching shows a sharp jump once models exceed 7B parameters. These trends reflect emergent abilities and reveal gaps between small and large models.

Visual failures or linguistic failures

linguistic_analysis

Perspective-taking tasks test spatial reasoning across viewpoints. Even when the spatial relations are described explicitly (the relevant information is abstracted into the premise, so no visual input is needed), many models fail, showing that reasoning errors persist even in purely linguistic tasks. Top-performing models also perform well on these linguistic tasks.

Human response time and VLM accuracy

human_analysis

We conducted a human evaluation with twelve subjects to establish performance baselines and validate task difficulty. Tasks that take humans longer also yield lower VLM accuracy, showing a significant negative correlation (r = –0.54, p < 0.05). As a diagnostic tool, SpinBench captures spatial reasoning challenges shared across humans and VLMs.
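A minimal sketch of the reported correlation analysis, assuming Pearson's r between mean human response time per task and mean VLM accuracy per task (computed here with SciPy); the arrays hold placeholder values, not SpinBench data.

from scipy.stats import pearsonr

human_rt_sec = [3.1, 4.8, 6.2, 7.5, 9.0]       # hypothetical per-task mean response times (s)
vlm_accuracy = [0.82, 0.74, 0.66, 0.58, 0.49]  # hypothetical per-task mean VLM accuracies

r, p = pearsonr(human_rt_sec, vlm_accuracy)
print(f"r = {r:.2f}, p = {p:.3g}")  # a negative r means tasks harder for humans are also harder for VLMs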

BibTeX

@misc{zhang2025spinbenchperspectiverotationlens,
      title={SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs}, 
      author={Yuyou Zhang and Radu Corcodel and Chiori Hori and Anoop Cherian and Ding Zhao},
      year={2025},
      eprint={2509.25390},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2509.25390}, 
}