SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models
Abstract
SEAM is a benchmark that evaluates vision-language models' reasoning consistency across modalities using semantically equivalent inputs, revealing systematic modality imbalance and visual hallucinations.
Evaluating whether vision-language models (VLMs) reason consistently across representations is challenging because modality comparisons are typically confounded by task differences and asymmetric information. We introduce SEAM, a benchmark that pairs semantically equivalent inputs across four domains with existing standardized textual and visual notations. By employing distinct notation systems across modalities, in contrast to OCR-based image-text pairing, SEAM provides a rigorous comparative assessment of the textual-symbolic and visual-spatial reasoning capabilities of VLMs. Across 21 contemporary models, we observe systematic modality imbalance: vision frequently lags language in overall performance, even though the problems contain semantically equivalent information, and cross-modal agreement is relatively low. Our error analysis reveals two main drivers: textual perception failures arising from the tokenization of domain notation, and visual perception failures that induce hallucinations. We also show that our results are largely robust to visual transformations. SEAM establishes a controlled, semantically equivalent setting for measuring and improving modality-agnostic reasoning.
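As a rough illustration of the evaluation described above, the sketch below computes per-modality accuracy and a simple cross-modal agreement rate (the fraction of items answered identically under the textual and visual renderings of the same problem). The `ItemResult` schema, its field names, and this particular agreement definition are assumptions made for illustration, not the benchmark's released code or exact protocol.

```python
from dataclasses import dataclass

@dataclass
class ItemResult:
    """One benchmark item evaluated in both modalities (hypothetical schema)."""
    gold: str           # ground-truth answer
    text_answer: str    # model answer given the textual notation
    vision_answer: str  # model answer given the rendered image

def modality_scores(results: list[ItemResult]) -> dict[str, float]:
    """Return per-modality accuracy and cross-modal agreement.

    Agreement here is the fraction of items on which the model gives the
    same answer for the textual and visual inputs, regardless of whether
    that answer is correct -- one plausible reading of "cross-modal
    agreement", used only for illustration.
    """
    n = len(results)
    text_acc = sum(r.text_answer == r.gold for r in results) / n
    vision_acc = sum(r.vision_answer == r.gold for r in results) / n
    agreement = sum(r.text_answer == r.vision_answer for r in results) / n
    return {"text_acc": text_acc, "vision_acc": vision_acc, "agreement": agreement}

# Toy usage with chess-notation style items (illustrative values only).
demo = [
    ItemResult(gold="Qxf7#", text_answer="Qxf7#", vision_answer="Qxf7#"),
    ItemResult(gold="Nf3",   text_answer="Nf3",   vision_answer="Nc3"),
    ItemResult(gold="O-O",   text_answer="O-O-O", vision_answer="O-O-O"),
]
print(modality_scores(demo))
```

Under this toy setup, a gap between `text_acc` and `vision_acc` reflects the modality imbalance the paper reports, while a low `agreement` value indicates the model answers the same underlying problem differently depending on the input modality.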
Community
COLM 2025
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- MAC: A Live Benchmark for Multimodal Large Language Models in Scientific Understanding (2025)
- MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models (2025)
- AURA: A Fine-Grained Benchmark and Decomposed Metric for Audio-Visual Reasoning (2025)
- Inverse-LLaVA: Eliminating Alignment Pre-training Through Text-to-Vision Mapping (2025)
- MDK12-Bench: A Comprehensive Evaluation of Multimodal Large Language Models on Multidisciplinary Exams (2025)
- Self-Rewarding Vision-Language Model via Reasoning Decomposition (2025)
- Beyond the Visible: Benchmarking Occlusion Perception in Multimodal Large Language Models (2025)