arxiv:2508.06009

MathReal: We Keep It Real! A Real Scene Benchmark for Evaluating Math Reasoning in Multimodal Large Language Models

Published on Aug 8 · Submitted by junfeng0288 on Aug 14
Abstract

AI-generated summary

MathReal, a dataset of real-world mathematical questions with images, evaluates the performance of multimodal large language models in authentic educational settings, highlighting challenges and providing insights for improvement.

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in visual mathematical reasoning across various existing benchmarks. However, these benchmarks are predominantly based on clean or processed multimodal inputs, without incorporating the images provided by real-world Kindergarten through 12th grade (K-12) educational users. To address this gap, we introduce MathReal, a meticulously curated dataset comprising 2,000 mathematical questions with images captured by handheld mobile devices in authentic scenarios. Each question is presented as a single image containing both the question text and its visual elements. We systematically classify the real images into three primary categories: image quality degradation, perspective variation, and irrelevant content interference, which are further delineated into 14 subcategories. Additionally, MathReal spans five core knowledge and ability categories, three question types, and three difficulty levels. To comprehensively evaluate the multimodal mathematical reasoning abilities of state-of-the-art MLLMs in real-world scenarios, we design six experimental settings that enable a systematic analysis of their performance. Through extensive experimentation, we find that the problem-solving abilities of existing MLLMs are significantly challenged in realistic educational contexts. Based on these findings, we conduct a thorough analysis of their performance and error patterns, providing insights into their recognition, comprehension, and reasoning capabilities, and outlining directions for future improvement. Data and code: https://github.com/junfeng0288/MathReal.

Community

Paper author · Paper submitter

🎯 In this paper, we present MathReal, the first real-world benchmark specifically created to evaluate the visual mathematical reasoning capabilities of Multimodal Large Language Models (MLLMs) in authentic K–12 educational contexts. Our work introduces a challenging and realistic dataset that moves beyond existing clean-image benchmarks and reflects the noisy, imperfect inputs found in everyday educational practice.

Key features of our approach include:

1. Real-World Multimodal Math Dataset: MathReal contains 2,000 math questions captured by handheld mobile devices in natural settings, with both the question text and figures embedded within the image. The dataset covers three major categories of real-world challenges: image quality degradation, perspective variation, and irrelevant content interference, further divided into 14 fine-grained subcategories that give a detailed representation of authentic conditions.

2. Comprehensive Coverage and Annotation: Each question is categorized across five core knowledge and ability areas, three question types, and three difficulty levels, spanning the full K–12 curriculum. All samples are carefully annotated with ground-truth text, detailed figure descriptions, and correct answers, verified by expert annotators to ensure accuracy and consistency (an illustrative record layout is sketched after this list).

3. Rigorous Evaluation and Insights: We evaluate 40 MLLMs under six experimental settings, revealing that even top-performing models suffer substantial performance drops when processing real-world images. Controlled experiments confirm that visual imperfections such as blur, rotation, and handwritten content significantly hinder both perception and reasoning, highlighting the gap between current model capabilities and real-world requirements.
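To make the annotation scheme above concrete, here is a minimal sketch of what a single MathReal record might look like. The field names and values are hypothetical, chosen only to mirror the taxonomy described in the paper (noise category and subcategory, knowledge area, question type, difficulty, ground-truth text, figure description, and answer); consult the released data for the actual schema.

```python
# Hypothetical MathReal record layout -- field names are illustrative,
# NOT the released schema. See https://github.com/junfeng0288/MathReal.
sample = {
    "image": "images/q_0001.jpg",                   # phone-captured photo: question text + figure
    "noise_category": "image quality degradation",  # one of the 3 primary categories
    "noise_subcategory": "motion blur",             # one of the 14 fine-grained subcategories
    "knowledge_area": "plane geometry",             # one of the 5 knowledge/ability categories
    "question_type": "multiple choice",             # one of the 3 question types
    "difficulty": "medium",                         # one of the 3 difficulty levels
    "ocr_ground_truth": "In triangle ABC, ...",     # expert-verified question text
    "figure_description": "A triangle with ...",    # detailed figure description
    "answer": "B",                                  # expert-verified correct answer
}
```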

The MathReal dataset and code will be made available at https://github.com/junfeng0288/MathReal, offering a valuable resource for advancing robust multimodal mathematical reasoning in real educational scenarios. 🚀
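For readers who want to experiment once the data is released, the sketch below shows one plausible way to load and inspect the benchmark. Both the Hub dataset ID ("junfeng0288/MathReal") and the column names are assumptions carried over from the hypothetical record above, so follow the GitHub repository for the authoritative instructions.

```python
# Minimal loading sketch -- the dataset ID, split name, and column names
# below are ASSUMPTIONS, not confirmed by the paper or the repository.
from datasets import load_dataset

ds = load_dataset("junfeng0288/MathReal", split="test")  # hypothetical ID/split
print(len(ds))  # expected: 2,000 questions

example = ds[0]
# The photo contains both the question text and the figure.
example["image"].show()       # PIL image, if the column is an Image feature
print(example.get("answer"))  # expert-verified ground-truth answer
```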


