arxiv:2508.19813

T2R-bench: A Benchmark for Generating Article-Level Reports from Real World Industrial Tables

Published on Aug 27
· Submitted by CSJianYang on Sep 2
#2 Paper of the day
Abstract

A bilingual benchmark named T2R-bench is proposed to evaluate the performance of large language models in generating reports from tables, highlighting the need for improvement in this task.

AI-generated summary

Extensive research has been conducted to explore the capabilities of large language models (LLMs) in table reasoning. However, the essential task of transforming table information into reports remains a significant challenge for industrial applications. This task is plagued by two critical issues: 1) the complexity and diversity of tables lead to suboptimal reasoning outcomes; and 2) existing table benchmarks lack the capacity to adequately assess the practical application of this task. To fill this gap, we propose the table-to-report task and construct a bilingual benchmark named T2R-bench, in which key information flows from the tables into the reports. The benchmark comprises 457 industrial tables, all derived from real-world scenarios and covering 19 industry domains as well as 4 types of industrial tables. Furthermore, we propose evaluation criteria to fairly measure the quality of report generation. Experiments on 25 widely used LLMs reveal that even state-of-the-art models like Deepseek-R1 achieve only an overall score of 62.71, indicating that LLMs still have room for improvement on T2R-bench. Source code and data will be available after acceptance.

Community

Paper submitter

😎 T2R-bench: A Benchmark for Generating Article-Level Reports from Real-World Industrial Tables
This paper introduces a new benchmark called T2R-bench, designed to evaluate how well large language models (LLMs) can generate detailed reports from complex industrial tables—a common yet challenging task in real-world applications.

🧩 Problem & Motivation:
While LLMs have improved in tasks like table QA and text-to-SQL, they still struggle with generating accurate, coherent, and insightful reports from diverse and complex industrial tables. Existing benchmarks don’t adequately reflect practical industrial needs.

📊 Dataset Overview:
T2R-bench includes 457 real industrial tables from 19 domains and 4 table types, reflecting high diversity and complexity. Each table is paired with a human-written reference report.

📐 Evaluation Criteria:
The authors propose a comprehensive evaluation framework to measure report quality, focusing on information accuracy, coherence, depth of analysis, and conclusion quality.

🤖 Experimental Insights:
Tests on 25 popular LLMs show that even top models like Deepseek-R1 achieve only an overall score of 62.71, indicating significant room for improvement on real-world table-to-report tasks.

🔮 Conclusion:
T2R-bench fills an important gap in evaluating LLMs for practical industrial report generation. The dataset and code will be released upon publication.
