arxiv:2602.13517

Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens

Published on Feb 13

Authors:

Abstract

Deep-thinking tokens, identified through revision patterns across model layers, correlate more strongly with reasoning accuracy than generation length does, enabling an efficient test-time scaling strategy.

AI-generated summary

Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does not consistently correlate with accuracy and may instead signal "overthinking," leading to performance degradation. In this work, we quantify inference-time effort by identifying deep-thinking tokens -- tokens where internal predictions undergo significant revisions in deeper model layers prior to convergence. Across four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-diamond) and a diverse set of reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3), we show that deep-thinking ratio (the proportion of deep-thinking tokens in a generated sequence) exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines. Leveraging this insight, we introduce Think@n, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. We demonstrate that Think@n matches or exceeds standard self-consistency performance while significantly reducing inference costs by enabling the early rejection of unpromising generations based on short prefixes.
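To make the two ideas in the abstract concrete, here is a minimal sketch in plain Python. The abstract does not specify how a "significant revision" is measured or how Think@n ranks and aggregates samples, so everything below is an assumption for illustration: the use of Jensen-Shannon divergence against the final layer, the thresholds `tol` and `depth_frac`, the `keep_frac` cutoff, and all function names. It assumes each token comes with one next-token probability distribution per layer (for example, obtained by projecting intermediate hidden states through the output head).

```python
import math
from collections import Counter


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def settling_layer(layer_dists, tol=0.1):
    """Earliest layer index from which every later layer stays within `tol`
    of the final-layer distribution (assumed convergence criterion)."""
    final = layer_dists[-1]
    settled = len(layer_dists) - 1
    for i in range(len(layer_dists) - 1, -1, -1):
        if js_divergence(layer_dists[i], final) > tol:
            break
        settled = i
    return settled


def deep_thinking_ratio(token_layer_dists, tol=0.1, depth_frac=0.5):
    """Fraction of tokens whose prediction only settles in the deeper layers.
    `token_layer_dists` holds, per token, a list of per-layer distributions."""
    if not token_layer_dists:
        return 0.0
    deep = 0
    for dists in token_layer_dists:
        if settling_layer(dists, tol) >= depth_frac * (len(dists) - 1):
            deep += 1
    return deep / len(token_layer_dists)


def think_at_n(samples, keep_frac=0.5):
    """Hypothetical Think@n-style selection: rank samples by the ratio measured
    on a short prefix, keep the top fraction, and majority-vote their answers.
    `samples` is a list of (answer, prefix_ratio) pairs. In a real pipeline the
    rejected samples would be stopped early instead of being decoded to the end."""
    ranked = sorted(samples, key=lambda s: s[1], reverse=True)
    k = max(1, math.ceil(keep_frac * len(ranked)))
    votes = Counter(answer for answer, _ in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy usage: one token that is stable from the first layer, and one whose
# prediction flips only at the last of four layers.
early = [[0.9, 0.1]] * 4
late = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.1, 0.9]]
ratio = deep_thinking_ratio([early, late])

samples = [("42", 0.6), ("42", 0.55), ("7", 0.1),
           ("7", 0.05), ("7", 0.02), ("7", 0.01)]
choice = think_at_n(samples, keep_frac=0.5)
```

In the toy data, plain majority voting over all six samples would pick "7", while filtering by prefix ratio first picks "42". This only illustrates the mechanism; it says nothing about how the method behaves on the benchmarks reported in the paper.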


Community

zuom

8 days ago

Wrote a HF-compatible implementation: deep-think-tokens

grantsing

2 days ago

nice summary of this paper on measuring reasoning effort via deep thinking tokens here https://arxivexplained.com/paper/think-deep-not-just-long-measuring-llm-reasoning-effort-via-deep-thinking-tokens the distinction between depth vs length of reasoning is something i hadn't really considered before

