kaizuberbuehler's Collections
Reasoning, Thinking, RL and Test-Time Scaling
Mulberry: Empowering MLLM with o1-like Reasoning and Reflection via Collective Monte Carlo Tree Search
Paper • 2412.18319 • Published • 39
Token-Budget-Aware LLM Reasoning
Paper • 2412.18547 • Published • 46
Efficiently Serving LLM Reasoning Programs with Certaindex
Paper • 2412.20993 • Published • 37
B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners
Paper • 2412.17256 • Published • 47
Paper • 2412.16720 • Published • 33
DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought
Paper • 2412.17498 • Published • 22
Outcome-Refining Process Supervision for Code Generation
Paper • 2412.15118 • Published • 19
Critical Tokens Matter: Token-Level Contrastive Estimation Enhence LLM's Reasoning Capability
Paper • 2411.19943 • Published • 60
MALT: Improving Reasoning with Multi-Agent LLM Training
Paper • 2412.01928 • Published • 44
Mars-PO: Multi-Agent Reasoning System Preference Optimization
Paper • 2411.19039 • Published • 1
Flow-DPO: Improving LLM Mathematical Reasoning through Online Multi-Agent Learning
Paper • 2410.22304 • Published • 18
o1-Coder: an o1 Replication for Coding
Paper • 2412.00154 • Published • 44
Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions
Paper • 2411.14405 • Published • 61
OpenR: An Open Source Framework for Advanced Reasoning with Large Language Models
Paper • 2410.09671 • Published • 1
SRA-MCTS: Self-driven Reasoning Augmentation with Monte Carlo Tree Search for Code Generation
Paper • 2411.11053 • Published • 4
Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS
Paper • 2411.18478 • Published • 37
Reverse Thinking Makes LLMs Stronger Reasoners
Paper • 2411.19865 • Published • 22
Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision
Paper • 2411.16579 • Published • 3
Vision-Language Models Can Self-Improve Reasoning via Reflection
Paper • 2411.00855 • Published • 5
Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding
Paper • 2411.04282 • Published • 35
Insight-V: Exploring Long-Chain Visual Reasoning with Multimodal Large Language Models
Paper • 2411.14432 • Published • 25
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning
Paper • 2411.18203 • Published • 36
O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?
Paper • 2411.16489 • Published • 48
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
Paper • 2411.14794 • Published • 13
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
Paper • 2411.10442 • Published • 80
LLaMA-Berry: Pairwise Optimization for O1-like Olympiad-Level Mathematical Reasoning
Paper • 2410.02884 • Published • 54
LLaVA-o1: Let Vision Language Models Reason Step-by-Step
Paper • 2411.10440 • Published • 122
Large Language Models Can Self-Improve in Long-context Reasoning
Paper • 2411.08147 • Published • 65
Self-Consistency Preference Optimization
Paper • 2411.04109 • Published • 18
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
Paper • 2501.04519 • Published • 270
URSA: Understanding and Verifying Chain-of-thought Reasoning in Multimodal Mathematics
Paper • 2501.04686 • Published • 52
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
Paper • 2501.04682 • Published • 94
BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning
Paper • 2501.03226 • Published • 44
Test-time Computing: from System-1 Thinking to System-2 Thinking
Paper • 2501.02497 • Published • 44
Virgo: A Preliminary Exploration on Reproducing o1-like MLLM
Paper • 2501.01904 • Published • 33
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
Paper • 2412.21187 • Published • 41
Search-o1: Agentic Search-Enhanced Large Reasoning Models
Paper • 2501.05366 • Published • 100
The Lessons of Developing Process Reward Models in Mathematical Reasoning
Paper • 2501.07301 • Published • 96
O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning
Paper • 2501.06458 • Published • 31
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
Paper • 2501.06186 • Published • 64
OmniThink: Expanding Knowledge Boundaries in Machine Writing through Thinking
Paper • 2501.09751 • Published • 48
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
Paper • 2501.09686 • Published • 39
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Paper • 2501.12948 • Published • 364
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Paper • 2501.12599 • Published • 111
s1: Simple test-time scaling
Paper • 2501.19393 • Published • 113
Demystifying Long Chain-of-Thought Reasoning in LLMs
Paper • 2502.03373 • Published • 58
Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling
Paper • 2502.06703 • Published • 147
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Paper • 2501.17161 • Published • 116
On the Emergence of Thinking in LLMs I: Searching for the Right Intuition
Paper • 2502.06773 • Published • 1
Competitive Programming with Large Reasoning Models
Paper • 2502.06807 • Published • 67
Evolving Deeper LLM Thinking
Paper • 2501.09891 • Published • 112
Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong
Paper • 2501.09775 • Published • 32
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model
Paper • 2501.12368 • Published • 44
Reasoning Language Models: A Blueprint
Paper • 2501.11223 • Published • 32
O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning
Paper • 2501.12570 • Published • 26
Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament
Paper • 2501.13007 • Published • 20
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step
Paper • 2501.13926 • Published • 41
Step-KTO: Optimizing Mathematical Reasoning through Stepwise Binary Feedback
Paper • 2501.10799 • Published • 15
Chain-of-Retrieval Augmented Generation
Paper • 2501.14342 • Published • 54
RL + Transformer = A General-Purpose Problem Solver
Paper • 2501.14176 • Published • 27
Towards General-Purpose Model-Free Reinforcement Learning
Paper • 2501.16142 • Published • 28
Atla Selene Mini: A General Purpose Evaluation Model
Paper • 2501.17195 • Published • 35
Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
Paper • 2501.18585 • Published • 59
Large Language Models Think Too Fast To Explore Effectively
Paper • 2501.18009 • Published • 23
Reward-Guided Speculative Decoding for Efficient LLM Reasoning
Paper • 2501.19324 • Published • 38
Process Reinforcement through Implicit Rewards
Paper • 2502.01456 • Published • 57
NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions
Paper • 2502.13124 • Published • 5
ACECODER: Acing Coder RL via Automated Test-Case Synthesis
Paper • 2502.01718 • Published • 28
Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search
Paper • 2502.02508 • Published • 23
QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search
Paper • 2502.02584 • Published • 17
LIMO: Less is More for Reasoning
Paper • 2502.03387 • Published • 60
Boosting Multimodal Reasoning with MCTS-Automated Structured Thinking
Paper • 2502.02339 • Published • 22
On Teacher Hacking in Language Model Distillation
Paper • 2502.02671 • Published • 18
Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning
Paper • 2502.03275 • Published • 16
Gold-medalist Performance in Solving Olympiad Geometry with AlphaGeometry2
Paper • 2502.03544 • Published • 43
BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation
Paper • 2502.03860 • Published • 24
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
Paper • 2502.05171 • Published • 129
DuoGuard: A Two-Player RL-Driven Framework for Multilingual LLM Guardrails
Paper • 2502.05163 • Published • 22
Step Back to Leap Forward: Self-Backtracking for Boosting Reasoning of Language Models
Paper • 2502.04404 • Published • 24
Generating Symbolic World Models via Test-time Scaling of Large Language Models
Paper • 2502.04728 • Published • 19
Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning
Paper • 2502.06781 • Published • 60
Training Language Models for Social Deduction with Multi-Agent Reinforcement Learning
Paper • 2502.06060 • Published • 34
ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates
Paper • 2502.06772 • Published • 21
CodeI/O: Condensing Reasoning Patterns via Code Input-Output Prediction
Paper • 2502.07316 • Published • 47
LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!
Paper • 2502.07374 • Published • 37
Teaching Language Models to Critique via Reinforcement Learning
Paper • 2502.03492 • Published • 24
Fino1: On the Transferability of Reasoning Enhanced LLMs to Finance
Paper • 2502.08127 • Published • 52
Ignore the KL Penalty! Boosting Exploration on Critical Tokens to Enhance RL Fine-Tuning
Paper • 2502.06533 • Published • 18
An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging
Paper • 2502.09056 • Published • 30
SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models
Paper • 2502.09604 • Published • 34
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency
Paper • 2502.09621 • Published • 27
Logical Reasoning in Large Language Models: A Survey
Paper • 2502.09100 • Published • 22
SQuARE: Sequential Question Answering Reasoning Engine for Enhanced Chain-of-Thought in Large Language Models
Paper • 2502.09390 • Published • 16
Typhoon T1: An Open Thai Reasoning Model
Paper • 2502.09042 • Published • 16
CoT-Valve: Length-Compressible Chain-of-Thought Tuning
Paper • 2502.09601 • Published • 14
Mathematical Reasoning in Large Language Models: Assessing Logical and Arithmetic Errors across Wide Numerical Ranges
Paper • 2502.08680 • Published • 11
Small Models Struggle to Learn from Strong Reasoners
Paper • 2502.12143 • Published • 32
S*: Test Time Scaling for Code Generation
Paper • 2502.14382 • Published • 61
Diverse Inference and Verification for Advanced Reasoning
Paper • 2502.09955 • Published • 17
Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning
Paper • 2503.09516 • Published • 25
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
Paper • 2502.18449 • Published • 71
Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning
Paper • 2502.14768 • Published • 47
AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO
Paper • 2502.14669 • Published • 12
I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models
Paper • 2502.10458 • Published • 33
S^2R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning
Paper • 2502.12853 • Published • 29
Thinking Preference Optimization
Paper • 2502.13173 • Published • 17
Self-rewarding correction for mathematical reasoning
Paper • 2502.19613 • Published • 82
Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning?
Paper • 2502.19361 • Published • 27
LightThinker: Thinking Step-by-Step Compression
Paper • 2502.15589 • Published • 27
Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't
Paper • 2503.16219 • Published • 43
R1-T1: Fully Incentivizing Translation Capability in LLMs via Reasoning Learning
Paper • 2502.19735 • Published • 8
PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving
Paper • 2502.16111 • Published • 9
TAG: A Decentralized Framework for Multi-Agent Hierarchical Reinforcement Learning
Paper • 2502.15425 • Published • 9
The Relationship Between Reasoning and Performance in Large Language Models -- o3 (mini) Thinks Harder, Not Longer
Paper • 2502.15631 • Published • 9
Multimodal Inconsistency Reasoning (MMIR): A New Benchmark for Multimodal Reasoning Models
Paper • 2502.16033 • Published • 17
Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning
Paper • 2502.17407 • Published • 25
VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model
Paper • 2502.18906 • Published • 12
Agentic Reward Modeling: Integrating Human Preferences with Verifiable Correctness Signals for Reliable Reward Systems
Paper • 2502.19328 • Published • 22
START: Self-taught Reasoner with Tools
Paper • 2503.04625 • Published • 98
Visual-RFT: Visual Reinforcement Fine-Tuning
Paper • 2503.01785 • Published • 70
Chain of Draft: Thinking Faster by Writing Less
Paper • 2502.18600 • Published • 46
Process-based Self-Rewarding Language Models
Paper • 2503.03746 • Published • 37
DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking
Paper • 2502.20730 • Published • 38