ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use Paper • 2501.02506 • Published Jan 5 • 11
BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning Paper • 2501.03226 • Published Jan 6 • 37
Test-time Computing: from System-1 Thinking to System-2 Thinking Paper • 2501.02497 • Published Jan 5 • 41