Watch Before You Answer: Learning from Visually Grounded Post-Training Paper • 2604.05117 • Published 3 days ago • 23
Dr. Bench: A Multidimensional Evaluation for Deep Research Agents, from Answers to Reports Paper • 2510.02190 • Published Jan 29 • 19
VideoScore2: Think before You Score in Generative Video Evaluation Paper • 2509.22799 • Published Sep 26, 2025 • 26
StructEval: Benchmarking LLMs' Capabilities to Generate Structural Outputs Paper • 2505.20139 • Published May 26, 2025 • 19
ScholarCopilot: Training Large Language Models for Academic Writing with Accurate Citations Paper • 2504.00824 • Published Apr 1, 2025 • 43