SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators Paper • 2502.06394 • Published Feb 10 • 86
Vinoground: Scrutinizing LMMs over Dense Temporal Reasoning with Short Videos Paper • 2410.02763 • Published Oct 3, 2024 • 7
PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation Paper • 2409.06820 • Published Sep 10, 2024 • 66
MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems? Paper • 2403.14624 • Published Mar 21, 2024 • 52
Premise Order Matters in Reasoning with Large Language Models Paper • 2402.08939 • Published Feb 14, 2024 • 28