See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models • Paper 2512.02231 • Published 28 days ago
X-Fusion: Introducing New Modality to Frozen Large Language Models • Paper 2504.20996 • Published Apr 29
Visual Instruction Inversion: Image Editing via Visual Prompting • Paper 2307.14331 • Published Jul 26, 2023