Submitted by lixiaochuan 72 DropletVideo: A Dataset and Approach to Explore Integral Spatio-Temporal Consistent Video Generation · 13 authors 1
Submitted by tellarin 49 Being-0: A Humanoid Robotic Agent with Vision-Language Models and Modular Skills · 9 authors 1
Submitted by limuloo1999 34 DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models · 4 authors 2
Submitted by yyyyyyjjjjzzz 22 SPIN-Bench: How Well Do LLMs Plan Strategically and Reason Socially? · 8 authors 2
Submitted by Orannue 21 Edit Transfer: Learning Image Editing via Vision In-Context Relations · 4 authors 4
Submitted by ZyZcuhk 19 BlobCtrl: A Unified and Flexible Framework for Element-level Image Generation and Editing · 9 authors 1
Submitted by Lingaaaaaaa 16 WideRange4D: Enabling High-Quality 4D Reconstruction with Wide-Range Movements and Scenes · 8 authors 1
Submitted by jmhb 16 MicroVQA: A Multimodal Reasoning Benchmark for Microscopy-Based Scientific Research · 23 authors 1
Submitted by ZhaofengWu 14 reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs · 6 authors 1
Submitted by akhaliq 13 R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization · 7 authors 1
Submitted by lwpyh 10 V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning · 6 authors 1
Submitted by Luo-Yihong 6 Rewards Are Enough for Fast Photo-Realistic Text-to-image Generation · 5 authors 1
Submitted by Buzz-lightyear 6 Long-Video Audio Synthesis with Multi-Agent Collaboration · 5 authors 2
Submitted by k-nick 5 Error Analyses of Auto-Regressive Video Diffusion Models: A Unified Framework · 8 authors 1
Submitted by soarhigh 3 Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions · 7 authors 1
Submitted by JesseTNRoberts 2 Investigating Human-Aligned Large Language Model Uncertainty · 4 authors 1
Submitted by FQiao 1 GenStereo: Towards Open-World Generation of Stereo Images and Unsupervised Matching · 4 authors 1
Submitted by zxbsmk - WISA: World Simulator Assistant for Physics-Aware Text-to-Video Generation · 12 authors 1
Submitted by Sckathach - Using Mechanistic Interpretability to Craft Adversarial Attacks against Large Language Models · 3 authors 1