Qwen-Image-Layered: Towards Inherent Editability via Layer Decomposition Paper • 2512.15603 • Published 8 days ago
MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence Paper • 2512.10863 • Published 14 days ago
Evaluating Gemini Robotics Policies in a Veo World Simulator Paper • 2512.10675 • Published 15 days ago
SeeNav-Agent: Enhancing Vision-Language Navigation with Visual Prompt and Step-Level Policy Optimization Paper • 2512.02631 • Published 24 days ago
TV2TV: A Unified Framework for Interleaved Language and Video Generation Paper • 2512.05103 • Published 21 days ago
SIMA 2: A Generalist Embodied Agent for Virtual Worlds Paper • 2512.04797 • Published 22 days ago
ProPhy: Progressive Physical Alignment for Dynamic World Simulation Paper • 2512.05564 • Published 21 days ago
COOPER: A Unified Model for Cooperative Perception and Reasoning in Spatial Intelligence Paper • 2512.04563 • Published 22 days ago
Embodied Referring Expression Comprehension in Human-Robot Interaction Paper • 2512.06558 • Published 19 days ago
VideoVLA: Video Generators Can Be Generalizable Robot Manipulators Paper • 2512.06963 • Published 18 days ago
Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation Paper • 2512.08186 • Published 17 days ago
MIND-V: Hierarchical Video Generation for Long-Horizon Robotic Manipulation with RL-based Physical Alignment Paper • 2512.06628 • Published 19 days ago
OneStory: Coherent Multi-Shot Video Generation with Adaptive Memory Paper • 2512.07802 • Published 17 days ago
Reflection Removal through Efficient Adaptation of Diffusion Transformers Paper • 2512.05000 • Published 21 days ago
WALL-E 2.0: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents Paper • 2504.15785 • Published Apr 22
Agent0-VL: Exploring Self-Evolving Agent for Tool-Integrated Vision-Language Reasoning Paper • 2511.19900 • Published about 1 month ago
PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image Paper • 2511.13648 • Published Nov 17
GigaWorld-0: World Models as Data Engine to Empower Embodied AI Paper • 2511.19861 • Published about 1 month ago
Chain-of-Visual-Thought: Teaching VLMs to See and Think Better with Continuous Visual Tokens Paper • 2511.19418 • Published Nov 24