Improving Token-Based World Models with Parallel Observation Prediction Paper • 2402.05643 • Published Feb 8, 2024 • 1
DeeR-VLA: Dynamic Inference of Multimodal Large Language Models for Efficient Robot Execution Paper • 2411.02359 • Published Nov 4, 2024 • 13
Classification Done Right for Vision-Language Pre-Training Paper • 2411.03313 • Published Nov 5, 2024
Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models Paper • 2412.14058 • Published Dec 18, 2024 • 1
Image Understanding Makes for A Good Tokenizer for Image Generation Paper • 2411.04406 • Published Nov 7, 2024
M³: A Modular World Model over Streams of Tokens Paper • 2502.11537 • Published Feb 17
Improving and Benchmarking Offline Reinforcement Learning Algorithms Paper • 2306.00972 • Published Jun 1, 2023
Decoupling Representation and Classifier for Long-Tailed Recognition Paper • 1910.09217 • Published Oct 21, 2019
Trace Anything: Representing Any Video in 4D via Trajectory Fields Paper • 2510.13802 • Published Oct 15 • 30
Depth Anything 3: Recovering the Visual Space from Any Views Paper • 2511.10647 • Published Nov 13 • 95
Manipulation as in Simulation: Enabling Accurate Geometry Perception in Robots Paper • 2509.02530 • Published Sep 2 • 10
Video Depth Anything: Consistent Depth Estimation for Super-Long Videos Paper • 2501.12375 • Published Jan 21 • 23
VideoWorld: Exploring Knowledge Learning from Unlabeled Videos Paper • 2501.09781 • Published Jan 16 • 28
Prompting Depth Anything for 4K Resolution Accurate Metric Depth Estimation Paper • 2412.14015 • Published Dec 18, 2024 • 12
How Far is Video Generation from World Model: A Physical Law Perspective Paper • 2411.02385 • Published Nov 4, 2024 • 34
Loong: Generating Minute-level Long Videos with Autoregressive Language Models Paper • 2410.02757 • Published Oct 3, 2024 • 36
Bag of Tricks for Training Data Extraction from Language Models Paper • 2302.04460 • Published Feb 9, 2023 • 2