Jakiro: Boosting Speculative Decoding with Decoupled Multi-Head via MoE Paper • 2502.06282 • Published Feb 10 • 5
FAST: Efficient Action Tokenization for Vision-Language-Action Models Paper • 2501.09747 • Published Jan 16 • 23
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling Paper • 2412.05271 • Published Dec 6, 2024 • 141
Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis Paper • 2412.04431 • Published Dec 5, 2024 • 18
GRAPE: Generalizing Robot Policy via Preference Alignment Paper • 2411.19309 • Published Nov 28, 2024 • 44
Building and better understanding vision-language models: insights and future directions Paper • 2408.12637 • Published Aug 22, 2024 • 126
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model Paper • 2408.11039 • Published Aug 20, 2024 • 60
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD Paper • 2404.06512 • Published Apr 9, 2024 • 30
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection Paper • 2403.03507 • Published Mar 6, 2024 • 188
ShortGPT: Layers in Large Language Models are More Redundant Than You Expect Paper • 2403.03853 • Published Mar 6, 2024 • 64
Enhancing Vision-Language Pre-training with Rich Supervisions Paper • 2403.03346 • Published Mar 5, 2024 • 17
Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision Paper • 2312.09390 • Published Dec 14, 2023 • 33