-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 29 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 13 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 45 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 24
Collections
Collections including paper arxiv:2505.02707
-
Can Large Language Models Understand Context?
Paper • 2402.00858 • Published • 24 -
OLMo: Accelerating the Science of Language Models
Paper • 2402.00838 • Published • 84 -
Self-Rewarding Language Models
Paper • 2401.10020 • Published • 152 -
SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity
Paper • 2401.17072 • Published • 24
-
Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models
Paper • 2504.20157 • Published • 38 -
The Leaderboard Illusion
Paper • 2504.20879 • Published • 70 -
ReasonIR: Training Retrievers for Reasoning Tasks
Paper • 2504.20595 • Published • 55 -
RM-R1: Reward Modeling as Reasoning
Paper • 2505.02387 • Published • 79
-
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
Paper • 2504.11536 • Published • 61 -
Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning
Paper • 2505.24726 • Published • 271 -
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Paper • 2503.12605 • Published • 36 -
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
Paper • 2506.13585 • Published • 263
-
DocLLM: A layout-aware generative language model for multimodal document understanding
Paper • 2401.00908 • Published • 189 -
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
Paper • 2401.00849 • Published • 17 -
LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Paper • 2311.05437 • Published • 51 -
LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
Paper • 2311.00571 • Published • 43
-
Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play
Paper • 2505.02707 • Published • 86 -
MUSAR: Exploring Multi-Subject Customization from Single-Subject Dataset via Attention Routing
Paper • 2505.02823 • Published • 5 -
PixelHacker: Image Inpainting with Structural and Semantic Consistency
Paper • 2504.20438 • Published • 44 -
Improving Editability in Image Generation with Layer-wise Memory
Paper • 2505.01079 • Published • 29
-
Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
Paper • 2501.18585 • Published • 62 -
RWKV-7 "Goose" with Expressive Dynamic State Evolution
Paper • 2503.14456 • Published • 153 -
DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning
Paper • 2503.15265 • Published • 47 -
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
Paper • 2503.15558 • Published • 51
-
Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Paper • 2412.15322 • Published • 20 -
Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play
Paper • 2505.02707 • Published • 86 -
LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis
Paper • 2505.02625 • Published • 22 -
Fast Text-to-Audio Generation with Adversarial Post-Training
Paper • 2505.08175 • Published • 23