New Papers
R1-Onevision: Advancing Generalized Multimodal Reasoning through
Cross-Modal Formalization
Paper
• 2503.10615
• Published
• 17
UniGoal: Towards Universal Zero-shot Goal-oriented Navigation
Paper
• 2503.10630
• Published
• 6
Search-R1: Training LLMs to Reason and Leverage Search Engines with
Reinforcement Learning
Paper
• 2503.09516
• Published
• 38
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through
Two-Stage Rule-Based RL
Paper
• 2503.07536
• Published
• 88
Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning
Paper
• 2503.07572
• Published
• 48
SegAgent: Exploring Pixel Understanding Capabilities in MLLMs by
Imitating Human Annotator Trajectories
Paper
• 2503.08625
• Published
• 27
OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature
Extraction
Paper
• 2503.03734
• Published
• 1
Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and
Beyond
Paper
• 2503.10460
• Published
• 30
Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for
Embodied Interactive Tasks
Paper
• 2503.21696
• Published
• 23
ViLBench: A Suite for Vision-Language Process Reward Modeling
Paper
• 2503.20271
• Published
• 7
Gemini Robotics: Bringing AI into the Physical World
Paper
• 2503.20020
• Published
• 31
Qwen2.5-Omni Technical Report
Paper
• 2503.20215
• Published
• 170
Dita: Scaling Diffusion Transformer for Generalist
Vision-Language-Action Policy
Paper
• 2503.19757
• Published
• 51
AppAgentX: Evolving GUI Agents as Proficient Smartphone Users
Paper
• 2503.02268
• Published
• 11
Unified Reward Model for Multimodal Understanding and Generation
Paper
• 2503.05236
• Published
• 123
R1-Searcher: Incentivizing the Search Capability in LLMs via
Reinforcement Learning
Paper
• 2503.05592
• Published
• 27
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
Paper
• 2503.13444
• Published
• 17
Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers
Paper
• 2503.11579
• Published
• 21
VisualPRM: An Effective Process Reward Model for Multimodal Reasoning
Paper
• 2503.10291
• Published
• 36
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
Paper
• 2503.15558
• Published
• 50
DeepPerception: Advancing R1-like Cognitive Visual Perception in MLLMs
for Knowledge-Intensive Visual Grounding
Paper
• 2503.12797
• Published
• 32
Advances and Challenges in Foundation Agents: From Brain-Inspired
Intelligence to Evolutionary, Collaborative, and Safe Systems
Paper
• 2504.01990
• Published
• 303
Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs
Paper
• 2504.00072
• Published
• 6
Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement
Learning on the Base Model
Paper
• 2503.24290
• Published
• 62
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language
Models via Mixture-of-LoRAs
Paper
• 2503.01743
• Published
• 89
VEM: Environment-Free Exploration for Training GUI Agent with Value
Environment Model
Paper
• 2502.18906
• Published
• 12
Qwen2.5-VL Technical Report
Paper
• 2502.13923
• Published
• 214
Agentic Knowledgeable Self-awareness
Paper
• 2504.03553
• Published
• 27
Slow-Fast Architecture for Video Multi-Modal Large Language Models
Paper
• 2504.01328
• Published
• 7
Paper
• 2504.07491
• Published
• 137
SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual
Reasoning Self-Improvement
Paper
• 2504.07934
• Published
• 21
OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training
Tokens
Paper
• 2504.07096
• Published
• 77
Caption Anything in Video: Fine-grained Object-centric Captioning via
Spatiotemporal Multimodal Prompting
Paper
• 2504.05541
• Published
• 15
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement
Fine-Tuning
Paper
• 2504.06958
• Published
• 13
Efficient Reinforcement Finetuning via Adaptive Curriculum Learning
Paper
• 2504.05520
• Published
• 11
ShieldAgent: Shielding Agents via Verifiable Safety Policy Reasoning
Paper
• 2503.22738
• Published
• 17
APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated
Agent-Human Interplay
Paper
• 2504.03601
• Published
• 17
SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge
Refinement
Paper
• 2504.03561
• Published
• 18
Towards Trustworthy GUI Agents: A Survey
Paper
• 2503.23434
• Published
• 21
Sleep-time Compute: Beyond Inference Scaling at Test-time
Paper
• 2504.13171
• Published
• 15
BitNet b1.58 2B4T Technical Report
Paper
• 2504.12285
• Published
• 83
InternVL3: Exploring Advanced Training and Test-Time Recipes for
Open-Source Multimodal Models
Paper
• 2504.10479
• Published
• 306
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent
Trajectories
Paper
• 2504.08942
• Published
• 28
TinyLLaVA-Video-R1: Towards Smaller LMMs for Video Reasoning
Paper
• 2504.09641
• Published
• 16
M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models
Paper
• 2504.10449
• Published
• 15
SpecReason: Fast and Accurate Inference-Time Compute via Speculative
Reasoning
Paper
• 2504.07891
• Published
• 5
ReTool: Reinforcement Learning for Strategic Tool Use in LLMs
Paper
• 2504.11536
• Published
• 63
SFT or RL? An Early Investigation into Training R1-Like Reasoning Large
Vision-Language Models
Paper
• 2504.11468
• Published
• 30
Skywork R1V2: Multimodal Hybrid Reinforcement Learning for Reasoning
Paper
• 2504.16656
• Published
• 57
Paper2Code: Automating Code Generation from Scientific Papers in Machine
Learning
Paper
• 2504.17192
• Published
• 123
Process Reward Models That Think
Paper
• 2504.16828
• Published
• 18
Describe Anything: Detailed Localized Image and Video Captioning
Paper
• 2504.16072
• Published
• 64
Progent: Programmable Privilege Control for LLM Agents
Paper
• 2504.11703
• Published
• 6
TTRL: Test-Time Reinforcement Learning
Paper
• 2504.16084
• Published
• 120
Learning to Reason under Off-Policy Guidance
Paper
• 2504.14945
• Published
• 88
ToolRL: Reward is All Tool Learning Needs
Paper
• 2504.13958
• Published
• 49
LearnAct: Few-Shot Mobile GUI Agent with a Unified Demonstration
Benchmark
Paper
• 2504.13805
• Published
• 11
AM-Thinking-v1: Advancing the Frontier of Reasoning at 32B Scale
Paper
• 2505.08311
• Published
• 19
Skywork-VL Reward: An Effective Reward Model for Multimodal
Understanding and Reasoning
Paper
• 2505.07263
• Published
• 30
StreamBridge: Turning Your Offline Video Large Language Model into a
Proactive Streaming Assistant
Paper
• 2505.05467
• Published
• 13
Visual Planning: Let's Think Only with Images
Paper
• 2505.11409
• Published
• 57
Scaling Computer-Use Grounding via User Interface Decomposition and
Synthesis
Paper
• 2505.13227
• Published
• 45
Efficient Agent Training for Computer Use
Paper
• 2505.13909
• Published
• 44
One RL to See Them All: Visual Triple Unified Reinforcement Learning
Paper
• 2505.18129
• Published
• 62
Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement
Fine-Tuning of Large Language Models
Paper
• 2505.17826
• Published
• 10
Interactive Post-Training for Vision-Language-Action Models
Paper
• 2505.17016
• Published
• 6
ARM: Adaptive Reasoning Model
Paper
• 2505.20258
• Published
• 45
Visual Embodied Brain: Let Multimodal Large Language Models See, Think,
and Control in Spaces
Paper
• 2506.00123
• Published
• 35
ZeroGUI: Automating Online GUI Learning at Zero Human Cost
Paper
• 2505.23762
• Published
• 45
Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV
Cache and Parallel Decoding
Paper
• 2505.22618
• Published
• 45
Paper
• 2505.23419
• Published
• 21
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in
Large Language Models
Paper
• 2505.24864
• Published
• 144
ReasonGen-R1: CoT for Autoregressive Image generation models through SFT
and RL
Paper
• 2505.24875
• Published
• 10
Reflect, Retry, Reward: Self-Improving LLMs via Reinforcement Learning
Paper
• 2505.24726
• Published
• 277
WebSailor: Navigating Super-human Reasoning for Web Agent
Paper
• 2507.02592
• Published
• 124
MemOS: A Memory OS for AI System
Paper
• 2507.03724
• Published
• 159
RoboBrain 2.0 Technical Report
Paper
• 2507.02029
• Published
• 35
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
Paper
• 2506.11763
• Published
• 74
AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions
Paper
• 2506.09038
• Published
• 6
GTA1: GUI Test-time Scaling Agent
Paper
• 2507.05791
• Published
• 27
Budget-Aware Tool-Use Enables Effective Agent Scaling
Paper
• 2511.17006
• Published
• 33
Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO
Paper
• 2511.13288
• Published
• 19
EvoVLA: Self-Evolving Vision-Language-Action Model
Paper
• 2511.16166
• Published
• 6
WMPO: World Model-based Policy Optimization for Vision-Language-Action Models
Paper
• 2511.09515
• Published
• 20
TiDAR: Think in Diffusion, Talk in Autoregression
Paper
• 2511.08923
• Published
• 128
Robot Learning from a Physical World Model
Paper
• 2511.07416
• Published
• 32
Long Grounded Thoughts: Distilling Compositional Visual Reasoning Chains at Scale
Paper
• 2511.05705
• Published
• 8
Real-Time Reasoning Agents in Evolving Environments
Paper
• 2511.04898
• Published
• 13
VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual
Representation
Paper
• 2511.02778
• Published
• 102
World Simulation with Video Foundation Models for Physical AI
Paper
• 2511.00062
• Published
• 44
Can Agent Conquer Web? Exploring the Frontiers of ChatGPT Atlas Agent in
Web Games
Paper
• 2510.26298
• Published
• 46
Supervised Reinforcement Learning: From Expert Trajectories to Step-wise
Reasoning
Paper
• 2510.25992
• Published
• 48
Search More, Think Less: Rethinking Long-Horizon Agentic Search for Efficiency and Generalization
Paper
• 2602.22675
• Published
• 18
OmniGAIA: Towards Native Omni-Modal AI Agents
Paper
• 2602.22897
• Published
• 49