VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model
Abstract
Training Vision-Language Models (VLMs) as Graphical User Interface (GUI) agents via Reinforcement Learning (RL) faces critical challenges: environment-based RL requires costly interactions, while environment-free methods struggle with distribution shift and reward generalization. We propose an environment-free RL framework that decouples value estimation from policy optimization by leveraging a pretrained Value Environment Model (VEM). VEM predicts state-action values directly from offline data, distilling human-like priors about GUI interaction outcomes without requiring next-state prediction or environmental feedback. This avoids compounding errors and enhances resilience to UI changes by focusing on semantic reasoning (e.g., "Does this action advance the user's goal?"). The framework operates in two stages: (1) pretraining VEM to estimate long-term action utilities and (2) guiding policy exploration with frozen VEM signals, enabling layout-agnostic GUI automation. Evaluated on Android-in-the-Wild benchmarks, VEM achieves state-of-the-art performance in both offline and online settings, significantly outperforming environment-free baselines and matching environment-based approaches without incurring interaction costs. Importantly, VEM demonstrates that semantic-aware value estimation can achieve performance comparable to online-trained methods.
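To make the two-stage recipe concrete, here is a minimal sketch of how such a pipeline could be wired up. Everything in it is an illustrative assumption rather than the authors' implementation: the `ValueEnvironmentModel` MLP head, the Monte Carlo-style value targets, and the REINFORCE-style policy update are placeholders (the paper works with a VLM policy over GUI screenshots, not toy feature vectors). The key ideas it mirrors are that the value model is trained purely from offline data and is then frozen while it scores the policy's sampled actions.

```python
# Minimal sketch of the two-stage VEM recipe described in the abstract.
# All module names, architectures, and objectives are illustrative
# assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ValueEnvironmentModel(nn.Module):
    """Predicts the long-term value of a (state, action) pair directly,
    without modeling the next state or querying a live environment."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)


def pretrain_vem(vem, offline_batches, lr=1e-4, epochs=1):
    """Stage 1: regress the VEM onto value targets (e.g., discounted returns)
    computed from offline GUI trajectories -- no environment interaction."""
    opt = torch.optim.Adam(vem.parameters(), lr=lr)
    for _ in range(epochs):
        for state, action, value_target in offline_batches:
            loss = F.mse_loss(vem(state, action), value_target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return vem


def policy_step(policy, vem, states, opt):
    """Stage 2: the frozen VEM scores sampled actions, and higher-valued
    actions are reinforced (a simple REINFORCE-style surrogate here)."""
    logits = policy(states)
    dist = torch.distributions.Categorical(logits=logits)
    actions = dist.sample()
    with torch.no_grad():  # the VEM stays frozen during policy optimization
        values = vem(states, F.one_hot(actions, logits.shape[-1]).float())
        advantages = values - values.mean()  # mean value as a crude baseline
    loss = -(dist.log_prob(actions) * advantages).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```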
Community
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Lean and Mean: Decoupled Value Policy Optimization with Global Value Guidance (2025)
- InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection (2025)
- Preference VLM: Leveraging VLMs for Scalable Preference-Based Reinforcement Learning (2025)
- QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search (2025)
- The Evolving Landscape of LLM- and VLM-Integrated Reinforcement Learning (2025)
- VSC-RL: Advancing Autonomous Vision-Language Agents with Variational Subgoal-Conditioned Reinforcement Learning (2025)
- Mobile-Agent-V: Learning Mobile Device Operation Through Video-Guided Multi-Agent Collaboration (2025)