Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
Abstract
Pref-GRPO, a pairwise preference reward-based GRPO method, enhances text-to-image generation by mitigating reward hacking and improving stability, while UniGenBench provides a comprehensive benchmark for evaluating T2I models.
Recent advancements highlight the importance of GRPO-based reinforcement learning methods and benchmarking in enhancing text-to-image (T2I) generation. However, current methods that use pointwise reward models (RM) to score generated images are susceptible to reward hacking. We reveal that this happens when minimal score differences between images are amplified after normalization, creating illusory advantages that drive the model to over-optimize for trivial gains, ultimately destabilizing the image generation process. To address this, we propose Pref-GRPO, a pairwise preference reward-based GRPO method that shifts the optimization objective from score maximization to preference fitting, ensuring more stable training. In Pref-GRPO, images within each group are compared pairwise using a preference RM, and the win rate is used as the reward signal. Extensive experiments demonstrate that Pref-GRPO differentiates subtle image quality differences, providing more stable advantages and mitigating reward hacking. Additionally, existing T2I benchmarks are limited by coarse evaluation criteria, hindering comprehensive model assessment. To solve this, we introduce UniGenBench, a unified T2I benchmark comprising 600 prompts across 5 main themes and 20 subthemes. It evaluates semantic consistency through 10 primary and 27 sub-criteria, leveraging MLLMs for benchmark construction and evaluation. Our benchmark uncovers the strengths and weaknesses of both open- and closed-source T2I models and validates the effectiveness of Pref-GRPO.
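To make the contrast between the two reward schemes concrete, here is a minimal, illustrative Python sketch (not the paper's implementation). `pointwise_advantages` mimics GRPO-style group normalization of pointwise RM scores, where trivial score gaps are divided by a tiny standard deviation and become large "illusory" advantages; `winrate_rewards` mimics the Pref-GRPO idea of deriving each image's reward from its win rate in pairwise comparisons within the group. The `prefer` callable is a hypothetical stand-in for a pairwise preference RM.

```python
import numpy as np

def pointwise_advantages(scores, eps=1e-6):
    """Group-normalized pointwise scores (GRPO-style).
    Near-identical scores yield a tiny std, so trivial gaps
    blow up into large positive/negative advantages."""
    scores = np.asarray(scores, dtype=float)
    return (scores - scores.mean()) / (scores.std() + eps)

def winrate_rewards(images, prefer):
    """Pairwise preference reward (Pref-GRPO-style sketch).
    `prefer(a, b)` is a hypothetical preference-RM call that
    returns True if image `a` is preferred over image `b`.
    Each image's reward is its win rate within the group."""
    n = len(images)
    wins = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i != j and prefer(images[i], images[j]):
                wins[i] += 1.0
    return wins / (n - 1)  # bounded in [0, 1]

# Toy example: nearly identical pointwise scores ...
scores = [0.801, 0.802, 0.803, 0.804]
print(pointwise_advantages(scores))  # roughly [-1.34, -0.45, 0.45, 1.34]

# ... versus bounded win-rate rewards from a mocked preference RM.
group = ["img_a", "img_b", "img_c", "img_d"]
mock_prefer = lambda a, b: group.index(a) > group.index(b)  # stand-in for the RM
print(winrate_rewards(group, mock_prefer))  # [0.0, 0.33, 0.67, 1.0]
```

The toy numbers show the failure mode described above: a 0.003 spread in pointwise scores still produces advantages on the order of ±1 after normalization, whereas the win-rate reward stays bounded and only reflects preference orderings within the group.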
Community
🌟Project page: https://codegoat24.github.io/UnifiedReward/Pref-GRPO
📖Paper: https://arxiv.org/pdf/2508.20751
💡Pref-GRPO Github: https://github.com/CodeGoat24/Pref-GRPO
💥UniGenBench Github: https://github.com/CodeGoat24/UniGenBench
🤗Leaderboard: https://huggingface.co/spaces/CodeGoat24/UniGenBench_Leaderboard
🤗Model: https://huggingface.co/CodeGoat24/FLUX.1-dev-PrefGRPO
Similar papers recommended by the Semantic Scholar API (via Librarian Bot):
- Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation (2025)
- TempFlow-GRPO: When Timing Matters for GRPO in Flow Models (2025)
- OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning (2025)
- Enhancing Reward Models for High-quality Image Generation: Beyond Text-Image Alignment (2025)
- CPO: Addressing Reward Ambiguity in Role-playing Dialogue via Comparative Policy Optimization (2025)
- Multimodal LLMs as Customized Reward Models for Text-to-Image Generation (2025)
- Posterior-GRPO: Rewarding Reasoning Processes in Code Generation (2025)
cool paper