Commit 3b31e4b (verified) · Parent: ba077bd
Committed by ChenShawn and nielsr (HF Staff)

Improve model card: Add pipeline tag, library name, and project description (#1)

- Improve model card: Add pipeline tag, library name, and project description (18877196e5a4b8bd22d717ae6cd165f497554b0b)

Co-authored-by: Niels Rogge <[email protected]>

Files changed (1): README.md (+219 −4)

README.md (new version):
---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
language:
- en
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

<div align="center">
<img src="docs/logo-deepeyes.jpg" alt="logo" height="100">
<h1 style="font-size: 32px; font-weight: bold;"> DeepEyes: Incentivizing “Thinking with Images” via Reinforcement Learning </h1>

<br>

<a href="https://arxiv.org/abs/2505.14362">
<img src="https://img.shields.io/badge/ArXiv-DeepEyes-brown?logo=arxiv" alt="Paper">
</a>
<a href="https://huggingface.co/datasets/ChenShawn/DeepEyes-Datasets-47k">
<img src="https://img.shields.io/badge/🤗 huggingface-Dataset-blue" alt="dataset">
</a>
<a href="https://huggingface.co/ChenShawn/DeepEyes-7B">
<img src="https://img.shields.io/badge/🤗 huggingface-Model-purple" alt="checkpoint">
</a>
<!-- <a href="https://visual-agent.github.io/">
<img src="https://img.shields.io/badge/-HomePage-black?logo=github" alt="checkpoint">
</a> -->
</div>

*\* Logo inspired by the oracle bone character for "eye".*

## DeepEyes
Quote from [https://openai.com/index/thinking-with-images/](https://openai.com/index/thinking-with-images/):
> They don’t just see an image, they can integrate visual information directly into the reasoning chain.

![](docs/fig2.png)

Key insights:
- The ability of DeepEyes to think with images is learned via end-to-end reinforcement learning. It is guided directly by outcome reward signals, requires no cold start or supervised fine-tuning, and does not rely on any specialized external model.
- Although no direct supervision is applied to the intermediate steps, both the grounding IoU and the tool-calling accuracy increase during the RL training stage.
![](docs/fig_finding1.svg)
- End-to-end RL training yields significant performance gains on high-resolution benchmarks and generalizes well to visual grounding, hallucination mitigation, and math problem solving.
![](docs/accuracy_comparison.svg)
- We observed the emergence of thinking patterns during RL training, such as visual search for small objects, visual comparison across different regions, and using the `image_zoom_in_tool` for answer verification.
![](docs/fig1_sc2.png)

## Quick Start

### Environment Setup

```bash
# Follow the VeRL official installation procedure
pip install -e .

# Additional dependencies required by DeepEyes
bash scripts/install_deepeyes.sh
```

### Start Training

We use [Qwen-2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) as the foundation model for RL training. [Qwen-2.5-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct) is also supported.

We recommend no fewer than 32 GPUs (4 nodes x 8 GPUs) for 7B training and no fewer than 64 GPUs (8 nodes x 8 GPUs) for 32B training. For each node, we recommend at least 1200 GB of CPU RAM, as the high-resolution images in the V* and ArxivQA datasets can consume a large amount of memory.

Step 1: Start a vLLM server of [Qwen-2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) for LLM-as-a-judge verification.

```bash
# download the Qwen-2.5-72B-Instruct model
huggingface-cli download --resume-download Qwen/Qwen2.5-72B-Instruct --local-dir /path/to/your/local/filedir --local-dir-use-symlinks False

# start the vLLM server
vllm serve /path/to/your/local/filedir \
    --port 18901 \
    --gpu-memory-utilization 0.8 \
    --max-model-len 32768 \
    --tensor-parallel-size 8 \
    --served-model-name "judge" \
    --trust-remote-code \
    --disable-log-requests
```

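Once the server is up, you can sanity-check the judge endpoint through the OpenAI-compatible API that vLLM exposes. This is only an illustrative check, not part of the training pipeline; the host is a placeholder, and the actual judging prompts are constructed by the training code.

```bash
# Illustrative sanity check of the judge endpoint (host is a placeholder)
curl http://your.vllm.machine.ip:18901/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "judge", "messages": [{"role": "user", "content": "Reply with OK if you can read this."}], "max_tokens": 8}'
```
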
Step 2: Build a Ray cluster across all of the training nodes, and prepare the data before starting training. Our training dataset can be downloaded from [huggingface](https://huggingface.co/datasets/ChenShawn/DeepEyes-Datasets-47k).

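The repository does not pin exact Ray commands; the following is a minimal sketch assuming the default Ray port, with placeholder addresses and paths.

```bash
# On the head node (port and dashboard settings are placeholders):
ray start --head --port=6379 --dashboard-host=0.0.0.0

# On every other training node, join the cluster:
ray start --address="head.node.ip:6379"

# Download the training parquet files (local path is a placeholder):
huggingface-cli download --repo-type dataset ChenShawn/DeepEyes-Datasets-47k \
    --local-dir /path/to/your/data/dir
```
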
Step 3: Use one of the following scripts to start training.

```bash
# your wandb access key here...
wandb login

# the IP and port of your Qwen-2.5-72B-Instruct vLLM server
export LLM_AS_A_JUDGE_BASE="http://your.vllm.machine.ip:18901/v1"

# number of training nodes
export WORLD_SIZE=8

# config for 7B
bash examples/agent/final_merged_v1v8_thinklite.sh

# config for 32B
bash examples/agent/final_merged_v1v8_thinklite_32b.sh
```

The training scripts use both [wandb](https://wandb.ai/site/) and the [RL Logging Board](https://github.com/HarderThenHarder/RLLoggingBoard) (great work) to visualize the training dynamics.

## Programming Guide

<details>
<summary>General Introduction for Codes</summary>

### General Introduction

The code in this repository is a general agentic RL training framework based on [VeRL](https://github.com/volcengine/verl). Beyond DeepEyes, it can be used for any form of general agentic (multi-turn) RL training.

The code is designed to fulfill the following needs:
- **Highly efficient agent RL training**: Agent rollout is asynchronous across all data parallel groups.
- **Dynamic multi-modal input in agent observations**: This is the key to RL training of the "thinking with images" ability.
- **Hybrid training of agentic data with different tools and non-agentic data**: Tool usage is not hard-coded in the rollout loop; instead, each sample specifies its own tool usage constraint via the `env_name` field.
- **Algorithm support**: PPO, GRPO, and REINFORCE++ are supported. We modified the advantage estimation, the policy loss masks, and the mrope for Qwen-VL models to make them compatible with the interleaved structure of agentic multi-turn RL training (an illustrative sketch of the loss masking idea follows this list).
- **Compatibility with the latest VeRL updates**: Agentic RL training is implemented as a plugin for VeRL, making it easy to merge the latest VeRL updates. Once you turn the plugin switch off, the functionality is no different from the original version of VeRL.

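To illustrate why the loss masks matter: tool observations (e.g., zoomed-in image tokens) appear inside the rollout but are not sampled from the policy, so they must be excluded from the policy gradient. The snippet below is only a conceptual sketch with hypothetical tensor names, not the actual implementation in this repository.

```python
import torch

def masked_policy_loss(logprobs, old_logprobs, advantages, response_mask, clip_eps=0.2):
    """PPO-style clipped loss averaged only over model-generated tokens.

    response_mask is 1 for tokens sampled by the policy and 0 for tokens that
    were injected as tool observations, so no gradient flows through them.
    All tensors share the shape (batch, seq_len).
    """
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    per_token_loss = -torch.min(unclipped, clipped)
    return (per_token_loss * response_mask).sum() / response_mask.sum().clamp(min=1)
```
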
</details>

<details>
<summary>Training on Customized Datasets</summary>

### Use your own data
Add an additional field `env_name` to your data parquet files. The `env_name` of each sample specifies which tool is allowed during agent rollout. For non-agentic training data, leave `env_name` as None or an empty string.

For DeepEyes-style training, for example, `env_name` should be set to `visual_toolbox_v2`.

Everything else is identical to the original VeRL dataset format; refer to the [VeRL official documentation](https://verl.readthedocs.io/en/latest/index.html) for details. A minimal sketch of adding the field is shown below.

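For example, tagging an existing parquet file with `env_name` (file names are placeholders, and all other columns follow the usual VeRL schema):

```python
import pandas as pd

df = pd.read_parquet("your_dataset.parquet")

# Agentic samples: allow the DeepEyes zoom-in toolbox during rollout.
df["env_name"] = "visual_toolbox_v2"

# Non-agentic samples would instead carry an empty string, e.g.:
# df.loc[non_agent_rows, "env_name"] = ""

df.to_parquet("your_dataset_with_env_name.parquet")
```
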
</details>

<details>
<summary>Training with Customized Tools</summary>

### Implement your own tools
Implement your tool in a new class that inherits from the `ToolBase` class in [verl/workers/agent/tool_envs.py](verl/workers/agent/tool_envs.py).

The subclass MUST define a `name` variable whose value corresponds to the `env_name` field in the training data parquet files.

Implement the `execute` and `reset` functions. Here is a simple example:

```python
class CustomTool(ToolBase):
    name = "custom_tool_v0"

    def __init__(self, _name, _desc, _params, **kwargs):
        super().__init__(name=self.name)

    def execute(self, action_string: str, **kwargs) -> tuple:
        """
        Execute the tool functionality based on the LLM-generated text.
        This function is called EACH TIME after vllm.generate.

        Args:
            action_string: The string generated by the LLM via vllm.generate.

        Returns:
            observation: The structured observation with the processed image.
            reward: Set a non-zero value if you want to assign a reward to the LAST GENERATED TOKEN of the intermediate step.
            done: Whether the episode is terminated.
            info: Additional info.
        """
        pass

    def reset(self, raw_prompt, multi_modal_data, origin_multi_modal_data, **kwargs):
        """
        This function is called ONLY ONCE when initializing the tools.

        Args:
            raw_prompt: Set the config param `data.return_raw_chat=True` to get the raw prompt input.
            multi_modal_data: Refer to the vLLM documentation for details: https://docs.vllm.ai/en/stable/features/multimodal_inputs.html
            origin_multi_modal_data: The VLM vision processor can modify the original images (typically by resizing) when they are too small or too large; use this param if you need access to the unmodified vision input.
        """
        pass
```

Refer to [verl/workers/agent/envs/mm_process_engine/visual_toolbox_v2.py](verl/workers/agent/envs/mm_process_engine/visual_toolbox_v2.py) as an example: it implements the `image_zoom_in_tool` used in DeepEyes.

**Important**: Import your custom tool in [verl/workers/agent/__init__.py](verl/workers/agent/__init__.py)

```python
from .envs.your_custom_tool import CustomTool
```

</details>

<details>
<summary>Using the latest VeRL code</summary>

### Using the latest VeRL code
In case you want to use the latest VeRL code for training:

```bash
git remote add official https://github.com/volcengine/verl.git
git pull official main
```

</details>

## License

This project is released under the [Apache license](./LICENSE).

## Citation

```
@article{zheng2025deepeyesincentivizingthinkingimages,
      title={DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning},
      author={Ziwei Zheng and Michael Yang and Jack Hong and Chenxiao Zhao and Guohai Xu and Le Yang and Chao Shen and Xing Yu},
      year={2025},
      eprint={2505.14362},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.14362},
}
```