xywang626 committed · Commit b3b17c6 · verified · 1 Parent(s): 19e75c2

Update README.md

Files changed (1): README.md (+304 -3)

README.md CHANGED
---
license: mit
---

# 1. OpenCUA: Open Foundations for Computer-Use Agents
🏠 [Project Homepage](https://opencua.xlang.ai/)

OpenCUA is a comprehensive open-source framework for scaling computer-use agent (CUA) data and foundation models.
Our framework consists of:
(1) an annotation infrastructure that seamlessly captures human computer-use demonstrations;
(2) AGENTNET, the first large-scale computer-use task dataset spanning 3 operating systems and 200+ applications and websites;
(3) a scalable pipeline that transforms demonstrations into state–action pairs with reflective long Chain-of-Thought reasoning that sustain robust performance gains as data scales.
Our end-to-end agent models demonstrate strong performance across CUA benchmarks.
In particular, OpenCUA-32B achieves an average success rate of **34.8%** on OSWorld-Verified, establishing a new state-of-the-art (SOTA) among open-source models and surpassing OpenAI CUA (GPT-4o).

## 2. Model Variants

We release four model variants with different capabilities and computational requirements:

- **OpenCUA-A3B**: Efficient MoE model with 3B active parameters (16B total parameters), based on Kimi-VL-A3B
- **OpenCUA-Qwen2-7B**: Based on Qwen2-VL-7B with enhanced CUA capabilities
- **OpenCUA-7B**: Our 7B model based on Qwen2.5-VL-7B
- **OpenCUA-32B**: Large-scale 32B model based on Qwen2.5-VL-32B for maximum performance

## 3. Key Features

- **Superior Computer-Use Capability**: Able to execute multi-step computer-use actions with effective planning and reasoning
- **Multi-OS Support**: Trained on demonstrations across Ubuntu, Windows, and macOS
- **Visual Grounding**: Strong GUI element recognition and spatial reasoning capabilities
- **Multi-Image Context**: Processes a history of up to 3 screenshots for better context understanding
- **Reflective Reasoning**: Enhanced with reflective long Chain-of-Thought that identifies errors and provides corrective reasoning

## 4. Performance

### Online Agent Evaluation
OpenCUA models achieve strong performance on **[OSWorld-Verified](https://os-world.github.io/)**.
OpenCUA-32B achieves the best performance among all open-source models with an average success rate of 34.8%, outperforming prior baselines by large margins.
It also closes the gap to proprietary Claude models.

| **Model** | **15 Steps** | **50 Steps** | **100 Steps** |
|-------------------------------|:--------:|:--------:|:---------:|
| **Proprietary** | | | |
| OpenAI CUA | 26.0 | 31.3 | 31.4 |
| Seed 1.5-VL | 27.9 | — | 34.1 |
| Claude 3.7 Sonnet | 27.1 | 35.8 | 35.9 |
| Claude 4 Sonnet | 31.2 | 43.9 | 41.5 |
| **Open-Source** | | | |
| Qwen 2.5-VL-32B-Instruct | 3.0 | — | 3.9 |
| Qwen 2.5-VL-72B-Instruct | 4.4 | — | 5.0 |
| Kimi-VL-A3B | 9.7 | — | 10.3 |
| UI-TARS-72B-DPO | 24.0 | 25.8 | 27.1 |
| UI-TARS-1.5-7B | 24.5 | 27.3 | 27.4 |
| OpenCUA-7B *(Ours)* | 24.3 | 27.9 | 26.6 |
| **OpenCUA-32B *(Ours)*** | **29.7** | **34.1** | **34.8** |

*OpenCUA scores are the mean of 3 independent runs.*

### GUI Grounding Performance

| **Model** | **OSWorld-G** | **ScreenSpot-V2** | **ScreenSpot-Pro** |
|-------|-----------|---------------|----------------|
| Qwen2.5-VL-7B | 31.4 | 88.8 | 27.6 |
| Qwen2.5-VL-32B | 46.5 | 87.0 | 39.4 |
| UI-TARS-72B | 57.1 | 90.3 | 38.1 |
| **OpenCUA-A3B** | 48.6 | 91.4 | 28.5 |
| **OpenCUA-7B** | 45.7 | 88.5 | 23.7 |
| **OpenCUA-2.5-7B** | 55.3 | 92.3 | 50.0 |
| **OpenCUA-2.5-32B** | **59.6** | **93.4** | **55.3** |

### AgentNetBench (Offline Evaluation)

| **Model** | **Coordinate Actions** | **Content Actions** | **Function Actions** | **Average** |
|-------|-------------------|-----------------|------------------|---------|
| Qwen2.5-VL-7B | 50.7 | 40.8 | 3.1 | 48.0 |
| Qwen2.5-VL-32B | 66.6 | 47.2 | 41.5 | 64.8 |
| Qwen2.5-VL-72B | 67.2 | 52.6 | 50.5 | 67.0 |
| OpenAI CUA | 71.7 | **57.3** | **80.0** | 73.1 |
| **OpenCUA-2.5-7B** | 75.4 | 46.4 | 53.6 | 71.0 |
| **OpenCUA-2.5-32B** | **78.7** | 46.0 | 55.2 | **73.2** |

## 5. Usage
<div style="border-left: 6px solid #f28c28; background: #fff8e6; padding: 12px 16px; margin: 16px 0;">
<strong>⚠️ Important for Qwen-based Models (OpenCUA-Qwen2-7B, OpenCUA-7B, OpenCUA-32B):</strong>

To align with our training infrastructure, we have modified the model in two places:
<ul style="margin-top: 8px;">
<li>Multimodal Rotary Position Embedding (M-RoPE) has been replaced with 1D RoPE.</li>
<li>The model uses the same tokenizer and chat template as Kimi-VL.</li>
<li>Therefore, do not use the default Transformers or vLLM classes to load the model.</li>
</ul>
</div>


### 5.1 Installation

First, install the required dependencies:

```bash
pip install transformers==4.53
pip install torch pillow
```
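
As an optional sanity check (our own suggestion, not part of the original instructions), you can confirm that the pinned `transformers` version is active and that a CUDA GPU is visible; the 32B model in particular needs substantial GPU memory:

```python
# Optional environment check (illustrative only).
import torch
import transformers

print("transformers:", transformers.__version__)   # expected: 4.53.x
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```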

### 5.2 Download the Model Weights
```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="xlangai/OpenCUA-32B",
    local_dir="OpenCUA-32B",
    local_dir_use_symlinks=False,
)
```
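
Once the snapshot is on disk, the examples below can point either at that local folder or directly at the Hub repo id (a small usage note of ours, not from the original README):

```python
# Either a local snapshot directory or the Hub repo id works as model_path.
model_path = "OpenCUA-32B"            # local directory created by snapshot_download above
# model_path = "xlangai/OpenCUA-32B"  # or load directly from the Hugging Face Hub
```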

### 5.3 Basic Usage for GUI Grounding

The following code demonstrates how to use OpenCUA models for GUI grounding tasks:

```python
import base64
import torch
from transformers import AutoTokenizer, AutoModel, AutoImageProcessor
from PIL import Image
import json


def encode_image(image_path: str) -> str:
    """Encode image to base64 string for model input."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode()


def load_opencua_model(model_path: str):
    """Load OpenCUA model, tokenizer, and image processor."""
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    model = AutoModel.from_pretrained(
        model_path,
        torch_dtype="auto",
        device_map="auto",
        trust_remote_code=True
    )
    image_processor = AutoImageProcessor.from_pretrained(model_path, trust_remote_code=True)

    return model, tokenizer, image_processor


def create_grounding_messages(image_path: str, instruction: str):
    """Create chat messages for GUI grounding task."""
    system_prompt = (
        "You are a GUI agent. You are given a task and a screenshot of the screen. "
        "You need to perform a series of pyautogui actions to complete the task."
    )

    messages = [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": f"data:image/png;base64,{encode_image(image_path)}"},
                {"type": "text", "text": instruction},
            ],
        },
    ]
    return messages


def run_inference(model, tokenizer, image_processor, messages, image_path):
    """Run inference on the model."""
    # Prepare text input
    input_ids = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True
    )
    input_ids = torch.tensor([input_ids]).to(model.device)

    # Prepare image input
    image = Image.open(image_path).convert('RGB')
    image_info = image_processor.preprocess(images=[image])
    pixel_values = torch.tensor(image_info['pixel_values']).to(
        dtype=torch.bfloat16, device=model.device
    )
    grid_thws = torch.tensor(image_info['image_grid_thw'])

    # Generate response
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids,
            pixel_values=pixel_values,
            grid_thws=grid_thws,
            max_new_tokens=512,
            temperature=0
        )

    # Decode output
    prompt_len = input_ids.shape[1]
    generated_ids = generated_ids[:, prompt_len:]
    output_text = tokenizer.batch_decode(
        generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )[0]

    return output_text


# Example usage
model_path = "xlangai/OpenCUA-32B"  # or another model variant, or a local snapshot directory
image_path = "screenshot.png"
instruction = "Click on the submit button"

# Load model
model, tokenizer, image_processor = load_opencua_model(model_path)

# Create messages and run inference
messages = create_grounding_messages(image_path, instruction)
result = run_inference(model, tokenizer, image_processor, messages, image_path)

print("Model output:", result)
```

<div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
<em>Expected result:</em> <code>pyautogui.click(x=1443, y=343)</code> (the model returns the call inside a fenced <code>python</code> block)
</div>
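
The action comes back as a short `pyautogui` call wrapped in a fenced code block, as shown above. A minimal parsing sketch for pulling the click coordinates out of the generated text (the helper name and regex below are our own, not part of the OpenCUA API):

```python
import re

def extract_click_coordinates(output_text: str):
    """Parse `pyautogui.click(x=..., y=...)` from the model output, if present."""
    match = re.search(
        r"pyautogui\.click\(\s*x=(\d+(?:\.\d+)?)\s*,\s*y=(\d+(?:\.\d+)?)\s*\)",
        output_text,
    )
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))

# Reusing `result` from the example above:
coords = extract_click_coordinates(result)
print("Parsed coordinates:", coords)  # e.g. (1443.0, 343.0)
```

Note that the parsed values are in the model's own coordinate system; see the next section for mapping them back onto the original screenshot.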

### 5.4 Important Notes on Coordinate Systems
<div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
<ul style="margin: 0;">
<li><strong><code>OpenCUA/OpenCUA-A3B</code></strong> – Relative coordinates <em>(not supported in this code)</em></li>
<li><strong><code>OpenCUA/OpenCUA-Qwen2-7B</code></strong> – Relative coordinates</li>
<li><strong><code>OpenCUA/OpenCUA-7B</code></strong> – Absolute coordinates</li>
<li><strong><code>OpenCUA/OpenCUA-32B</code></strong> – Absolute coordinates</li>
</ul>
</div>

**OpenCUA models use different coordinate systems depending on the base model:**

- **OpenCUA-Qwen2-7B**: Outputs **relative coordinates** (0.0 to 1.0 range)

  ```python
  # Example output: pyautogui.click(x=0.5, y=0.3)
  # x=0.5 means 50% from the left edge, y=0.3 means 30% from the top edge

  # Convert to absolute coordinates:
  def qwen2_relative_to_absolute(rel_x, rel_y, original_width, original_height):
      abs_x = int(rel_x * original_width)
      abs_y = int(rel_y * original_height)
      return abs_x, abs_y
  ```

- **OpenCUA-7B and OpenCUA-32B** (Qwen2.5-based): Output **absolute coordinates** after smart resize

  ```python
  # Example output: pyautogui.click(x=960, y=324)
  # These are coordinates on the smart-resized image, not the original image

  # Convert to original image coordinates:
  # Please refer to the smart_resize function in:
  # https://github.com/huggingface/transformers/blob/67ddc82fbc7e52c6f42a395b4a6d278c55b77a39/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py#L55
  def qwen25_smart_resize_to_absolute(model_x, model_y, original_width, original_height):
      # First, calculate the smart-resized dimensions
      resized_height, resized_width = smart_resize(
          original_height, original_width, factor=28, min_pixels=3136, max_pixels=12845056
      )

      # Convert model output to relative coordinates
      rel_x = model_x / resized_width
      rel_y = model_y / resized_height

      # Then convert to absolute coordinates on the original image
      abs_x = int(rel_x * original_width)
      abs_y = int(rel_y * original_height)
      return abs_x, abs_y
  ```

<div style="border-left: 6px solid #9ca3af; background: #f5f5f5; padding: 12px 16px; margin: 16px 0;">
<strong>Understanding Smart Resize for Qwen2.5-based Models:</strong>
<p style="margin: 8px 0 0;">
The Qwen2.5-VL models use a “smart resize” preprocessing that maintains aspect ratio while fitting within pixel constraints.
For coordinate conversion, you need the smart resize function from the
<a href="https://github.com/QwenLM/Qwen2.5-VL/blob/d2240f11656bfe404b9ba56db4e51cd09f522ff1/qwen-vl-utils/src/qwen_vl_utils/vision_process.py#L60">
official Qwen2.5-VL implementation</a>.
</p>
</div>
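
For convenience, here is a sketch of that smart-resize computation, adapted from the linked Qwen2-VL/Qwen2.5-VL image-processing code (treat the official implementation as the source of truth if the two ever diverge):

```python
import math

def smart_resize(height: int, width: int, factor: int = 28,
                 min_pixels: int = 3136, max_pixels: int = 12845056):
    """Round (height, width) to multiples of `factor` while keeping the total
    pixel count within [min_pixels, max_pixels] and preserving aspect ratio."""
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

# Example: a 1920x1080 screenshot rounds to multiples of 28 -> (1092, 1932),
# so a model output of (x=960, y=324) maps back to roughly (954, 320) on the original image.
print(smart_resize(1080, 1920))  # (1092, 1932)
```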

## License

This project is licensed under the MIT License - see the LICENSE file for details.

## Research Use and Disclaimer

This software is intended for **research and educational purposes only**.

### Prohibited Uses
- The model may **not** be used for any purpose or activity that violates applicable laws or regulations in any jurisdiction
- Use for illegal, unethical, or harmful activities is strictly prohibited

### Disclaimer
- The authors, contributors, and copyright holders are **not responsible** for any illegal, unethical, or harmful use of the Software, nor for any direct or indirect damages resulting from such use
- Use of the "OpenCUA" name, logo, or trademarks does **not** imply any endorsement or affiliation unless separate written permission is obtained
- Users are solely responsible for ensuring their use complies with applicable laws and regulations

## Citation

If you use OpenCUA in your research, please cite our work:

```bibtex
@article{OpenCUA2025,
  title={OpenCUA: Open Foundations for Computer-Use Agents},
  author={Wang, Xinyuan and Wang, Bowen and Lu, Dunjie and Yang, Junlin and Xie, Tianbao and Wang, Junli and Deng, Jiaqi and Guo, Xiaole and Xu, Yiheng and Wu, Chen Henry and Shen, Zhennan and Li, Zhuokai and Li, Ryan and Li, Xiaochuan and Chen, Junda and Zheng, Boyuan and Li, Peihang and Lei, Fangyu and Cao, Ruisheng and Fu, Yeqiao and Shin, Dongchan and Shin, Martin and Hu, Jiarui and Wang, Yuyan and Chen, Jixuan and Ye, Yuxiao and Zhang, Danyang and Wang, Yipu and Wang, Heng and Yang, Diyi and Zhong, Victor and Charles, Y. and Yang, Zhilin and Yu, Tao},
  year={2025},
  url={https://opencua.xlang.ai/}
}
```