microsoft/GUI-Actor-7B-Qwen2-VL

@@ -6,14 +6,109 @@ base_model:
 # GUI-Actor-7B with Qwen2-VL-7B as backbone VLM
-[GUI-Actor-7B-Qwen2-VL]() | [GUI-Actor-2B-Qwen-2-VL]() | [GUI-Actor-7B-Qwen2.5-VL]() | [GUI-Actor-3B-Qwen2.5-VL]()
-This model was introduced in the paper [**GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents** (Wu et al, 2025)](https://arxiv.org/abs/2403.12968).
-It is developed based on [Qwen2-VL-7B-Instruct ](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), augmented by an attention-based action head and finetuned to perform GUI grounding using the dataset [here]() (will update later).
 For more details on model design and evaluation, please check the project page at [GUI-Actor](https://aka.ms/GUI-Actor).
-## Usage
 ## Citation
 ```

 # GUI-Actor-7B with Qwen2-VL-7B as backbone VLM
+- [GUI-Actor-7B-Qwen2-VL](https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2-VL)
+- [GUI-Actor-2B-Qwen2-VL](https://huggingface.co/microsoft/GUI-Actor-2B-Qwen2-VL)
+- [GUI-Actor-7B-Qwen2.5-VL](https://huggingface.co/microsoft/GUI-Actor-7B-Qwen2.5-VL)
+- [GUI-Actor-3B-Qwen2.5-VL](https://huggingface.co/microsoft/GUI-Actor-3B-Qwen2.5-VL)
+- [GUI-Actor-Verifier-2B](https://huggingface.co/microsoft/GUI-Actor-Verifier-2B)
+This model was introduced in the paper [**GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents** (Wu et al., 2025)](https://github.com/microsoft/GUI-Actor).
+It builds on [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), augmented with an attention-based action head and fine-tuned for GUI grounding using the dataset [here (coming soon)]().
 For more details on model design and evaluation, please check the project page at [GUI-Actor](https://aka.ms/GUI-Actor).
+## 📊 Performance Comparison on GUI Grounding Benchmarks
+Table 1. Main results on ScreenSpot-Pro, ScreenSpot, and ScreenSpot-v2 with **Qwen2-VL** as the backbone. † indicates scores obtained from our own evaluation of the official models on Hugging Face.
+| Method           | Backbone VLM | ScreenSpot-Pro | ScreenSpot | ScreenSpot-v2 |
+|------------------|--------------|----------------|------------|----------------|
+| **_72B models:_**
+| AGUVIS-72B       | Qwen2-VL     | -              | 89.2       | -              |
+| UGround-V1-72B   | Qwen2-VL     | 34.5           | **89.4**   | -              |
+| UI-TARS-72B      | Qwen2-VL     | **38.1**       | 88.4       | **90.3**       |
+| **_7B models:_**
+| OS-Atlas-7B      | Qwen2-VL     | 18.9           | 82.5       | 84.1           |
+| AGUVIS-7B        | Qwen2-VL     | 22.9           | 84.4       | 86.0†          |
+| UGround-V1-7B    | Qwen2-VL     | 31.1           | 86.3       | 87.6†          |
+| UI-TARS-7B       | Qwen2-VL     | 35.7           | **89.5**   | **91.6**       |
+| GUI-Actor-7B     | Qwen2-VL     | **40.7**       | 88.3       | 89.5           |
+| GUI-Actor-7B + Verifier     | Qwen2-VL    | 44.2       | 89.7       | 90.9           |
+| **_2B models:_**
+| UGround-V1-2B    | Qwen2-VL     | 26.6           | 77.1       | -              |
+| UI-TARS-2B       | Qwen2-VL     | 27.7           | 82.3       | 84.7           |
+| GUI-Actor-2B     | Qwen2-VL     | **36.7**       | **86.5**   | **88.6**       |
+| GUI-Actor-2B + Verifier     | Qwen2-VL    | 41.8       | 86.9       | 89.3           |
+Table 2. Main results on ScreenSpot-Pro and ScreenSpot-v2 with **Qwen2.5-VL** as the backbone.
+| Method         | Backbone VLM | ScreenSpot-Pro | ScreenSpot-v2 |
+|----------------|---------------|----------------|----------------|
+| **_7B models:_**
+| Qwen2.5-VL-7B  | Qwen2.5-VL    | 27.6           | 88.8           |
+| Jedi-7B        | Qwen2.5-VL    | 39.5           | 91.7           |
+| GUI-Actor-7B   | Qwen2.5-VL    | **44.6**       | **92.1**       |
+| GUI-Actor-7B + Verifier   | Qwen2.5-VL    | 47.7       | 92.5       |
+| **_3B models:_**
+| Qwen2.5-VL-3B  | Qwen2.5-VL    | 25.9           | 80.9           |
+| Jedi-3B        | Qwen2.5-VL    | 36.1           | 88.6           |
+| GUI-Actor-3B   | Qwen2.5-VL    | **42.2**       | **91.0**       |
+| GUI-Actor-3B + Verifier   | Qwen2.5-VL    | 45.9       | 92.4       |
+## 🚀 Usage
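The loading call in this section requests `attn_implementation="flash_attention_2"`, which requires the separate `flash_attn` package. As a minimal sketch, the hypothetical helper `pick_attn_implementation` below (not part of `gui_actor`) falls back to `"sdpa"` when that package is not installed. It assumes the action-head model class accepts the same `attn_implementation` values as stock Qwen2-VL in `transformers`.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is installed, otherwise "sdpa"."""
    # find_spec only checks whether the package can be located; it does not import it.
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The returned string could then be passed as `attn_implementation=...` to `from_pretrained` in place of the hard-coded value.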
+```python
+import torch
+from qwen_vl_utils import process_vision_info
+from datasets import load_dataset
+from transformers import Qwen2VLProcessor
+from gui_actor.constants import chat_template
+from gui_actor.modeling import Qwen2VLForConditionalGenerationWithActionHead
+from gui_actor.inference import inference
+# load model
+model_name_or_path = "microsoft/GUI-Actor-7B-Qwen2-VL"
+data_processor = Qwen2VLProcessor.from_pretrained(model_name_or_path)
+tokenizer = data_processor.tokenizer
+model = Qwen2VLForConditionalGenerationWithActionHead.from_pretrained(
+    model_name_or_path,
+    torch_dtype=torch.bfloat16,
+    device_map="cuda:0",
+    attn_implementation="flash_attention_2"
+).eval()
+# prepare example
+dataset = load_dataset("rootsautomation/ScreenSpot")["test"]
+example = dataset[0]
+conversation = [
+    {
+        "role": "system",
+        "content": [
+            {
+                "type": "text",
+                "text": "You are a GUI agent. You are given a task and a screenshot of the screen. You need to perform a series of pyautogui actions to complete the task.",
+            }
+        ]
+    },
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "image",
+                "image": example["image"],  # PIL.Image.Image or a str file path
+                # alternatively "image_url": "https://xxxxx.png", "https://xxxxx.jpg", "file://xxxxx.png", or "data:image/png;base64,xxxxxxxx" (split on "base64,")
+            },
+            {
+                "type": "text",
+                "text": example["instruction"]
+            },
+        ],
+    },
+]
+# inference (logits_processor is left at its default here, assuming the argument is optional)
+pred = inference(conversation, model, tokenizer, data_processor, use_placeholder=True, topk=3)
+px, py = pred["topk_points"][0]
+print(f"Click point: [{px}, {py}]")
+```
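To act on the predicted point with a tool such as pyautogui, it usually has to be expressed in screen pixels. The sketch below assumes the returned point is normalized to the range [0, 1] relative to the screenshot, which the snippet above does not state, so check this against your own output first. `to_pixel` is a hypothetical helper, not part of `gui_actor`.

```python
from typing import Tuple


def to_pixel(point: Tuple[float, float], width: int, height: int) -> Tuple[int, int]:
    """Map a normalized (x, y) point to integer pixel coordinates, clamped to the image."""
    x, y = point
    # Scale by the image size, round to the nearest pixel, then clamp to valid indices.
    px = min(max(int(round(x * width)), 0), width - 1)
    py = min(max(int(round(y * height)), 0), height - 1)
    return px, py


# Example with a made-up point on a 1920x1080 screenshot.
print(to_pixel((0.25, 0.5), 1920, 1080))
```

With the usage snippet, the call would look like `to_pixel(pred["topk_points"][0], *example["image"].size)`, since a PIL image reports its size as `(width, height)`.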
 ## Citation
 ```