---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2-VL-2B-Instruct
pipeline_tag: image-text-to-text
---

# InfiGUIAgent-2B-Stage1

This repository contains the **Stage 1 model** from the [InfiGUIAgent](https://arxiv.org/pdf/2501.04575) paper. The model is based on `Qwen2-VL-2B-Instruct` and enhanced with supervised fine-tuning (SFT) on extensive GUI task data to improve its fundamental GUI understanding capabilities.

## Quick Start

### Installation

First, install the required dependencies:

```bash
pip install transformers qwen-vl-utils
```

### GUI Element Localization Example

```python
import cv2
import json
import torch
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load model and processor
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Reallm-Labs/InfiGUIAgent-2B-Stage1",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Reallm-Labs/InfiGUIAgent-2B-Stage1")

# Prepare inputs
img_url = "https://raw.githubusercontent.com/Reallm-Labs/InfiGUIAgent/main/images/test_img.png"
prompt_template = """Output the relative coordinates of the icon, widget, or text most closely related to "{instruction}" in this screenshot, in the format of \"{{\"x\": x, \"y\": y}}\", where x and y are in the positive directions of horizontal left and vertical down respectively, with the origin at the top left corner, and the range is 0-1000."""

# Download the example screenshot
response = requests.get(img_url)
with open("test_img.png", "wb") as f:
    f.write(response.content)

# Build the chat message
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "test_img.png"},
        {"type": "text", "text": prompt_template.format(instruction="View detailed storage space usage")},
    ]
}]

# Process inputs and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=128)
output_text = processor.batch_decode(
    [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

# Visualize the result: rescale the 0-1000 relative coordinates to pixels and mark the point
try:
    coords = json.loads(output_text)
    img = cv2.imread("test_img.png")
    height, width = img.shape[:2]
    x = int(coords['x'] * width / 1000)
    y = int(coords['y'] * height / 1000)
    cv2.circle(img, (x, y), 10, (0, 0, 255), -1)
    cv2.putText(img, f"({coords['x']}, {coords['y']})", (x + 10, y - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 0, 255), 2)
    cv2.imwrite("output.png", img)
except Exception:
    print("Error: Failed to parse coordinates or process image")

print("Predicted coordinates:", output_text)
```

## Limitations

This is a **Stage 1 model** focused on establishing fundamental GUI understanding capabilities. It may show suboptimal performance on:

- Complex reasoning tasks
- Multi-step operations
- Abstract instruction following

For more information, please refer to our [repo](https://github.com/Reallm-Labs/InfiGUIAgent).
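As the prompt above specifies, the model answers with coordinates in a 0-1000 relative range with the origin at the top-left corner, so predictions must be rescaled to the screenshot's resolution before they can be used for clicks or drawing. The sketch below simply factors that conversion out of the visualization code into a standalone helper; the name `relative_to_pixel` is ours for illustration and not part of the InfiGUIAgent codebase.

```python
# Minimal sketch (assumed helper, not from the InfiGUIAgent repo):
# map the model's 0-1000 relative coordinates onto pixel coordinates.
from typing import Tuple


def relative_to_pixel(coords: dict, width: int, height: int) -> Tuple[int, int]:
    """Convert {"x": 0-1000, "y": 0-1000} (origin at top-left) to pixel coordinates."""
    x = int(coords["x"] * width / 1000)
    y = int(coords["y"] * height / 1000)
    return x, y


# Example: a prediction of {"x": 500, "y": 250} on a 1080x2400 screenshot
print(relative_to_pixel({"x": 500, "y": 250}, width=1080, height=2400))  # (540, 600)
```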
## Citation

```bibtex
@article{liu2025infiguiagent,
  title={InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection},
  author={Liu, Yuhang and Li, Pengxiang and Wei, Zishu and Xie, Congkai and Hu, Xueyu and Xu, Xinchen and Zhang, Shengyu and Han, Xiaotian and Yang, Hongxia and Wu, Fei},
  journal={arXiv preprint arXiv:2501.04575},
  year={2025}
}
```