InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection
This repository contains the Stage 1 model from the InfiGUIAgent paper. The model is based on Qwen2-VL-2B-Instruct
and enhanced with Supervised Fine-Tuning (SFT) on extensive GUI task data to improve fundamental GUI understanding capabilities.
First install required dependencies:
pip install transformers qwen-vl-utils torch opencv-python pillow requests
import cv2
import json
import torch
import requests
from PIL import Image
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
# Load model and processor
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Reallm-Labs/InfiGUIAgent-2B-Stage1",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Reallm-Labs/InfiGUIAgent-2B-Stage1")
# Prepare inputs
img_url = "https://raw.githubusercontent.com/Reallm-Labs/InfiGUIAgent/main/images/test_img.png"
prompt_template = """Output the relative coordinates of the icon, widget, or text most closely related to "{instruction}" in this screenshot, in the format of \"{{\"x\": x, \"y\": y}}\", where x and y are in the positive directions of horizontal left and vertical down respectively, with the origin at the top left corner, and the range is 0-1000."""
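# The model is expected to reply with a JSON object such as {"x": 500, "y": 300},
# where both values are relative coordinates in the 0-1000 range described above.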
# Download image
response = requests.get(img_url)
with open("test_img.png", "wb") as f:
    f.write(response.content)
# Build message template
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "test_img.png"},
        {"type": "text", "text": prompt_template.format(instruction="View detailed storage space usage")},
    ]
}]
# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=128)
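# Decode only the newly generated tokens by slicing off the prompt portion of each sequence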
output_text = processor.batch_decode(
    [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]
# Visualize results
try:
    coords = json.loads(output_text)
    img = cv2.imread("test_img.png")
    height, width = img.shape[:2]
    # Convert the 0-1000 relative coordinates to pixel positions
    x = int(coords['x'] * width / 1000)
    y = int(coords['y'] * height / 1000)
    cv2.circle(img, (x, y), 10, (0, 0, 255), -1)
    cv2.putText(img, f"({coords['x']}, {coords['y']})", (x + 10, y - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 0, 255), 2)
    cv2.imwrite("output.png", img)
except Exception:
    print("Error: Failed to parse coordinates or process image")
print("Predicted coordinates:", output_text)
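Beyond visualization, an agent that acts on the prediction still has to map the model's relative output onto the actual screenshot resolution. A minimal sketch of such a conversion is shown below; the helper name to_pixel_coords is illustrative and not part of the released code, and it reuses the PIL import from the example above.
def to_pixel_coords(coords, image_path):
    """Map the model's 0-1000 relative coordinates to absolute pixel positions."""
    img = Image.open(image_path)
    width, height = img.size
    return int(coords["x"] * width / 1000), int(coords["y"] * height / 1000)

# Example (assuming the model returned {"x": 500, "y": 300}):
# x_px, y_px = to_pixel_coords({"x": 500, "y": 300}, "test_img.png")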
This is a Stage 1 model focused on establishing fundamental GUI understanding capabilities. It has not undergone the Stage 2 training that adds native reasoning and reflection, so it may demonstrate suboptimal performance on complex, multi-step agentic tasks.
For more information, please refer to our repo.
@article{liu2025infiguiagent,
  title={InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection},
  author={Liu, Yuhang and Li, Pengxiang and Wei, Zishu and Xie, Congkai and Hu, Xueyu and Xu, Xinchen and Zhang, Shengyu and Han, Xiaotian and Yang, Hongxia and Wu, Fei},
  journal={arXiv preprint arXiv:2501.04575},
  year={2025}
}