---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2-VL-2B-Instruct
pipeline_tag: image-text-to-text
---

# InfiGUIAgent-2B-Stage1

This repository contains the **Stage 1 model** from the [InfiGUIAgent](https://arxiv.org/pdf/2501.04575) paper. The model is based on `Qwen2-VL-2B-Instruct` and enhanced with supervised fine-tuning (SFT) on extensive GUI task data to improve its fundamental GUI understanding capabilities.

## Quick Start

### Installation

First, install the required dependencies:

```bash
pip install transformers qwen-vl-utils
```

### GUI Element Localization Example

```python
import cv2
import json
import torch
import requests
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load model and processor
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Reallm-Labs/InfiGUIAgent-2B-Stage1",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("Reallm-Labs/InfiGUIAgent-2B-Stage1")

# Prepare inputs
img_url = "https://raw.githubusercontent.com/Reallm-Labs/InfiGUIAgent/main/images/test_img.png"
prompt_template = """Output the relative coordinates of the icon, widget, or text most closely related to "{instruction}" in this screenshot, in the format of \"{{\"x\": x, \"y\": y}}\", where x and y are in the positive directions of horizontal left and vertical down respectively, with the origin at the top left corner, and the range is 0-1000."""

# Download the example screenshot
response = requests.get(img_url)
with open("test_img.png", "wb") as f:
    f.write(response.content)

# Build the chat message
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "test_img.png"},
        {"type": "text", "text": prompt_template.format(instruction="View detailed storage space usage")},
    ]
}]

# Process inputs and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=128)
output_text = processor.batch_decode(
    [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

# Visualize the result: rescale the 0-1000 relative coordinates to pixels and mark the point
try:
    coords = json.loads(output_text)
    img = cv2.imread("test_img.png")
    height, width = img.shape[:2]
    x = int(coords['x'] * width / 1000)
    y = int(coords['y'] * height / 1000)
    cv2.circle(img, (x, y), 10, (0, 0, 255), -1)
    cv2.putText(img, f"({coords['x']}, {coords['y']})", (x + 10, y - 10),
                cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 0, 255), 2)
    cv2.imwrite("output.png", img)
except Exception:
    print("Error: Failed to parse coordinates or process image")

print("Predicted coordinates:", output_text)
```

## Limitations

This is a **Stage 1 model** focused on establishing fundamental GUI understanding capabilities. It may show suboptimal performance on:

- Complex reasoning tasks
- Multi-step operations
- Abstract instruction following

For more information, please refer to our [repo](https://github.com/Reallm-Labs/InfiGUIAgent).
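As the prompt above specifies, the model answers with coordinates in a 0-1000 relative range with the origin at the top-left corner, so predictions must be rescaled to the screenshot's resolution before they can be used for clicks or drawing. The sketch below simply factors that conversion out of the visualization code into a standalone helper; the name `relative_to_pixel` is ours for illustration and not part of the InfiGUIAgent codebase.

```python
# Minimal sketch (assumed helper, not from the InfiGUIAgent repo):
# map the model's 0-1000 relative coordinates onto pixel coordinates.
from typing import Tuple


def relative_to_pixel(coords: dict, width: int, height: int) -> Tuple[int, int]:
    """Convert {"x": 0-1000, "y": 0-1000} (origin at top-left) to pixel coordinates."""
    x = int(coords["x"] * width / 1000)
    y = int(coords["y"] * height / 1000)
    return x, y


# Example: a prediction of {"x": 500, "y": 250} on a 1080x2400 screenshot
print(relative_to_pixel({"x": 500, "y": 250}, width=1080, height=2400))  # (540, 600)
```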
## Citation

```bibtex
@article{liu2025infiguiagent,
  title={InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection},
  author={Liu, Yuhang and Li, Pengxiang and Wei, Zishu and Xie, Congkai and Hu, Xueyu and Xu, Xinchen and Zhang, Shengyu and Han, Xiaotian and Yang, Hongxia and Wu, Fei},
  journal={arXiv preprint arXiv:2501.04575},
  year={2025}
}
```