---
license: apache-2.0
datasets:
- yiye2023/GUIChat
- yiye2023/GUIEnv
- yiye2023/GUIAct
language:
- en
tags:
- GUI
- Agent
- minicpm
pipeline_tag: visual-question-answering
---
# 📱🖥️ GUIDance: Vision Language Models as Your Screen Guide
Introducing MiniCPM-GUIDance (referred to as MiniCPM-GUI), a model trained on [GUICourse](https://arxiv.org/pdf/2406.11317)!
![image/png](https://cdn-uploads.huggingface.co/production/uploads/63f706dfe94ed998c463ed66/5d4rJFWjKn-c-iOXJKYXF.png)
# News
- 2024-07-09: We released MiniCPM-GUIDance on [Hugging Face](https://huggingface.co/RhapsodyAI/minicpm-guidance).
- 2024-06-07: We released the datasets, loading code, and evaluation code on [GitHub](https://github.com/yiye3/GUICourse).
- 2024-03-09: We open-sourced the GUICourse datasets: [GUIAct](https://huggingface.co/datasets/yiye2023/GUIAct), [GUIChat](https://huggingface.co/datasets/yiye2023/GUIChat), and [GUIEnv](https://huggingface.co/datasets/yiye2023/GUIEnv).
# ToDo
- [ ] Batch inference
# CookBook
- Prompt for Actions
```
Your Task
{Task}
Generate next actions to do this task.
```
```
Actions History
{hover, select_text, click, scroll}
Information
{Information about the web}
Your Task
{TASK}
Generate next actions to do this task.
```
- Prompt for Chat, with or without Grounding (see the template-filling sketch after this list)
```
{Query}
OR
{Query} Grounding all objects in the image.
```
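The snippet below is a minimal sketch of filling these templates in Python; the `build_action_prompt` and `build_chat_prompt` helpers are illustrative, not part of the released code.
```python
def build_action_prompt(task: str, actions_history: str = "", information: str = "") -> str:
    """Compose the action-generation prompt from the templates above."""
    parts = []
    if actions_history:
        parts.append(f"Actions History\n{actions_history}")
    if information:
        parts.append(f"Information\n{information}")
    parts.append(f"Your Task\n{task}")
    parts.append("Generate next actions to do this task.")
    return "\n".join(parts)

def build_chat_prompt(query: str, grounding: bool = False) -> str:
    """Compose the chat prompt, optionally asking the model to ground objects."""
    return f"{query} Grounding all objects in the image." if grounding else query

print(build_action_prompt("Search the hub for MiniCPM", actions_history="click, scroll"))
print(build_chat_prompt("How could I use this model from this web?", grounding=True))
```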
# Example
Install all dependencies with pip:
```
Pillow==10.1.0
timm==0.9.10
torch==2.1.2
torchvision==0.16.2
transformers==4.40.0
sentencepiece==0.1.99
flash_attn==2.4.2
```
First, clone this Hugging Face repo with git, or download it with huggingface-cli.
```
git lfs install
git clone https://huggingface.co/RhapsodyAI/minicpm-guidance
```
or
```
huggingface-cli download RhapsodyAI/minicpm-guidance
```
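Alternatively, if you prefer to download from Python, `huggingface_hub.snapshot_download` fetches the same files (the `local_dir` path here is just an example):
```python
from huggingface_hub import snapshot_download

# Download the full repo into a local directory (example path).
snapshot_download(repo_id="RhapsodyAI/minicpm-guidance", local_dir="./minicpm-guidance")
```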
Example case image: ![case](https://cdn-uploads.huggingface.co/production/uploads/63f706dfe94ed998c463ed66/KJFeGDBj3SOgQqGAU7lU5.png)
```python
from transformers import AutoProcessor, AutoTokenizer, AutoModel
from PIL import Image
import torch
MODEL_PATH = '/path/to/minicpm-guidance'
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)
# Use attn_implementation="eager" if flash-attn is not installed:
# model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True, attn_implementation="eager", torch_dtype=torch.bfloat16)
model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True, torch_dtype=torch.bfloat16)
model.cuda().eval()
# Currently only batch size 1 is supported
example_messages = [
    [
        {
            "role": "user",
            "content": Image.open("./case.png").convert('RGB')
        },
        {
            "role": "user",
            "content": "How could I use this model from this web? Grounding all objects in the image."
        }
    ]
]

inputs = processor(example_messages, padding_side="right")

# Move every tensor returned by the processor onto the GPU.
for key in inputs:
    if isinstance(inputs[key], list):
        for i in range(len(inputs[key])):
            if isinstance(inputs[key][i], torch.Tensor):
                inputs[key][i] = inputs[key][i].cuda()
    if isinstance(inputs[key], torch.Tensor):
        inputs[key] = inputs[key].cuda()

with torch.no_grad():
    outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False, num_beams=3)

text = tokenizer.batch_decode(outputs.cpu().tolist())
for i in text:
    print('-' * 20)
    print(i)
'''
To use the model from this webpage, you would typically follow these steps:
1. **Access the Model**: Navigate to the section of the webpage where the model is described. In this case, it's under the heading "Use this model"<box> 864 238 964 256</box> .
2. **Download the Model**: There should be a link or button that allows you to download the model. Look for a button or link that says "Download" or something similar.
3. **Install the Model**: Once you've downloaded the model, you'll need to install it on your system. This typically involves extracting the downloaded file and placing it in a directory where the model can be found.
4. **Use the Model**: After installation, you can use the model in your application or project. This might involve importing the model into your programming environment and using it to perform specific tasks.
The exact steps would depend on the specifics of the model and the environment in which you're using it, but these are the general steps you would follow to use the model from this webpage.</s>
'''
```
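The answer above embeds grounding results as `<box> x1 y1 x2 y2 </box>` spans. Below is a minimal sketch for converting those spans into pixel rectangles, assuming the coordinates are normalized to a 0-1000 range (an assumption to verify against your own outputs); `parse_boxes` is an illustrative helper, not part of the released code.
```python
import re
from PIL import Image

BOX_PATTERN = re.compile(r"<box>\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*</box>")

def parse_boxes(answer: str, image: Image.Image, norm: int = 1000):
    """Extract <box> x1 y1 x2 y2 </box> spans and rescale them to pixel coordinates.

    Assumes coordinates are normalized to `norm` (an assumption, not documented
    in this card). Returns a list of (x1, y1, x2, y2) tuples in pixels.
    """
    w, h = image.size
    boxes = []
    for match in BOX_PATTERN.finditer(answer):
        x1, y1, x2, y2 = (int(v) for v in match.groups())
        boxes.append((x1 * w // norm, y1 * h // norm, x2 * w // norm, y2 * h // norm))
    return boxes

# Example: a fragment of the model answer shown above.
answer = 'it\'s under the heading "Use this model"<box> 864 238 964 256</box> .'
image = Image.open("./case.png").convert("RGB")
for box in parse_boxes(answer, image):
    print(box)
```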
# Contact
[Junbo Cui](mailto:[email protected])
# Citation
If you find our work useful, please consider citing us:
```
@misc{chen2024guicourse,
title={GUICourse: From General Vision Language Models to Versatile GUI Agents},
author={Wentong Chen and Junbo Cui and Jinyi Hu and Yujia Qin and Junjie Fang and Yue Zhao and Chongyi Wang and Jun Liu and Guirong Chen and Yupeng Huo and Yuan Yao and Yankai Lin and Zhiyuan Liu and Maosong Sun},
year={2024},
journal={arXiv preprint arXiv:2406.11317},
}
```