Improve model card for WebDancer
#2 by nielsr (HF Staff) - opened

README.md CHANGED
```diff
@@ -1,12 +1,171 @@
-You can download the model then run the inference scipts in https://github.com/Alibaba-NLP/WebAgent.
```

---
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---

# WebDancer: Towards Autonomous Information Seeking Agency

<div align="center">

[📄Paper](https://huggingface.co/papers/2505.22648) | [🏠Homepage](https://osatlas.github.io/) | [💻Code](https://github.com/Alibaba-NLP/WebAgent)

</div>

WebDancer is a native agentic search reasoning model using the ReAct framework, aiming towards autonomous information seeking agency and _Deep Research_-like capabilities. This model was presented in the paper [WebDancer: Towards Autonomous Information Seeking Agency](https://huggingface.co/papers/2505.22648).

Addressing intricate real-world problems necessitates in-depth information seeking and multi-step reasoning. WebDancer presents a cohesive paradigm for building end-to-end agentic information-seeking agents from a data-centric and training-stage perspective. The approach consists of four key stages: (1) browsing data construction, (2) trajectory sampling, (3) supervised fine-tuning for an effective cold start, and (4) reinforcement learning for enhanced generalisation. Empirical evaluations on the challenging information-seeking benchmarks GAIA and WebWalkerQA demonstrate the strong performance of WebDancer and highlight the efficacy of this training paradigm.

<div align="center">
<p align="center">
  <img src="https://github.com/user-attachments/assets/cf2ee020-5e15-4087-9a7e-75cc43662494" width="800px" />
</p>
</div>

## Key Features

* **Native agentic search reasoning model**: Utilizes the ReAct framework for autonomous information seeking, inspired by "Deep Research"-like models (a minimal loop sketch follows this list).
* **Four-stage training paradigm**: Comprises browsing data construction, trajectory sampling, supervised fine-tuning for an effective cold start, and reinforcement learning for improved generalization, enabling the agent to autonomously acquire search and reasoning skills.
* **Data-centric approach**: Integrates trajectory-level supervised fine-tuning and reinforcement learning (DAPO) to build a scalable pipeline for training agentic systems via SFT or RL.
* **Strong performance**: WebDancer achieves a Pass@3 score of 64.1% on GAIA and 62.0% on WebWalkerQA.
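
The ReAct rollout behind these features is a simple loop of model output followed by tool feedback. The sketch below is illustrative only and is not the repository's implementation: it assumes the model is served behind an OpenAI-compatible endpoint (as set up in Step 1 of the Quick Start) and uses hypothetical `web_search` / `visit_page` tool stubs together with a hypothetical `Action: tool[arg]` / `Final Answer:` output convention.

```python
# Minimal ReAct-style loop (illustrative sketch, NOT the official WebDancer agent code).
# Assumptions: an OpenAI-compatible endpoint on localhost:30000 (see Step 1) and
# hypothetical tool stubs standing in for real search/browsing tools.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

def web_search(query: str) -> str:
    """Hypothetical search tool; a real agent would call a search API (e.g. Serper)."""
    return f"[search results for: {query}]"

def visit_page(url: str) -> str:
    """Hypothetical browsing tool; a real agent would fetch and summarize the page."""
    return f"[content of {url}]"

TOOLS = {"web_search": web_search, "visit_page": visit_page}

def react_rollout(question: str, max_steps: int = 8) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = client.chat.completions.create(
            model="WebDancer-QwQ-32B", messages=messages
        ).choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        if "Final Answer:" in reply:  # the model decided it has enough information
            return reply.split("Final Answer:")[-1].strip()
        # Otherwise, look for an "Action: tool[argument]" line and feed back an observation.
        for name, tool in TOOLS.items():
            if f"Action: {name}[" in reply:
                arg = reply.split(f"Action: {name}[")[1].split("]")[0]
                messages.append({"role": "user", "content": f"Observation: {tool(arg)}"})
                break
    return "No final answer within the step budget."
```

The actual tool set, prompts, and stopping conditions are defined by the scripts in the `WebDancer` folder of the WebAgent repository.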

## Quick Start

The following commands must be run from the `WebDancer` folder of the [WebAgent repository](https://github.com/Alibaba-NLP/WebAgent).

### Step 0: Set Up the Environment

```bash
conda create -n webdancer python=3.12
conda activate webdancer
pip install -r requirements.txt
```
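
Optionally, run a quick sanity check before moving on. This snippet is a sketch rather than part of the official setup, and it assumes `requirements.txt` installs `torch` and `transformers`:

```python
# Optional environment sanity check (sketch): confirm key packages import and a GPU is visible.
import torch
import transformers

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```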

For `WebDancer-QwQ-32B` (which is based on Qwen2-VL), ensure that the necessary dependencies are installed:

```bash
pip install transformers
pip install qwen-vl-utils
```

### Step 1: Deploy the Model

Download the WebDancer model from [🤗 HuggingFace](https://huggingface.co/Alibaba-NLP/WebDancer-QwQ-32B) and deploy it using the provided scripts with [sglang](https://github.com/sgl-project/sglang).

```bash
cd scripts
bash deploy_model.sh WebDancer_PATH
```

> **Note:** Replace `WebDancer_PATH` with the actual path to the downloaded model.
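
Once the server is up, a quick test request confirms the model is reachable. This snippet is a sketch, not part of the official scripts: it assumes sglang's default OpenAI-compatible endpoint on port 30000 and a served model name of `WebDancer-QwQ-32B`, so adjust both to match your `deploy_model.sh` configuration.

```python
# Smoke test for the deployed model (sketch; assumes sglang's default OpenAI-compatible
# endpoint on port 30000 and the served model name below -- adjust to your deployment).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="WebDancer-QwQ-32B",
    messages=[{"role": "user", "content": "Briefly introduce yourself."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```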

### Step 2: Run the Demo

Edit the following keys in [`WebDancer/scripts/run_demo.sh`](https://github.com/Alibaba-NLP/WebAgent/blob/main/WebDancer/scripts/run_demo.sh):

- `GOOGLE_SEARCH_KEY`: obtain it from [Serper](https://serper.dev/).
- `JINA_API_KEY`: obtain it from [Jina](https://jina.ai/api-dashboard/).
- `DASHSCOPE_API_KEY`: obtain it from [DashScope](https://dashscope.aliyun.com/).

Then launch the demo with Gradio to interact with the WebDancer model:

```bash
cd scripts
bash run_demo.sh
```

### Inference Example

WebDancer models accept images of any size as input. The model outputs are normalized to relative coordinates in the 0-1000 range (either a center point or a bounding box defined by its top-left and bottom-right corners). For visualization, convert these relative coordinates back to the original image dimensions; a small conversion helper follows the example below.

Here's a minimal Python inference example for `WebDancer-QwQ-32B` (based on Qwen2-VL):

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image
import torch

# Load the model and processor
# Default: load the model on the available device(s)
model_name = "Alibaba-NLP/WebDancer-QwQ-32B"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_name)

# Define the task with an image and text instruction
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                # Replace with a local image path or a URL
                "image": "./examples/images/web_6f93090a-81f6-489e-bb35-1a2838b18c01.png",
            },
            {"type": "text", "text": "In this UI screenshot, what is the position of the element corresponding to the command \"switch language of current page\" (with bbox)?"},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)  # video_inputs will be empty for this task
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")  # Ensure inputs are on the correct device

# Inference: generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)

generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=False, clean_up_tokenization_spaces=False
)
print(output_text[0])
# Expected output: <|object_ref_start|>language switch<|object_ref_end|><|box_start|>(576,12),(592,42)<|box_end|><|im_end|>
```
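
To map the 0-1000 relative coordinates back to pixel positions, scale them by the original image size. The helper below is a minimal sketch (the `box_to_pixels` name is hypothetical, and the parsing follows the `(x1,y1),(x2,y2)` format of the expected output above; adjust it if your output format differs):

```python
import re
from PIL import Image

def box_to_pixels(box_text: str, image_path: str) -> tuple:
    """Convert a box like "(576,12),(592,42)" (0-1000 relative scale) into
    pixel coordinates of the original image. Sketch only."""
    width, height = Image.open(image_path).size
    x1, y1, x2, y2 = map(int, re.findall(r"-?\d+", box_text)[:4])
    to_px = lambda v, size: round(v / 1000 * size)
    return (to_px(x1, width), to_px(y1, height), to_px(x2, width), to_px(y2, height))

# Example:
# box_to_pixels("(576,12),(592,42)", "./examples/images/web_6f93090a-81f6-489e-bb35-1a2838b18c01.png")
```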

## Demos

WebDancer can execute long-horizon tasks with multiple steps and complex reasoning, such as web traversal, information seeking, and question answering.

<div align="center">
<h3>WebWalkerQA</h3>
<video src="https://github.com/user-attachments/assets/0bbaf55b-897e-4c57-967d-a6e8bbd2167e" />
</div>

<div align="center">
<h3>GAIA</h3>
<video src="https://github.com/user-attachments/assets/935c668e-6169-4712-9c04-ac80f0531872" />
</div>

<div align="center">
<h3>Daily Use</h3>
<video src="https://github.com/user-attachments/assets/d1d5b533-4009-478b-bd87-96b86389327d" />
</div>

## Citation

If this work is helpful, please cite it as:

```bibtex
@misc{wu2025webdancer,
      title={WebDancer: Towards Autonomous Information Seeking Agency},
      author={Jialong Wu and Baixuan Li and Runnan Fang and Wenbiao Yin and Liwen Zhang and Zhengwei Tao and Dingchu Zhang and Zekun Xi and Yong Jiang and Pengjun Xie and Fei Huang and Jingren Zhou},
      year={2025},
      eprint={2505.22648},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.22648},
}
```