nielsr (HF Staff) committed 2034d04 (verified) · 1 Parent(s): b317a15

Improve model card for WebDancer


This PR significantly improves the model card for `WebDancer-QwQ-32B` by:

* **Adding relevant metadata**: `pipeline_tag: text-generation` and `library_name: transformers`. This ensures the model appears in relevant searches (e.g., `transformers` models, or text-generation models); a quick way to check the metadata once merged is sketched after this list.
* **Updating the content**: The previous text about `WebSailor` has been replaced with accurate information about `WebDancer`, including its key features, approach, and performance.
* **Comprehensive Links**: Direct links to the official paper and the GitHub repository are now included for easy access to all resources.
* **Usage Instructions**: Detailed "Quick Start" instructions from the GitHub repository are provided, along with a concise Python usage example using the `transformers` library, making it easy for users to deploy and interact with the model.
* **Demos**: Visual demos of WebDancer's capabilities on various benchmarks have been added.
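
Once merged, the new metadata can be verified programmatically. The snippet below is illustrative only and assumes the `huggingface_hub` package is installed:

```python
from huggingface_hub import ModelCard

# Load the model card from the Hub and inspect its YAML metadata.
card = ModelCard.load("Alibaba-NLP/WebDancer-QwQ-32B")
print(card.data.pipeline_tag)   # expected: text-generation
print(card.data.library_name)   # expected: transformers
```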

Please review and merge this PR.

Files changed (1): README.md (+164 -5)
README.md CHANGED
@@ -1,12 +1,171 @@
  ---
  license: apache-2.0
  ---
- You can download the model then run the inference scipts in https://github.com/Alibaba-NLP/WebAgent.
-
- - **WebSailor** is a complete post-training methodology designed to teach LLM agents sophisticated reasoning for complex web navigation and information-seeking tasks. It addresses the challenge of extreme uncertainty in vast information landscapes, a capability where previous open-source models lagged behind proprietary systems.
-
- - We classify information-seeking tasks into three difficulty levels, where **Level 3** represents problems with both high uncertainty and a complex, non-linear path to a solution. To generate these challenging tasks, we introduce **SailorFog-QA**, a novel data synthesis pipeline that constructs intricate knowledge graphs and then applies information obfuscation. This process creates questions with high initial uncertainty that demand creative exploration and transcend simple, structured reasoning patterns.
-
- - Our training process begins by generating expert trajectories and then reconstructing the reasoning to create concise, action-oriented supervision signals, avoiding the stylistic and verbosity issues of teacher models. The agent is first given a "cold start" using rejection sampling fine-tuning (RFT) on a small set of high-quality examples to establish a baseline capability. This is followed by an efficient agentic reinforcement learning stage using our **Duplicating Sampling Policy Optimization (DUPO)** algorithm, which refines the agent's exploratory strategies.
-
- - WebSailor establishes a **new state-of-the-art for open-source agents**, achieving outstanding results on difficult benchmarks like BrowseComp-en and BrowseComp-zh. Notably, our smaller models like WebSailor-7B outperform agents built on much larger backbones, highlighting the efficacy of our training paradigm. Ultimately, WebSailor closes the performance gap to proprietary systems, achieving results on par with agents like Doubao-Search.
  ---
  license: apache-2.0
+ pipeline_tag: text-generation
+ library_name: transformers
  ---

+ # WebDancer: Towards Autonomous Information Seeking Agency

+ <div align="center">

+ [📄Paper](https://huggingface.co/papers/2505.22648) | [💻Code](https://github.com/Alibaba-NLP/WebAgent)

+ </div>
+
+ WebDancer is a native agentic search and reasoning model built on the ReAct framework, aiming towards autonomous information seeking agency and _Deep Research_-like capabilities. It was presented in the paper [WebDancer: Towards Autonomous Information Seeking Agency](https://huggingface.co/papers/2505.22648).
+
+ Addressing intricate real-world problems necessitates in-depth information seeking and multi-step reasoning. WebDancer presents a cohesive paradigm for building end-to-end agentic information seeking agents from a data-centric and training-stage perspective. The approach consists of four key stages: (1) browsing data construction, (2) trajectory sampling, (3) supervised fine-tuning for an effective cold start, and (4) reinforcement learning for enhanced generalisation. Empirical evaluations on the challenging information seeking benchmarks GAIA and WebWalkerQA demonstrate the strong performance of WebDancer and highlight the efficacy of this training paradigm.
+
+ <div align="center">
+ <p align="center">
+ <img src="https://github.com/user-attachments/assets/cf2ee020-5e15-4087-9a7e-75cc43662494" width="800px" />
+ </p>
+ </div>
+
+ ## Key Features
+
+ * **Native agentic search and reasoning model**: Uses the ReAct framework for autonomous information seeking, inspired by "Deep Research"-like systems (see the illustrative sketch after this list).
+ * **Four-stage training paradigm**: Browsing data construction, trajectory sampling, supervised fine-tuning for an effective cold start, and reinforcement learning for improved generalization, enabling the agent to autonomously acquire search and reasoning skills.
+ * **Data-centric approach**: Combines trajectory-level supervised fine-tuning with reinforcement learning (DAPO) in a scalable pipeline for training agentic systems.
+ * **Strong performance**: WebDancer achieves a Pass@3 score of 64.1% on GAIA and 62.0% on WebWalkerQA.
+
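+ The ReAct pattern referenced above interleaves model "thoughts" and tool "actions" with environment "observations" until a final answer is produced. The sketch below is purely illustrative of that loop and is **not** the project's agent implementation; `call_llm`, `search`, and `visit` are hypothetical placeholders for the deployed model and the web tools.
+
+ ```python
+ def call_llm(trajectory: str) -> str:
+     """Placeholder: query the deployed WebDancer model with the trajectory so far."""
+     raise NotImplementedError
+
+ def search(query: str) -> str:
+     """Placeholder web-search tool."""
+     raise NotImplementedError
+
+ def visit(url: str) -> str:
+     """Placeholder page-visit tool returning page text."""
+     raise NotImplementedError
+
+ def react_loop(question: str, max_steps: int = 10) -> str:
+     trajectory = f"Question: {question}\n"
+     for _ in range(max_steps):
+         step = call_llm(trajectory)              # model emits a thought plus an action
+         trajectory += step + "\n"
+         if step.startswith("Final Answer:"):     # terminate when the model answers
+             return step.removeprefix("Final Answer:").strip()
+         if step.startswith("Action: search"):
+             observation = search(step.split("search", 1)[1].strip())
+         elif step.startswith("Action: visit"):
+             observation = visit(step.split("visit", 1)[1].strip())
+         else:
+             observation = "Unrecognized action."
+         trajectory += f"Observation: {observation}\n"  # feed the observation back
+     return "No answer within the step budget."
+ ```
+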
+ ## Quick Start
+
+ Run the following commands from the `WebDancer` folder of the [WebAgent repository](https://github.com/Alibaba-NLP/WebAgent).
+
+ ### Step 0: Set Up the Environment
+
+ ```bash
+ conda create -n webdancer python=3.12
+ conda activate webdancer
+ pip install -r requirements.txt
+ ```
+
+ `WebDancer-QwQ-32B` is built on the text-only `QwQ-32B` backbone, so no vision-specific dependencies are required. For the standalone `transformers` example below, make sure a recent version of `transformers` is installed:
+ ```bash
+ pip install transformers
+ ```
+
+ ### Step 1: Deploy the Model
+
+ Download the WebDancer model from [🤗 HuggingFace](https://huggingface.co/Alibaba-NLP/WebDancer-QwQ-32B) and deploy it using the provided scripts with [sglang](https://github.com/sgl-project/sglang).
+
+ ```bash
+ cd scripts
+ bash deploy_model.sh WebDancer_PATH
+ ```
+
+ > **Note:** Replace `WebDancer_PATH` with the actual path to the downloaded model.
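+
+ As a quick smoke test of the deployment, you can send a chat request to the server's OpenAI-compatible endpoint. This is an illustrative sketch only: it assumes `deploy_model.sh` launches an sglang server on its default port (30000); adjust `base_url` and the `model` field to match your deployment.
+
+ ```python
+ from openai import OpenAI  # pip install openai
+
+ # Point the client at the locally deployed sglang server (assumed port 30000).
+ client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
+
+ response = client.chat.completions.create(
+     model="WebDancer-QwQ-32B",  # placeholder; use the name/path the server was launched with
+     messages=[{"role": "user", "content": "Reply with a single word: ready."}],
+     max_tokens=16,
+ )
+ print(response.choices[0].message.content)
+ ```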
+
+ ### Step 2: Run the Demo
+
+ Edit the following keys in [`WebDancer/scripts/run_demo.sh`](https://github.com/Alibaba-NLP/WebAgent/blob/main/WebDancer/scripts/run_demo.sh):
+
+ - `GOOGLE_SEARCH_KEY`: get one from [serper](https://serper.dev/).
+ - `JINA_API_KEY`: get one from [jina](https://jina.ai/api-dashboard/).
+ - `DASHSCOPE_API_KEY`: get one from [dashscope](https://dashscope.aliyun.com/).
+
+ Then launch the Gradio demo to interact with the WebDancer model:
+
+ ```bash
+ cd scripts
+ bash run_demo.sh
+ ```
+
+ ### Inference Example
+
+ WebDancer's full agentic behaviour (ReAct-style search and browsing) is driven by the demo scripts above. For plain, single-turn generation, the checkpoint can also be loaded as a standard causal language model with `transformers`. The snippet below is a minimal sketch of that usage; the question is illustrative, and the tool-using loop is not included here.
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_name = "Alibaba-NLP/WebDancer-QwQ-32B"
+
+ # Load the tokenizer and model (dtype and device placement chosen automatically).
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name, torch_dtype="auto", device_map="auto"
+ )
+
+ # Build a chat-formatted prompt with the model's chat template.
+ messages = [
+     {"role": "user", "content": "Briefly explain what an autonomous information-seeking agent does."}
+ ]
+ text = tokenizer.apply_chat_template(
+     messages, tokenize=False, add_generation_prompt=True
+ )
+ inputs = tokenizer([text], return_tensors="pt").to(model.device)
+
+ # Generate and decode only the newly produced tokens.
+ generated_ids = model.generate(**inputs, max_new_tokens=512)
+ new_tokens = generated_ids[0][inputs.input_ids.shape[1]:]
+ print(tokenizer.decode(new_tokens, skip_special_tokens=True))
+ ```
+
+ ## Demos
+
+ WebDancer can execute long-horizon tasks with multiple steps and complex reasoning, such as web traversal, information seeking, and question answering.
+
+ <div align="center">
+ <h3>WebWalkerQA</h3>
+ <video src="https://github.com/user-attachments/assets/0bbaf55b-897e-4c57-967d-a6e8bbd2167e" />
+ </div>
+
+ <div align="center">
+ <h3>GAIA</h3>
+ <video src="https://github.com/user-attachments/assets/935c668e-6169-4712-9c04-ac80f0531872" />
+ </div>
+
+ <div align="center">
+ <h3>Daily Use</h3>
+ <video src="https://github.com/user-attachments/assets/d1d5b533-4009-478b-bd87-96b86389327d" />
+ </div>
+
+ ## Citation
+
+ If this work is helpful, please kindly cite as:
+
+ ```bibtex
+ @misc{wu2025webdancer,
+       title={WebDancer: Towards Autonomous Information Seeking Agency},
+       author={Jialong Wu and Baixuan Li and Runnan Fang and Wenbiao Yin and Liwen Zhang and Zhengwei Tao and Dingchu Zhang and Zekun Xi and Yong Jiang and Pengjun Xie and Fei Huang and Jingren Zhou},
+       year={2025},
+       eprint={2505.22648},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL},
+       url={https://arxiv.org/abs/2505.22648},
+ }
+ ```