Video-to-Video
Diffusers
Safetensors
i2v
nielsr HF Staff committed on
Commit 0e7db38 · verified · 1 Parent(s): 39f6f40

Improve model card: Add pipeline tag, library name, and enrich content with abstract, usage, and results


This PR enhances the model card for the ROSE model by:
- Adding `pipeline_tag: video-to-video` to improve discoverability on the Hub.
- Adding `library_name: diffusers`, as the model is built on Diffusion Transformers and uses `diffusers` components, enabling automated usage snippets.
- Updating the paper link to the official Hugging Face paper page: https://huggingface.co/papers/2508.18633.
- Including the paper's abstract to provide a comprehensive overview of the model's capabilities and methodology.
- Adding explicit links to the project page, GitHub repository, and Hugging Face Space for easy access.
- Incorporating detailed "Dependencies and Installation" and "Usage (Quick Test)" sections with code snippets directly from the official GitHub README.
- Adding a "Results" section with visual examples and an "Overview" diagram from the GitHub README (with updated image links to raw GitHub URLs).
- Including the "Citation" and "Acknowledgement" sections for proper attribution.

These changes significantly improve the model card's completeness and user-friendliness.

Files changed (1)
  1. README.md +241 -2
README.md CHANGED
@@ -1,6 +1,245 @@
  ---
- license: apache-2.0
  base_model:
  - alibaba-pai/Wan2.1-Fun-1.3B-InP
  ---
- This repository contains the finetuned WanTransformer3D weights for [ROSE](https://rose2025-inpaint.github.io/). For instructions about how to use our model, please refer to our [Github repository](https://github.com/Kunbyte-AI/ROSE) and [HuggingFace Space](https://huggingface.co/spaces/Kunbyte/ROSE).
  ---
  base_model:
  - alibaba-pai/Wan2.1-Fun-1.3B-InP
+ license: apache-2.0
+ pipeline_tag: video-to-video
+ library_name: diffusers
  ---
+
+ # ROSE: Remove Objects with Side Effects in Videos
+
+ This repository contains the finetuned WanTransformer3D weights for **ROSE**, a model for removing objects with side effects in videos.
+
+ [📚 Paper](https://huggingface.co/papers/2508.18633) - [🌐 Project Page](https://rose2025-inpaint.github.io/) - [💻 Code](https://github.com/Kunbyte-AI/ROSE) - [🤗 Demo](https://huggingface.co/spaces/Kunbyte/ROSE)
+
+ ## Abstract
+ Video object removal has achieved advanced performance due to the recent success of video generative models. However, when addressing the side effects of objects, e.g., their shadows and reflections, existing works struggle to eliminate these effects due to the scarcity of paired video data for supervision. This paper presents ROSE, termed Remove Objects with Side Effects, a framework that systematically studies the object's effects on the environment, which can be categorized into five common cases: shadows, reflections, light, translucency, and mirrors. Given the challenges of curating paired videos exhibiting the aforementioned effects, we leverage a 3D rendering engine for synthetic data generation. We carefully construct a fully automatic pipeline for data preparation, which simulates a large-scale paired dataset with diverse scenes, objects, shooting angles, and camera trajectories. ROSE is implemented as a video inpainting model built on a diffusion transformer. To localize all object-correlated areas, the entire video is fed into the model for reference-based erasing. Moreover, additional supervision is introduced to explicitly predict the areas affected by side effects, which can be revealed through the differential mask between the paired videos. To fully investigate model performance on various kinds of side effect removal, we present a new benchmark, dubbed ROSE-Bench, incorporating both common scenarios and the five special side effects for comprehensive evaluation. Experimental results demonstrate that ROSE achieves superior performance compared to existing video object erasing models and generalizes well to real-world video scenarios.
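The differential-mask supervision described above can be sketched in a few lines: given a paired video with and without the object, the mask is simply where the two differ. This is an illustrative NumPy sketch under assumed conventions (float frames in [0, 1], a hand-picked threshold); it is not the authors' code.

```python
import numpy as np

def differential_mask(with_obj, without_obj, threshold=0.05):
    """Per-pixel mask of regions that differ between a paired video
    with the object (and its side effects) and the same video without it.

    Both arrays have shape (frames, height, width, channels), values in [0, 1].
    The threshold is an illustrative choice, not taken from the paper.
    """
    diff = np.abs(with_obj.astype(np.float32) - without_obj.astype(np.float32))
    # A pixel is in the mask if any channel changed noticeably.
    return (diff.max(axis=-1) > threshold).astype(np.uint8)

# Toy pair: a 1-frame, 4x4 "video" where a single corner pixel differs.
a = np.zeros((1, 4, 4, 3), dtype=np.float32)
b = a.copy()
b[0, 0, 0] = 1.0  # the "side effect" region
mask = differential_mask(a, b)
print(mask.sum())  # 1 pixel flagged
```

In the real pipeline this mask comes from rendered paired videos and supervises the model's prediction of side-effect regions.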
+
+ ## Dependencies and Installation
+
+ 1. **Clone Repo**
+ ```bash
+ git clone https://github.com/Kunbyte-AI/ROSE.git
+ ```
+
+ 2. **Create Conda Environment and Install Dependencies**
+ ```bash
+ # create new anaconda env
+ conda create -n rose python=3.12 -y
+ conda activate rose
+
+ # install python dependencies
+ pip3 install -r requirements.txt
+ ```
+ - CUDA = 12.4
+ - PyTorch = 2.6.0
+ - Torchvision = 0.21.0
+ - Other required packages in `requirements.txt`
+
+ ## Usage (Quick Test)
+
+ To get started, you need to prepare the pretrained models first.
+
+ 1. **Prepare pretrained models**
+ We use the pretrained [`Wan2.1-Fun-1.3B-InP`](https://huggingface.co/alibaba-pai/Wan2.1-Fun-1.3B-InP) as our base model. During training, we only train the WanTransformer3D part and keep the other parts frozen. You can download the ROSE Transformer3D weights from this [`link`](https://huggingface.co/Kunbyte/ROSE).
+
+ For local inference, the `weights` directory should be arranged like this:
+ ```
+ weights
+ ├── transformer
+ │   ├── config.json
+ │   ├── diffusion_pytorch_model.safetensors
+ ```
+
+ It is also necessary to prepare the base model in the `models` directory. You can download the Wan2.1-Fun-1.3B-InP base model from this [`link`](https://huggingface.co/alibaba-pai/Wan2.1-Fun-1.3B-InP).
+
+ The `models` directory should be arranged like this:
+ ```
+ models
+ ├── Wan2.1-Fun-1.3B-InP
+ │   ├── google
+ │   │   ├── umt5-xxl
+ │   │   │   ├── spiece.model
+ │   │   │   ├── special_tokens_map.json
+ │   │   │   ...
+ │   ├── xlm-roberta-large
+ │   │   ├── sentencepiece.bpe.model
+ │   │   ├── tokenizer_config.json
+ │   │   ...
+ │   ├── config.json
+ │   ├── configuration.json
+ │   ├── diffusion_pytorch_model.safetensors
+ │   ├── models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth
+ │   ├── models_t5_umt5-xxl-enc-bf16.pth
+ │   ├── Wan2.1_VAE.pth
+ ```
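A quick pre-flight check can save a failed run caused by a misplaced file. The helper below is a hypothetical convenience script, not part of the ROSE repo; the path list mirrors the directory trees above (non-exhaustively) and may need adjusting to your layout.

```python
from pathlib import Path

# Hypothetical, non-exhaustive checklist mirroring the trees shown above.
REQUIRED = [
    "weights/transformer/config.json",
    "weights/transformer/diffusion_pytorch_model.safetensors",
    "models/Wan2.1-Fun-1.3B-InP/diffusion_pytorch_model.safetensors",
    "models/Wan2.1-Fun-1.3B-InP/models_t5_umt5-xxl-enc-bf16.pth",
    "models/Wan2.1-Fun-1.3B-InP/Wan2.1_VAE.pth",
]

def missing_files(root="."):
    """Return the required files that are not present under `root`."""
    return [p for p in REQUIRED if not (Path(root) / p).exists()]

if __name__ == "__main__":
    for p in missing_files():
        print("missing:", p)
```

Run it from the repository root before `inference.py`; an empty output means the checked files are in place.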
+
+ 2. **Run Inference**
+ We provide some examples in the [`data/eval`](https://github.com/Kunbyte-AI/ROSE/tree/main/data/eval) folder. Run the following command to try it out:
+ ```shell
+ python inference.py \
+   --validation_videos "path/to/your/video.mp4" \
+   --validation_masks "path/to/your/mask.mp4" \
+   --validation_prompts "" \
+   --output_dir "./output" \
+   --video_length 16 \
+   --sample_size 480 720
+ ```
+ For more options, refer to the usage information in the GitHub repository:
+ ```
+ Usage:
+
+ python inference.py [options]
+
+ Options:
+   --validation_videos     Path(s) to input videos
+   --validation_masks      Path(s) to mask videos
+   --validation_prompts    Text prompts (default: [""])
+   --output_dir            Output directory
+   --video_length          Number of frames per video (needs to be 16n+1)
+   --sample_size           Frame size: height width (default: 480 720)
+ ```
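Since `--video_length` is constrained to values of the form 16n+1, a tiny helper can snap an arbitrary frame count to the nearest valid value. This is a hypothetical convenience, not part of the repo:

```python
def nearest_valid_video_length(frames: int) -> int:
    """Round a frame count to the nearest value of the form 16n + 1
    (the constraint on --video_length noted above), with n >= 1."""
    n = max(1, round((frames - 1) / 16))
    return 16 * n + 1

print(nearest_valid_video_length(16))   # 17
print(nearest_valid_video_length(49))   # 49
print(nearest_valid_video_length(100))  # 97
```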
+ An interactive demo is also available on [Hugging Face Spaces](https://huggingface.co/spaces/Kunbyte/ROSE).
+
+ ## Results
+
+ ### Shadow
+ <table>
+ <thead>
+ <tr>
+ <th>Masked Input</th>
+ <th>Output</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Shadow/example-2/masked.gif" width="100%"></td>
+ <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Shadow/example-2/output.gif" width="100%"></td>
+ </tr>
+ <tr>
+ <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Shadow/example-7/masked.gif" width="100%"></td>
+ <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Shadow/example-7/output.gif" width="100%"></td>
+ </tr>
+ </tbody>
+ </table>
+
+ ### Reflection
+ <table>
+ <thead>
+ <tr>
+ <th>Masked Input</th>
+ <th>Output</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Reflection/example-1/masked.gif" width="100%"></td>
+ <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Reflection/example-1/output.gif" width="100%"></td>
+ </tr>
+ <tr>
+ <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Reflection/example-2/masked.gif" width="100%"></td>
+ <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Reflection/example-2/output.gif" width="100%"></td>
+ </tr>
+ </tbody>
+ </table>
+
+ ### Common
+ <table>
+ <thead>
+ <tr>
+ <th>Masked Input</th>
+ <th>Output</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Common/example-3/masked.gif" width="100%"></td>
+ <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Common/example-3/output.gif" width="100%"></td>
+ </tr>
+ <tr>
+ <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Common/example-15/masked.gif" width="100%"></td>
+ <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Common/example-15/output.gif" width="100%"></td>
+ </tr>
+ </tbody>
+ </table>
+
+ ### Light Source
+ <table>
+ <thead>
+ <tr>
+ <th>Masked Input</th>
+ <th>Output</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Light_source/example-4/masked.gif" width="100%"></td>
+ <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Light_source/example-4/output.gif" width="100%"></td>
+ </tr>
+ <tr>
+ <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Light_source/example-10/masked.gif" width="100%"></td>
+ <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Light_source/example-10/output.gif" width="100%"></td>
+ </tr>
+ </tbody>
+ </table>
+
+ ### Translucent
+ <table>
+ <thead>
+ <tr>
+ <th>Masked Input</th>
+ <th>Output</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Translucent/example-4/masked.gif" width="100%"></td>
+ <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Translucent/example-4/output.gif" width="100%"></td>
+ </tr>
+ <tr>
+ <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Translucent/example-5/masked.gif" width="100%"></td>
+ <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Translucent/example-5/output.gif" width="100%"></td>
+ </tr>
+ </tbody>
+ </table>
+
+ ### Mirror
+ <table>
+ <thead>
+ <tr>
+ <th>Masked Input</th>
+ <th>Output</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Mirror/example-1/masked.gif" width="100%"></td>
+ <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Mirror/example-1/output.gif" width="100%"></td>
+ </tr>
+ <tr>
+ <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Mirror/example-2/masked.gif" width="100%"></td>
+ <td><img src="https://github.com/Kunbyte-AI/ROSE/raw/main/assets/Mirror/example-2/output.gif" width="100%"></td>
+ </tr>
+ </tbody>
+ </table>
+
+ ## Overview
+ ![overall_structure](https://github.com/Kunbyte-AI/ROSE/raw/main/assets/rose_pipeline.png)
+
+ ## Citation
+
+ If you find our repo useful for your research, please consider citing our paper:
+
+ ```bibtex
+ @article{miao2025rose,
+   title={ROSE: Remove Objects with Side Effects in Videos},
+   author={Miao, Chenxuan and Feng, Yutong and Zeng, Jianshu and Gao, Zixiang and Liu, Hantang and Yan, Yunfeng and Qi, Donglian and Chen, Xi and Wang, Bin and Zhao, Hengshuang},
+   journal={arXiv preprint arXiv:2508.18633},
+   year={2025}
+ }
+ ```
+
+ ## Acknowledgement
+
+ This code is based on [Wan2.1-Fun-1.3B-Inpaint](https://github.com/aigc-apps/VideoX-Fun), and some code is adapted from [ProPainter](https://github.com/sczhou/ProPainter). Thanks for their awesome work!