bwang3579 committed · verified
Commit e3ae15c · 1 Parent(s): 575cb78

Update README.md

Files changed (1):
  README.md +152 -120
README.md CHANGED
@@ -43,7 +43,7 @@ torchrun --nproc_per_node $parallel run_parallel.py \
  --infer_steps 50 \
  --save_path ./results \
  --cfg_scale 9.0 \
- --motion_score 5 \
+ --motion_score 5.0 \
  --time_shift 12.573
  ```
  ## Motion Control
@@ -115,135 +115,167 @@ The default motion_score = 5 is suitable for general use. If you need more stabi
  </tr>
  </table>


- ## Table of Contents
-
- 1. [Introduction](#1-introduction)
- 2. [Model Summary](#2-model-summary)
- 3. [Model Download](#3-model-download)
- 4. [Model Usage](#4-model-usage)
- 5. [Benchmark](#5-benchmark)
- 6. [Online Engine](#6-online-engine)
- 7. [Citation](#7-citation)
- 8. [Acknowledgement](#8-acknowledgement)
-
- ## 1. Introduction
- We present **Step-Video-T2V**, a state-of-the-art (SoTA) text-to-video pre-trained model with 30 billion parameters and the capability to generate videos up to 204 frames long. To enhance both training and inference efficiency, we propose a deep-compression VAE for videos, achieving 16x16 spatial and 8x temporal compression ratios. Direct Preference Optimization (DPO) is applied in the final stage to further enhance the visual quality of the generated videos. Step-Video-T2V's performance is evaluated on a novel video generation benchmark, **Step-Video-T2V-Eval**, demonstrating its SoTA text-to-video quality compared to both open-source and commercial engines.
-
- ## 2. Model Summary
- In Step-Video-T2V, videos are represented by a high-compression Video-VAE, achieving 16x16 spatial and 8x temporal compression ratios. User prompts are encoded with two bilingual pre-trained text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames, with text embeddings and timesteps serving as conditioning factors. To further enhance the visual quality of the generated videos, a video-based DPO approach is applied, which effectively reduces artifacts and ensures smoother, more realistic video outputs.
-
- <p align="center">
-   <img width="80%" src="assets/model_architecture.png">
- </p>
-
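The Flow Matching objective mentioned above can be summarized with a short sketch (illustrative only, not the Step-Video-T2V training code; `dit`, `latents`, and `text_emb` are hypothetical stand-ins for the 3D-full-attention DiT, the Video-VAE latents, and the text-encoder embeddings):

```python
# A minimal, generic conditional Flow Matching training step on video latents.
# `dit`, `latents`, and `text_emb` are hypothetical stand-ins; the real training
# loop is not shown in this README.
import torch
import torch.nn.functional as F

def flow_matching_loss(dit, latents, text_emb):
    """One illustrative training step of conditional flow matching."""
    noise = torch.randn_like(latents)                         # x_1 ~ N(0, I)
    t = torch.rand(latents.shape[0], device=latents.device)   # uniform timesteps in [0, 1)
    t_ = t.view(-1, *([1] * (latents.dim() - 1)))             # broadcast over latent dims
    x_t = (1.0 - t_) * latents + t_ * noise                   # linear interpolation path
    v_target = noise - latents                                # constant velocity along the path
    v_pred = dit(x_t, timestep=t, text_emb=text_emb)          # DiT predicts the velocity
    return F.mse_loss(v_pred, v_target)
```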
- ### 2.1. Video-VAE
- A deep-compression Variational Autoencoder (Video-VAE) is designed for video generation tasks, achieving 16x16 spatial and 8x temporal compression ratios while maintaining exceptional video reconstruction quality. This compression not only accelerates training and inference but also aligns with the diffusion process's preference for condensed representations.
-
- <p align="center">
-   <img width="70%" src="assets/dcvae.png">
- </p>
-
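To make the 16x16 spatial and 8x temporal compression concrete, here is a back-of-the-envelope latent-shape calculation for a 544x992, 204-frame video (a sketch; the exact frame rounding and latent channel count depend on the VAE implementation and are not specified here):

```python
# Rough latent grid implied by 16x16 spatial and 8x temporal compression.
# Frame rounding is assumed to be "round up"; the real VAE may differ.
height, width, frames = 544, 992, 204

latent_h = height // 16          # 34
latent_w = width // 16           # 62
latent_t = (frames + 7) // 8     # 26 latent frames, rounding up

tokens = latent_h * latent_w * latent_t
print(f"{latent_t} x {latent_h} x {latent_w} latent grid ≈ {tokens:,} tokens per video")
```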
- ### 2.2. DiT w/ 3D Full Attention
- Step-Video-T2V is built on the DiT architecture, which has 48 layers, each containing 48 attention heads, with each head’s dimension set to 128. AdaLN-Single is leveraged to incorporate the timestep condition, while QK-Norm in the self-attention mechanism is introduced to ensure training stability. Additionally, 3D RoPE is employed, playing a critical role in handling sequences of varying video lengths and resolutions.
-
- <p align="center">
-   <img width="80%" src="assets/dit.png">
- </p>
-
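The model width follows directly from these numbers; the parameter estimate below is a standard dense-transformer rule of thumb, and the cross-attention term is an assumption rather than a documented detail:

```python
# Rough arithmetic implied by the reported DiT configuration
# (48 layers, 48 heads, head dim 128). Rule-of-thumb counts only,
# not official figures; the cross-attention term is an assumption.
layers, heads, head_dim = 48, 48, 128
d_model = heads * head_dim                 # 6144 hidden size

self_attn = 4 * d_model ** 2               # q, k, v, out projections
cross_attn = 4 * d_model ** 2              # assuming text conditioning via cross-attention
mlp = 8 * d_model ** 2                     # 4x-expansion feed-forward

per_layer = self_attn + cross_attn + mlp
print(f"d_model = {d_model}")
print(f"~{layers * per_layer / 1e9:.1f}B params in the blocks "
      f"(roughly consistent with the ~30B reported overall)")
```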
- ### 2.3. Video-DPO
- In Step-Video-T2V, we incorporate human feedback through Direct Preference Optimization (DPO) to further enhance the visual quality of the generated videos. DPO leverages human preference data to fine-tune the model, ensuring that the generated content aligns more closely with human expectations. The overall DPO pipeline is shown below, highlighting its critical role in improving both the consistency and quality of the video generation process.
-
- <p align="center">
-   <img width="100%" src="assets/dpo_pipeline.png">
- </p>
-
-
- ## 3. Model Download
- | Models | 🤗Huggingface | 🤖Modelscope |
- |:-------:|:-------:|:-------:|
- | Step-Video-T2V | [download](https://huggingface.co/stepfun-ai/stepvideo-t2v) | [download](https://www.modelscope.cn/models/stepfun-ai/stepvideo-t2v) |
- | Step-Video-T2V-Turbo (Inference Step Distillation) | [download](https://huggingface.co/stepfun-ai/stepvideo-t2v-turbo) | [download](https://www.modelscope.cn/models/stepfun-ai/stepvideo-t2v-turbo) |

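For scripted downloads, the weights can also be fetched with `huggingface_hub` using the repo ids from the table above (a sketch; the `local_dir` value is an arbitrary choice):

```python
# Download the weights listed above with the huggingface_hub client.
# The repo id comes from the table; local_dir is an arbitrary local path.
from huggingface_hub import snapshot_download

model_dir = snapshot_download(
    repo_id="stepfun-ai/stepvideo-t2v",   # or "stepfun-ai/stepvideo-t2v-turbo"
    local_dir="./stepvideo-t2v",
)
print("weights downloaded to", model_dir)
```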
- ## 4. Model Usage
- ### 📜 4.1 Requirements
-
- The following table shows the requirements for running the Step-Video-T2V model (batch size = 1, w/o cfg distillation) to generate videos:
-
- | Model | height/width/frames | Peak GPU Memory | 50 steps w/ flash-attn | 50 steps w/o flash-attn |
- |:------------:|:------------:|:------------:|:------------:|:------------:|
- | Step-Video-T2V | 544px × 992px × 204f | 77.64 GB | 743 s | 1232 s |
- | Step-Video-T2V | 544px × 992px × 136f | 72.48 GB | 408 s | 605 s |
-
- * An NVIDIA GPU with CUDA support is required.
- * The model has been tested on four GPUs.
- * **Recommended**: GPUs with 80 GB of memory are recommended for better generation quality.
- * Tested operating system: Linux
- * The self-attention in the text encoder (step_llm) only supports CUDA compute capabilities sm_80, sm_86, and sm_90 (see the check below).

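The compute-capability requirement in the last bullet can be verified with PyTorch's standard device queries (illustrative sketch):

```python
# Check that each visible GPU meets the sm_80 / sm_86 / sm_90 requirement
# noted above for the text encoder's self-attention.
import torch

supported = {(8, 0), (8, 6), (9, 0)}
for i in range(torch.cuda.device_count()):
    cap = torch.cuda.get_device_capability(i)
    name = torch.cuda.get_device_name(i)
    status = "ok" if cap in supported else "unsupported for step_llm self-attention"
    print(f"GPU {i}: {name}, sm_{cap[0]}{cap[1]} -> {status}")
```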
- ### 🔧 4.2 Dependencies and Installation
- - Python >= 3.10.0 (we recommend [Anaconda](https://www.anaconda.com/download/#linux) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html))
- - [PyTorch >= 2.3-cu121](https://pytorch.org/)
- - [CUDA Toolkit](https://developer.nvidia.com/cuda-downloads)
- - [FFmpeg](https://www.ffmpeg.org/)
-
- ```bash
- git clone https://github.com/stepfun-ai/Step-Video-TI2V.git
- conda create -n stepvideo python=3.10
- conda activate stepvideo
-
- cd Step-Video-TI2V
- pip install -e .
- ```

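A minimal sanity check of the environment listed above (Python version, PyTorch with CUDA, FFmpeg on PATH), as a sketch:

```python
# Quick environment check for the dependencies listed above
# (Python >= 3.10, PyTorch >= 2.3 with CUDA, FFmpeg). Illustrative only.
import shutil
import sys

import torch

assert sys.version_info >= (3, 10), "Python >= 3.10 is required"
print("torch", torch.__version__, "| CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available(),
      "| GPUs:", torch.cuda.device_count())
print("ffmpeg found:", shutil.which("ffmpeg") is not None)
```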
- ### 🚀 4.3 Inference Scripts
- - We employ a decoupling strategy for the text encoder, VAE decoding, and DiT to optimize the GPU utilization of the DiT. As a result, a dedicated GPU is needed to host the API services for the text encoder's embeddings and VAE decoding.
- ```bash
- ## We assume you have more than 4 GPUs available. This command returns the URLs
- ## for both the caption API and the VAE API; use them in the command below.
- python api/call_remote_server.py --model_dir where_you_download_dir &
-
- parallel=4  # or parallel=8
- url='127.0.0.1'
- model_dir=where_you_download_dir
-
- torchrun --nproc_per_node $parallel run_parallel.py \
-     --model_dir $model_dir \
-     --vae_url $url \
-     --caption_url $url \
-     --ulysses_degree $parallel \
-     --prompt "男孩笑起来" \
-     --first_image_path ./assets/demo.png \
-     --infer_steps 50 \
-     --save_path ./results \
-     --cfg_scale 9.0 \
-     --motion_score 5 \
-     --time_shift 12.573
- ```
-
- ### 🚀 4.4 Best Practices for Inference Settings
- Step-Video-T2V exhibits robust performance in inference settings, consistently generating high-fidelity and dynamic videos. However, our experiments reveal that variations in inference hyperparameters can have a substantial effect on the trade-off between video fidelity and dynamics. To achieve optimal results, we recommend the following best practices for tuning inference parameters:
-
- | Models | infer_steps | cfg_scale | time_shift | num_frames |
- |:-------:|:-------:|:-------:|:-------:|:-------:|
- | Step-Video-T2V | 30-50 | 9.0 | 13.0 | 204 |
- | Step-Video-T2V-Turbo (Inference Step Distillation) | 10-15 | 5.0 | 17.0 | 204 |
-
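The same recommendations, collected in one place for scripting the flags used in section 4.3 (a convenience sketch; the values pick the upper end of each `infer_steps` range, and `num_frames` is omitted because the command above does not expose it):

```python
# Recommended inference settings from the table above, rendered as the
# command-line flags used by run_parallel.py in section 4.3.
RECOMMENDED = {
    "Step-Video-T2V":       {"infer_steps": 50, "cfg_scale": 9.0, "time_shift": 13.0},
    "Step-Video-T2V-Turbo": {"infer_steps": 15, "cfg_scale": 5.0, "time_shift": 17.0},
}

def to_flags(model: str) -> str:
    """Render the recommended settings for `model` as command-line flags."""
    return " ".join(f"--{k} {v}" for k, v in RECOMMENDED[model].items())

print(to_flags("Step-Video-T2V-Turbo"))
# --infer_steps 15 --cfg_scale 5.0 --time_shift 17.0
```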
- ## 5. Benchmark
- We are releasing [Step-Video-T2V Eval](https://github.com/stepfun-ai/Step-Video-T2V/blob/main/benchmark/Step-Video-T2V-Eval) as a new benchmark, featuring 128 Chinese prompts sourced from real users. This benchmark is designed to evaluate the quality of generated videos across 11 distinct categories: Sports, Food, Scenery, Animals, Festivals, Combination Concepts, Surreal, People, 3D Animation, Cinematography, and Style.
-
- ## 6. Online Engine
- The online version of Step-Video-T2V is available on [跃问视频](https://yuewen.cn/videos), where you can also explore some impressive examples.
-
- ## 7. Citation
- ```
- @misc{
- }
- ```
-
- ## 8. Acknowledgement
- - We would like to express our sincere thanks to the [xDiT](https://github.com/xdit-project/xDiT) team for their invaluable support and parallelization strategy.
- - Our code will be integrated into the official repository of [Hugging Face Diffusers](https://github.com/huggingface/diffusers).
- - We thank the [FastVideo](https://github.com/hao-ai-lab/FastVideo) team for their continued collaboration and look forward to launching inference acceleration solutions together in the near future.

+ ## 5. Benchmark
+
+ We introduce [Step-Video-TI2V-Eval](https://github.com/stepfun-ai/Step-Video-T2V/blob/main/benchmark/Step-Video-T2V-Eval), a new benchmark designed for the text-driven image-to-video generation task. The dataset comprises 178 real-world and 120 anime-style prompt-image pairs, ensuring broad coverage of diverse user scenarios. To achieve comprehensive representation, we developed a fine-grained schema for data collection in both categories.
+
+ <table border="0" style="width: 100%; text-align: center; margin-top: 1px;">
+ <tr>
+ <th style="width: 20%;">vs. OSTopA</th>
+ <th style="width: 20%;">vs. OSTopB</th>
+ <th style="width: 20%;">vs. CSTopC</th>
+ <th style="width: 20%;">vs. CSTopD</th>
+ </tr>
+ <tr>
+ <td>37-63-79</td>
+ <td>101-48-29</td>
+ <td>41-46-73</td>
+ <td>92-51-18</td>
+ </tr>
+ <tr>
+ <td>40-35-44</td>
+ <td>94-16-10</td>
+ <td>52-35-47</td>
+ <td>87-18-17</td>
+ </tr>
+ <tr>
+ <td>46-92-39</td>
+ <td>43-71-64</td>
+ <td>45-65-50</td>
+ <td>36-77-47</td>
+ </tr>
+ <tr>
+ <td>42-61-18</td>
+ <td>50-35-35</td>
+ <td>29-62-43</td>
+ <td>37-63-23</td>
+ </tr>
+ <tr>
+ <td>52-57-49</td>
+ <td>71-40-66</td>
+ <td>58-33-69</td>
+ <td>67-33-60</td>
+ </tr>
+ <tr>
+ <td>75-17-28</td>
+ <td>67-30-24</td>
+ <td>78-17-39</td>
+ <td>68-41-14</td>
+ </tr>
+ <tr>
+ <th colspan="4">Total Score</th>
+ </tr>
+ <tr>
+ <td>292-325-277</td>
+ <td>426-240-228</td>
+ <td>303-258-321</td>
+ <td>387-283-179</td>
+ </tr>
+ </table>
+ <p style="text-align: center;"><strong>Table 1: Comparison with baseline TI2V models using Step-Video-TI2V-Eval.</strong></p>
+
+ [VBench](https://arxiv.org/html/2411.13503v1) is a comprehensive benchmark suite that deconstructs “video generation quality” into specific, hierarchical, and disentangled dimensions, each with tailored prompts and evaluation methods. We use the VBench-I2V benchmark to assess the performance of Step-Video-TI2V alongside other TI2V models.
+
+ <table border="0" style="width: 100%; text-align: center; margin-top: 1px;">
+ <tr>
+ <th style="width: 20%;">Scores</th>
+ <th style="width: 20%;">Step-Video-TI2V (motion=10)</th>
+ <th style="width: 20%;">Step-Video-TI2V (motion=5)</th>
+ <th style="width: 20%;">OSTopA</th>
+ <th style="width: 20%;">OSTopB</th>
+ </tr>
+ <tr>
+ <td><strong>Total Score</strong></td>
+ <td><strong>87.98</strong></td>
+ <td>87.80</td>
+ <td>87.49</td>
+ <td>86.77</td>
+ </tr>
+ <tr>
+ <td><strong>I2V Score</strong></td>
+ <td>95.11</td>
+ <td><strong>95.50</strong></td>
+ <td>94.63</td>
+ <td>93.25</td>
+ </tr>
+ <tr>
+ <td>Video-Text Camera Motion</td>
+ <td>48.15</td>
+ <td><strong>49.22</strong></td>
+ <td>29.58</td>
+ <td>46.45</td>
+ </tr>
+ <tr>
+ <td>Video-Image Subject Consistency</td>
+ <td>97.44</td>
+ <td><strong>97.85</strong></td>
+ <td>97.73</td>
+ <td>95.88</td>
+ </tr>
+ <tr>
+ <td>Video-Image Background Consistency</td>
+ <td>98.45</td>
+ <td>98.63</td>
+ <td><strong>98.83</strong></td>
+ <td>96.47</td>
+ </tr>
+ <tr>
+ <td><strong>Quality Score</strong></td>
+ <td><strong>80.86</strong></td>
+ <td>80.11</td>
+ <td>80.36</td>
+ <td>80.28</td>
+ </tr>
+ <tr>
+ <td>Subject Consistency</td>
+ <td>95.62</td>
+ <td>96.02</td>
+ <td>94.52</td>
+ <td><strong>96.28</strong></td>
+ </tr>
+ <tr>
+ <td>Background Consistency</td>
+ <td>96.92</td>
+ <td>97.06</td>
+ <td>96.47</td>
+ <td><strong>97.38</strong></td>
+ </tr>
+ <tr>
+ <td>Motion Smoothness</td>
+ <td>99.08</td>
+ <td><strong>99.24</strong></td>
+ <td>98.09</td>
+ <td>99.10</td>
+ </tr>
+ <tr>
+ <td>Dynamic Degree</td>
+ <td>48.78</td>
+ <td>36.58</td>
+ <td><strong>53.41</strong></td>
+ <td>38.13</td>
+ </tr>
+ <tr>
+ <td>Aesthetic Quality</td>
+ <td>61.74</td>
+ <td><strong>62.29</strong></td>
+ <td>61.04</td>
+ <td>61.82</td>
+ </tr>
+ <tr>
+ <td>Imaging Quality</td>
+ <td>70.17</td>
+ <td>70.43</td>
+ <td><strong>71.12</strong></td>
+ <td>70.82</td>
+ </tr>
+ </table>
+
+ <p style="text-align: center;"><strong>Table 3: Comparison with two open-source TI2V models using VBench-I2V.</strong></p>