bwang3579 committed
Commit bc9b6d4 · verified · 1 Parent(s): 594ab4e

Update README.md

Files changed (1):
1. README.md +8 -205
README.md CHANGED
@@ -24,7 +24,7 @@ pipeline_tag: image-to-video
* Mar 17, 2025: 🎉 We have made our technical report available as open source. [Read](https://arxiv.org/abs/2502.10248)


- ### 🚀 Inference Scripts
+ ## 🚀 Inference Scripts
- We employed a decoupling strategy for the text encoder, VAE decoding, and DiT to optimize GPU resource utilization for DiT. As a result, a dedicated GPU is needed to host the API services for the text encoder's embeddings and VAE decoding.
```bash
python api/call_remote_server.py --model_dir where_you_download_dir &  ## We assume you have more than 4 GPUs available. This command returns the URLs for both the caption API and the VAE API. Use the returned URLs in the following command.
@@ -46,209 +46,12 @@ torchrun --nproc_per_node $parallel run_parallel.py \
    --motion_score 5.0 \
    --time_shift 12.573
```
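Taken together, the two hunks describe a two-stage launch: a background process serving the caption (text-encoder) and VAE APIs, plus a torchrun job that keeps the DiT GPUs busy. Below is a minimal sketch of how the stages connect; only `--motion_score` and `--time_shift` are visible in this diff, so the URL-passing flags and values are assumptions for illustration.

```bash
# Minimal sketch of the decoupled launch.
# Assumed for illustration (not shown in this diff): --model_dir, --vae_url,
# and --caption_url on run_parallel.py, and the example URL value.
model_dir=where_you_download_dir   # same directory passed to both stages
parallel=4                         # GPUs used by DiT; one more GPU hosts the APIs

# Stage 1: start the caption/VAE API server in the background;
# it prints the API URL(s) consumed below.
python api/call_remote_server.py --model_dir $model_dir &

# Stage 2: run parallel DiT inference against those APIs.
url=127.0.0.1                      # replace with the URL printed by stage 1
torchrun --nproc_per_node $parallel run_parallel.py \
    --model_dir $model_dir \
    --vae_url $url \
    --caption_url $url \
    --motion_score 5.0 \
    --time_shift 12.573
```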
- ## Motion Control

- <table border="0" style="width: 100%; text-align: center; margin-top: 1px;">
- <tr>
- <td><video src="https://github.com/user-attachments/assets/3c6a5c8d-ada4-484f-8f3d-f2a99ef18a4b" width="30%" controls autoplay loop muted></video></td>
- <td><video src="https://github.com/user-attachments/assets/90c608d9-b3cf-40fa-b4ee-21b682c840ae" width="30%" controls autoplay loop muted></video></td>
- <td><video src="https://github.com/user-attachments/assets/e58d3a6b-0076-4587-aac5-6911ba4c776d" width="30%" controls autoplay loop muted></video></td>
- </tr>
- </table>

- ## Motion Amplitude Control
-
- <table border="0" style="width: 100%; text-align: center; margin-top: 10px;">
- <tr>
- <th style="width: 33%;">Motion = 2</th>
- <th style="width: 33%;">Motion = 5</th>
- <th style="width: 33%;">Motion = 10</th>
- </tr>
- <tr>
- <td><video src="https://github.com/user-attachments/assets/0d6b1813-2bf0-462a-8ad4-c0583d83afc5" width="33%" controls autoplay loop muted></video></td>
- <td><video src="https://github.com/user-attachments/assets/33699654-93cc-4205-8a47-93ece4282f72" width="33%" controls autoplay loop muted></video></td>
- <td><video src="https://github.com/user-attachments/assets/52d73eb5-2c68-4de3-9019-516243804b2c" width="33%" controls autoplay loop muted></video></td>
- </tr>
- </table>
-
- <table border="0" style="width: 100%; text-align: center; margin-top: 10px;">
- <tr>
- <th style="width: 33%;">Motion = 2</th>
- <th style="width: 33%;">Motion = 5</th>
- <th style="width: 33%;">Motion = 20</th>
- </tr>
- <tr>
- <td><video src="https://github.com/user-attachments/assets/31c48385-fe83-4961-bd42-7bd2b1edeb19" width="33%" controls autoplay loop muted></video></td>
- <td><video src="https://github.com/user-attachments/assets/913a407e-55ca-4a33-bafe-bd5e38eec5f5" width="33%" controls autoplay loop muted></video></td>
- <td><video src="https://github.com/user-attachments/assets/119a3673-014f-4772-b846-718307a4a412" width="33%" controls autoplay loop muted></video></td>
- </tr>
- </table>
-
- 🎯 Tips
- The default motion_score = 5 is suitable for general use. If you need more stability, set motion_score = 2, though it may be less responsive to certain movements. For greater freedom of movement, use motion_score = 10 or motion_score = 20 to enable more intense actions. Feel free to tune motion_score to fit your use case.
-
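If you are unsure which amplitude fits a given input, a simple option is to sweep the values mentioned above and compare the outputs. A minimal sketch, assuming the invocation from the diff above plus a hypothetical `--save_path` flag for separating results:

```bash
# Sweep motion_score over the values discussed above and keep each result.
# Assumptions: run_parallel.py accepts the same flags as in the diff above,
# plus a --save_path flag (hypothetical) for the output directory.
parallel=4
for score in 2 5 10 20; do
    torchrun --nproc_per_node $parallel run_parallel.py \
        --motion_score $score \
        --time_shift 12.573 \
        --save_path ./results_motion_${score}
done
```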
- ## Camera Control
-
- <table border="0" style="width: 100%; text-align: center; margin-top: 1px;">
- <tr>
- <th style="width: 33%;">Camera orbits</th>
- <th style="width: 33%;">Camera pushes in</th>
- <th style="width: 33%;">Camera pulls out</th>
- </tr>
- <tr>
- <td><video src="https://github.com/user-attachments/assets/257847bc-5967-45ba-a649-505859476aad" height="30%" controls autoplay loop muted></video></td>
- <td><video src="https://github.com/user-attachments/assets/d310502a-4f7e-4a78-882f-95c46b4dfe67" height="30%" controls autoplay loop muted></video></td>
- <td><video src="https://github.com/user-attachments/assets/f6426fc7-2a18-474c-9766-fc8ae8d8d40d" height="30%" controls autoplay loop muted></video></td>
- </tr>
- </table>
-
- <table border="0" style="width: 100%; text-align: center; margin-top: 1px;">
- <tr>
- <th style="width: 33%;">Fixed camera</th>
- <th style="width: 33%;">Camera moves left</th>
- <th style="width: 33%;">Camera pans right</th>
- </tr>
- <tr>
- <td><video src="https://github.com/user-attachments/assets/f78f76a0-afe1-41b1-9914-f2f508c6ea50" width="30%" controls autoplay loop muted></video></td>
- <td><video src="https://github.com/user-attachments/assets/3894ec0f-d483-41fe-8331-68b6e5bf6544" width="30%" controls autoplay loop muted></video></td>
- <td><video src="https://github.com/user-attachments/assets/9de3aa20-c797-4dac-bef1-ee064ed96ed4" width="30%" controls autoplay loop muted></video></td>
- </tr>
- </table>
-
- ## 5. Benchmark
-
- We build [Step-Video-TI2V-Eval](https://github.com/stepfun-ai/Step-Video-T2V/blob/main/benchmark/Step-Video-T2V-Eval), a new benchmark designed for the text-driven image-to-video generation task. The dataset comprises 178 real-world and 120 anime-style prompt-image pairs, ensuring broad coverage of diverse user scenarios. To achieve comprehensive representation, we developed a fine-grained schema for data collection in both categories. Each cell in the table below reports head-to-head results against the corresponding model as Win-Tie-Loss counts.
-
- <table border="0" style="width: 100%; text-align: center; margin-top: 10px; border-collapse: collapse; border-radius: 8px; overflow: hidden;">
- <thead>
- <tr>
- <th style="width: 25%; padding: 10px;">vs. OSTopA</th>
- <th style="width: 25%; padding: 10px;">vs. OSTopB</th>
- <th style="width: 25%; padding: 10px;">vs. CSTopC</th>
- <th style="width: 25%; padding: 10px;">vs. CSTopD</th>
- </tr>
- </thead>
- <tbody>
- <tr><td>37-63-79</td><td>101-48-29</td><td>41-46-73</td><td>92-51-18</td></tr>
- <tr><td>40-35-44</td><td>94-16-10</td><td>52-35-47</td><td>87-18-17</td></tr>
- <tr><td>46-92-39</td><td>43-71-64</td><td>45-65-50</td><td>36-77-47</td></tr>
- <tr><td>42-61-18</td><td>50-35-35</td><td>29-62-43</td><td>37-63-23</td></tr>
- <tr><td>52-57-49</td><td>71-40-66</td><td>58-33-69</td><td>67-33-60</td></tr>
- <tr><td>75-17-28</td><td>67-30-24</td><td>78-17-39</td><td>68-41-14</td></tr>
- <tr>
- <td colspan="4" style="padding: 10px; font-weight: bold;">Total Score</td>
- </tr>
- <tr>
- <td>292-325-277</td>
- <td>426-240-228</td>
- <td>303-258-321</td>
- <td>387-283-179</td>
- </tr>
- </tbody>
- </table>
-
- [VBench](https://arxiv.org/html/2411.13503v1) is a comprehensive benchmark suite that deconstructs “video generation quality” into specific, hierarchical, and disentangled dimensions, each with tailored prompts and evaluation methods. We utilize the VBench-I2V benchmark to assess the performance of Step-Video-TI2V alongside other TI2V models.
-
- <table border="0" style="width: 100%; text-align: center; margin-top: 1px;">
- <tr>
- <th style="width: 20%;">Scores</th>
- <th style="width: 20%;">Step-Video-TI2V (motion=10)</th>
- <th style="width: 20%;">Step-Video-TI2V (motion=5)</th>
- <th style="width: 20%;">OSTopA</th>
- <th style="width: 20%;">OSTopB</th>
- </tr>
- <tr>
- <td><strong>Total Score</strong></td>
- <td><strong>87.98</strong></td>
- <td>87.80</td>
- <td>87.49</td>
- <td>86.77</td>
- </tr>
- <tr>
- <td><strong>I2V Score</strong></td>
- <td>95.11</td>
- <td><strong>95.50</strong></td>
- <td>94.63</td>
- <td>93.25</td>
- </tr>
- <tr>
- <td>Video-Text Camera Motion</td>
- <td>48.15</td>
- <td><strong>49.22</strong></td>
- <td>29.58</td>
- <td>46.45</td>
- </tr>
- <tr>
- <td>Video-Image Subject Consistency</td>
- <td>97.44</td>
- <td><strong>97.85</strong></td>
- <td>97.73</td>
- <td>95.88</td>
- </tr>
- <tr>
- <td>Video-Image Background Consistency</td>
- <td>98.45</td>
- <td>98.63</td>
- <td><strong>98.83</strong></td>
- <td>96.47</td>
- </tr>
- <tr>
- <td><strong>Quality Score</strong></td>
- <td><strong>80.86</strong></td>
- <td>80.11</td>
- <td>80.36</td>
- <td>80.28</td>
- </tr>
- <tr>
- <td>Subject Consistency</td>
- <td>95.62</td>
- <td><strong>96.02</strong></td>
- <td>94.52</td>
- <td><strong>96.28</strong></td>
- </tr>
- <tr>
- <td>Background Consistency</td>
- <td>96.92</td>
- <td>97.06</td>
- <td>96.47</td>
- <td><strong>97.38</strong></td>
- </tr>
- <tr>
- <td>Motion Smoothness</td>
- <td>99.08</td>
- <td><strong>99.24</strong></td>
- <td>98.09</td>
- <td>99.10</td>
- </tr>
- <tr>
- <td>Dynamic Degree</td>
- <td>48.78</td>
- <td>36.58</td>
- <td><strong>53.41</strong></td>
- <td>38.13</td>
- </tr>
- <tr>
- <td>Aesthetic Quality</td>
- <td>61.74</td>
- <td><strong>62.29</strong></td>
- <td>61.04</td>
- <td>61.82</td>
- </tr>
- <tr>
- <td>Imaging Quality</td>
- <td>70.17</td>
- <td>70.43</td>
- <td><strong>71.12</strong></td>
- <td>70.82</td>
- </tr>
- </table>
-
- <p style="text-align: center;"><strong>Table 3: Comparison with two open-source TI2V models using VBench-I2V.</strong></p>
 
+ The following table shows the requirements for running the Step-Video-TI2V model (batch size = 1, w/o CFG distillation) to generate videos:
+
+ | GPUs | Height × Width × Frames | Peak GPU Memory | Time (50 steps) |
+ |------|--------------------------|-----------------|-----------------|
+ | 1 | 768px × 768px × 102f | 76.42 GB | 1061 s |
+ | 1 | 544px × 992px × 102f | 75.49 GB | 929 s |
+ | 4 | 768px × 768px × 102f | 64.63 GB | 288 s |
+ | 4 | 544px × 992px × 102f | 64.34 GB | 251 s |
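Before choosing between the single-GPU (~76 GB peak) and 4-GPU (~65 GB per-GPU peak) configurations above, it can help to confirm the memory actually available on each device. This sketch uses only standard nvidia-smi query fields:

```bash
# Print per-GPU total and free memory to verify headroom against the peaks above.
nvidia-smi --query-gpu=index,name,memory.total,memory.free --format=csv
```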