Update README.md

README.md (CHANGED)

@@ -24,7 +24,7 @@ pipeline_tag: image-to-video
* Mar 17, 2025: 🎉 We have made our technical report available as open source. [Read](https://arxiv.org/abs/2502.10248)


-
+## 🚀 Inference Scripts
- We employed a decoupling strategy for the text encoder, VAE decoding, and DiT so that the GPUs running the DiT are fully utilized. As a result, a dedicated GPU is needed to host the API services for the text encoder's embeddings and for VAE decoding.
```bash
python api/call_remote_server.py --model_dir where_you_download_dir &  ## We assume you have more than 4 GPUs available. This command returns the URLs for both the caption API and the VAE API; use the returned URLs in the command below.
```

@@ -46,209 +46,12 @@ torchrun --nproc_per_node $parallel run_parallel.py \
```bash
--motion_score 5.0 \
--time_shift 12.573
```
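
To tie the two command fragments above together, here is a minimal end-to-end launch sketch. Only `api/call_remote_server.py`, `--model_dir`, `run_parallel.py`, `--nproc_per_node`, `--motion_score`, and `--time_shift` actually appear in this diff; the URL discovery via a log file, the wait time, and reusing `--model_dir` for `run_parallel.py` are assumptions, not the project's confirmed interface.

```bash
# Minimal sketch of the decoupled launch flow, under the assumptions above.
model_dir=where_you_download_dir

# 1) Start the text-encoder / VAE API service on its dedicated GPU.
#    Assumption: the script prints the caption and VAE API URLs to stdout.
python api/call_remote_server.py --model_dir "$model_dir" > server.log 2>&1 &
sleep 60                          # give the service time to come up (a guess)
grep -o 'http[^ ]*' server.log    # read off the caption and VAE API URLs

# 2) Launch parallel DiT inference with the flags shown in this diff.
parallel=4
torchrun --nproc_per_node "$parallel" run_parallel.py \
  --model_dir "$model_dir" \
  --motion_score 5.0 \
  --time_shift 12.573
  # ...plus the caption/VAE URL flags printed in step 1 (not shown in this diff)
```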
-## Motion Control
-
-<table border="0" style="width: 100%; text-align: center; margin-top: 10px;">
-  <tr>
-    <td><video src="https://github.com/user-attachments/assets/3c6a5c8d-ada4-484f-8f3d-f2a99ef18a4b" width="30%" controls autoplay loop muted></video></td>
-    <td><video src="https://github.com/user-attachments/assets/90c608d9-b3cf-40fa-b4ee-21b682c840ae" width="30%" controls autoplay loop muted></video></td>
-    <td><video src="https://github.com/user-attachments/assets/e58d3a6b-0076-4587-aac5-6911ba4c776d" width="30%" controls autoplay loop muted></video></td>
-  </tr>
-</table>
-
-<table border="0" style="width: 100%; text-align: center; margin-top: 10px;">
-  <tr>
-    <th style="width: 33%;">Motion = 2</th>
-    <th style="width: 33%;">Motion = 5</th>
-    <th style="width: 33%;">Motion = 10</th>
-  </tr>
-  <tr>
-    <td><video src="https://github.com/user-attachments/assets/0d6b1813-2bf0-462a-8ad4-c0583d83afc5" width="33%" controls autoplay loop muted></video></td>
-    <td><video src="https://github.com/user-attachments/assets/33699654-93cc-4205-8a47-93ece4282f72" width="33%" controls autoplay loop muted></video></td>
-    <td><video src="https://github.com/user-attachments/assets/52d73eb5-2c68-4de3-9019-516243804b2c" width="33%" controls autoplay loop muted></video></td>
-  </tr>
-</table>
-
-<table border="0" style="width: 100%; text-align: center; margin-top: 10px;">
-  <tr>
-    <th style="width: 33%;">Motion = 2</th>
-    <th style="width: 33%;">Motion = 5</th>
-    <th style="width: 33%;">Motion = 20</th>
-  </tr>
-  <tr>
-    <td><video src="https://github.com/user-attachments/assets/31c48385-fe83-4961-bd42-7bd2b1edeb19" width="33%" controls autoplay loop muted></video></td>
-    <td><video src="https://github.com/user-attachments/assets/913a407e-55ca-4a33-bafe-bd5e38eec5f5" width="33%" controls autoplay loop muted></video></td>
-    <td><video src="https://github.com/user-attachments/assets/119a3673-014f-4772-b846-718307a4a412" width="33%" controls autoplay loop muted></video></td>
-  </tr>
-</table>
-
-🎯 Tips
-
-The default motion_score = 5 is suitable for general use. If you need more stability, set motion_score = 2, though it may be less responsive to certain movements. For more dynamic results, use motion_score = 10 or motion_score = 20 to enable more intense actions. Feel free to adjust motion_score to fit your use case.
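
As a rough illustration of the tip above, one could sweep the suggested settings. This is a sketch, not documented usage: it reuses only the flags visible in this diff and elides every other required argument.

```bash
# Sweep the motion-intensity settings from the tip above
# (2 = more stable, 5 = default, 10/20 = more intense motion).
parallel=4
for score in 2 5 10 20; do
  torchrun --nproc_per_node "$parallel" run_parallel.py \
    --motion_score "$score" \
    --time_shift 12.573   # remaining flags as in the inference command above
done
```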
-
-## Camera Control
-
-<table border="0" style="width: 100%; text-align: center; margin-top: 1px;">
-  <tr>
-    <th style="width: 33%;">Camera orbit</th>
-    <th style="width: 33%;">Camera push-in</th>
-    <th style="width: 33%;">Camera pull-back</th>
-  </tr>
-  <tr>
-    <td><video src="https://github.com/user-attachments/assets/257847bc-5967-45ba-a649-505859476aad" height="30%" controls autoplay loop muted></video></td>
-    <td><video src="https://github.com/user-attachments/assets/d310502a-4f7e-4a78-882f-95c46b4dfe67" height="30%" controls autoplay loop muted></video></td>
-    <td><video src="https://github.com/user-attachments/assets/f6426fc7-2a18-474c-9766-fc8ae8d8d40d" height="30%" controls autoplay loop muted></video></td>
-  </tr>
-</table>
-
-<table border="0" style="width: 100%; text-align: center; margin-top: 1px;">
-  <tr>
-    <th style="width: 33%;">Fixed camera</th>
-    <th style="width: 33%;">Camera moves left</th>
-    <th style="width: 33%;">Camera pans right</th>
-  </tr>
-  <tr>
-    <td><video src="https://github.com/user-attachments/assets/f78f76a0-afe1-41b1-9914-f2f508c6ea50" width="30%" controls autoplay loop muted></video></td>
-    <td><video src="https://github.com/user-attachments/assets/3894ec0f-d483-41fe-8331-68b6e5bf6544" width="30%" controls autoplay loop muted></video></td>
-    <td><video src="https://github.com/user-attachments/assets/9de3aa20-c797-4dac-bef1-ee064ed96ed4" width="30%" controls autoplay loop muted></video></td>
-  </tr>
-</table>
-
-## 5. Benchmark
-
-We build [Step-Video-TI2V-Eval](https://github.com/stepfun-ai/Step-Video-T2V/blob/main/benchmark/Step-Video-T2V-Eval), a new benchmark designed for the text-driven image-to-video generation task. The dataset comprises 178 real-world and 120 anime-style prompt-image pairs, ensuring broad coverage of diverse user scenarios. To achieve comprehensive representation, we developed a fine-grained schema for data collection in both categories.
-
-<table border="0" style="width: 100%; text-align: center; margin-top: 10px; border-collapse: collapse; border-radius: 8px; overflow: hidden;">
-  <thead>
-    <tr>
-      <th style="width: 25%; padding: 10px;">vs. OSTopA</th>
-      <th style="width: 25%; padding: 10px;">vs. OSTopB</th>
-      <th style="width: 25%; padding: 10px;">vs. CSTopC</th>
-      <th style="width: 25%; padding: 10px;">vs. CSTopD</th>
-    </tr>
-  </thead>
-  <tbody>
-    <tr><td>37-63-79</td><td>101-48-29</td><td>41-46-73</td><td>92-51-18</td></tr>
-    <tr><td>40-35-44</td><td>94-16-10</td><td>52-35-47</td><td>87-18-17</td></tr>
-    <tr><td>46-92-39</td><td>43-71-64</td><td>45-65-50</td><td>36-77-47</td></tr>
-    <tr><td>42-61-18</td><td>50-35-35</td><td>29-62-43</td><td>37-63-23</td></tr>
-    <tr><td>52-57-49</td><td>71-40-66</td><td>58-33-69</td><td>67-33-60</td></tr>
-    <tr><td>75-17-28</td><td>67-30-24</td><td>78-17-39</td><td>68-41-14</td></tr>
-    <tr>
-      <td colspan="4" style="padding: 10px; font-weight: bold;">Total Score</td>
-    </tr>
-    <tr>
-      <td>292-325-277</td>
-      <td>426-240-228</td>
-      <td>303-258-321</td>
-      <td>387-283-179</td>
-    </tr>
-  </tbody>
-</table>
-
-[VBench](https://arxiv.org/html/2411.13503v1) is a comprehensive benchmark suite that deconstructs “video generation quality” into specific, hierarchical, and disentangled dimensions, each with tailored prompts and evaluation methods. We use the VBench-I2V benchmark to assess the performance of Step-Video-TI2V alongside other TI2V models.
-
-<table border="0" style="width: 100%; text-align: center; margin-top: 1px;">
-  <tr>
-    <th style="width: 20%;">Scores</th>
-    <th style="width: 20%;">Step-Video-TI2V (motion=10)</th>
-    <th style="width: 20%;">Step-Video-TI2V (motion=5)</th>
-    <th style="width: 20%;">OSTopA</th>
-    <th style="width: 20%;">OSTopB</th>
-  </tr>
-  <tr>
-    <td><strong>Total Score</strong></td>
-    <td><strong>87.98</strong></td>
-    <td>87.80</td>
-    <td>87.49</td>
-    <td>86.77</td>
-  </tr>
-  <tr>
-    <td><strong>I2V Score</strong></td>
-    <td>95.11</td>
-    <td><strong>95.50</strong></td>
-    <td>94.63</td>
-    <td>93.25</td>
-  </tr>
-  <tr>
-    <td>Video-Text Camera Motion</td>
-    <td>48.15</td>
-    <td><strong>49.22</strong></td>
-    <td>29.58</td>
-    <td>46.45</td>
-  </tr>
-  <tr>
-    <td>Video-Image Subject Consistency</td>
-    <td>97.44</td>
-    <td><strong>97.85</strong></td>
-    <td>97.73</td>
-    <td>95.88</td>
-  </tr>
-  <tr>
-    <td>Video-Image Background Consistency</td>
-    <td>98.45</td>
-    <td>98.63</td>
-    <td><strong>98.83</strong></td>
-    <td>96.47</td>
-  </tr>
-  <tr>
-    <td><strong>Quality Score</strong></td>
-    <td><strong>80.86</strong></td>
-    <td>80.11</td>
-    <td>80.36</td>
-    <td>80.28</td>
-  </tr>
-  <tr>
-    <td>Subject Consistency</td>
-    <td>95.62</td>
-    <td>96.02</td>
-    <td>94.52</td>
-    <td><strong>96.28</strong></td>
-  </tr>
-  <tr>
-    <td>Background Consistency</td>
-    <td>96.92</td>
-    <td>97.06</td>
-    <td>96.47</td>
-    <td><strong>97.38</strong></td>
-  </tr>
-  <tr>
-    <td>Motion Smoothness</td>
-    <td>99.08</td>
-    <td><strong>99.24</strong></td>
-    <td>98.09</td>
-    <td>99.10</td>
-  </tr>
-  <tr>
-    <td>Dynamic Degree</td>
-    <td>48.78</td>
-    <td>36.58</td>
-    <td><strong>53.41</strong></td>
-    <td>38.13</td>
-  </tr>
-  <tr>
-    <td>Aesthetic Quality</td>
-    <td>61.74</td>
-    <td><strong>62.29</strong></td>
-    <td>61.04</td>
-    <td>61.82</td>
-  </tr>
-  <tr>
-    <td>Imaging Quality</td>
-    <td>70.17</td>
-    <td>70.43</td>
-    <td><strong>71.12</strong></td>
-    <td>70.82</td>
-  </tr>
-</table>
-
-<p style="text-align: center;"><strong>Table 3: Comparison with two open-source TI2V models using VBench-I2V.</strong></p>
+
+The following table shows the requirements for running the Step-Video-T2V model (batch size = 1, w/o cfg distillation) to generate videos:
+
+| GPUs | Height × Width × Frames | Peak GPU Memory | Time (50 steps) |
+|------|-------------------------|-----------------|-----------------|
+| 1 | 768px × 768px × 102f | 76.42 GB | 1061s |
+| 1 | 544px × 992px × 102f | 75.49 GB | 929s |
+| 4 | 768px × 768px × 102f | 64.63 GB | 288s |
+| 4 | 544px × 992px × 102f | 64.34 GB | 251s |
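
A quick sanity check on the scaling these rows imply: going from one GPU to four cuts the 50-step time by roughly 3.7× at both resolutions.

```bash
# Speedup implied by the table above (1 GPU vs. 4 GPUs, 50 steps):
awk 'BEGIN { printf "768x768x102f: %.2fx\n544x992x102f: %.2fx\n", 1061/288, 929/251 }'
```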