Update README.md

README.md (CHANGED)

@@ -24,7 +24,7 @@ pipeline_tag: image-to-video
* Mar 17, 2025: 🎉 We have made our technical report available as open source. [Read](https://arxiv.org/abs/2502.10248)


-
+## 🚀 Inference Scripts
- We employed a decoupling strategy for the text encoder, VAE decoding, and DiT so that the GPUs running the DiT are fully utilized. As a result, a dedicated GPU is needed to host the API services for the text encoder's embeddings and for VAE decoding.
```bash
python api/call_remote_server.py --model_dir where_you_download_dir &  ## We assume you have more than 4 GPUs available. This command returns the URLs for both the caption API and the VAE API; use the returned URLs in the command below.
```

@@ -46,209 +46,12 @@ torchrun --nproc_per_node $parallel run_parallel.py \
```bash
--motion_score 5.0 \
--time_shift 12.573
```
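
To tie the two command fragments above together, here is a minimal end-to-end launch sketch. Only `api/call_remote_server.py`, `--model_dir`, `run_parallel.py`, `--nproc_per_node`, `--motion_score`, and `--time_shift` actually appear in this diff; the URL discovery via a log file, the wait time, and reusing `--model_dir` for `run_parallel.py` are assumptions, not the project's confirmed interface.

```bash
# Minimal sketch of the decoupled launch flow, under the assumptions above.
model_dir=where_you_download_dir

# 1) Start the text-encoder / VAE API service on its dedicated GPU.
#    Assumption: the script prints the caption and VAE API URLs to stdout.
python api/call_remote_server.py --model_dir "$model_dir" > server.log 2>&1 &
sleep 60                          # give the service time to come up (a guess)
grep -o 'http[^ ]*' server.log    # read off the caption and VAE API URLs

# 2) Launch parallel DiT inference with the flags shown in this diff.
parallel=4
torchrun --nproc_per_node "$parallel" run_parallel.py \
  --model_dir "$model_dir" \
  --motion_score 5.0 \
  --time_shift 12.573
  # ...plus the caption/VAE URL flags printed in step 1 (not shown in this diff)
```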
-## Motion Control
-
-<table border="0" style="width: 100%; text-align: center; margin-top: 10px;">
-  <tr>
-    <td><video src="https://github.com/user-attachments/assets/3c6a5c8d-ada4-484f-8f3d-f2a99ef18a4b" width="30%" controls autoplay loop muted></video></td>
-    <td><video src="https://github.com/user-attachments/assets/90c608d9-b3cf-40fa-b4ee-21b682c840ae" width="30%" controls autoplay loop muted></video></td>
-    <td><video src="https://github.com/user-attachments/assets/e58d3a6b-0076-4587-aac5-6911ba4c776d" width="30%" controls autoplay loop muted></video></td>
-  </tr>
-</table>
-
-<table border="0" style="width: 100%; text-align: center; margin-top: 10px;">
-  <tr>
-    <th style="width: 33%;">Motion = 2</th>
-    <th style="width: 33%;">Motion = 5</th>
-    <th style="width: 33%;">Motion = 10</th>
-  </tr>
-  <tr>
-    <td><video src="https://github.com/user-attachments/assets/0d6b1813-2bf0-462a-8ad4-c0583d83afc5" width="33%" controls autoplay loop muted></video></td>
-    <td><video src="https://github.com/user-attachments/assets/33699654-93cc-4205-8a47-93ece4282f72" width="33%" controls autoplay loop muted></video></td>
-    <td><video src="https://github.com/user-attachments/assets/52d73eb5-2c68-4de3-9019-516243804b2c" width="33%" controls autoplay loop muted></video></td>
-  </tr>
-</table>
-
-<table border="0" style="width: 100%; text-align: center; margin-top: 10px;">
-  <tr>
-    <th style="width: 33%;">Motion = 2</th>
-    <th style="width: 33%;">Motion = 5</th>
-    <th style="width: 33%;">Motion = 20</th>
-  </tr>
-  <tr>
-    <td><video src="https://github.com/user-attachments/assets/31c48385-fe83-4961-bd42-7bd2b1edeb19" width="33%" controls autoplay loop muted></video></td>
-    <td><video src="https://github.com/user-attachments/assets/913a407e-55ca-4a33-bafe-bd5e38eec5f5" width="33%" controls autoplay loop muted></video></td>
-    <td><video src="https://github.com/user-attachments/assets/119a3673-014f-4772-b846-718307a4a412" width="33%" controls autoplay loop muted></video></td>
-  </tr>
-</table>
-
-🎯 Tips
-
-The default motion_score = 5 is suitable for general use. If you need more stability, set motion_score = 2, though it may be less responsive to certain movements. For more dynamic results, use motion_score = 10 or motion_score = 20 to enable more intense actions. Feel free to adjust motion_score to fit your use case.
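
As a rough illustration of the tip above, one could sweep the suggested settings. This is a sketch, not documented usage: it reuses only the flags visible in this diff and elides every other required argument.

```bash
# Sweep the motion-intensity settings from the tip above
# (2 = more stable, 5 = default, 10/20 = more intense motion).
parallel=4
for score in 2 5 10 20; do
  torchrun --nproc_per_node "$parallel" run_parallel.py \
    --motion_score "$score" \
    --time_shift 12.573   # remaining flags as in the inference command above
done
```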
-
-## Camera Control
-
-<table border="0" style="width: 100%; text-align: center; margin-top: 1px;">
-  <tr>
-    <th style="width: 33%;">Camera orbit</th>
-    <th style="width: 33%;">Camera push-in</th>
-    <th style="width: 33%;">Camera pull-back</th>
-  </tr>
-  <tr>
-    <td><video src="https://github.com/user-attachments/assets/257847bc-5967-45ba-a649-505859476aad" height="30%" controls autoplay loop muted></video></td>
-    <td><video src="https://github.com/user-attachments/assets/d310502a-4f7e-4a78-882f-95c46b4dfe67" height="30%" controls autoplay loop muted></video></td>
-    <td><video src="https://github.com/user-attachments/assets/f6426fc7-2a18-474c-9766-fc8ae8d8d40d" height="30%" controls autoplay loop muted></video></td>
-  </tr>
-</table>
-
-<table border="0" style="width: 100%; text-align: center; margin-top: 1px;">
-  <tr>
-    <th style="width: 33%;">Fixed camera</th>
-    <th style="width: 33%;">Camera moves left</th>
-    <th style="width: 33%;">Camera pans right</th>
-  </tr>
-  <tr>
-    <td><video src="https://github.com/user-attachments/assets/f78f76a0-afe1-41b1-9914-f2f508c6ea50" width="30%" controls autoplay loop muted></video></td>
-    <td><video src="https://github.com/user-attachments/assets/3894ec0f-d483-41fe-8331-68b6e5bf6544" width="30%" controls autoplay loop muted></video></td>
-    <td><video src="https://github.com/user-attachments/assets/9de3aa20-c797-4dac-bef1-ee064ed96ed4" width="30%" controls autoplay loop muted></video></td>
-  </tr>
-</table>
-
-## 5. Benchmark
-
-We build [Step-Video-TI2V-Eval](https://github.com/stepfun-ai/Step-Video-T2V/blob/main/benchmark/Step-Video-T2V-Eval), a new benchmark designed for the text-driven image-to-video generation task. The dataset comprises 178 real-world and 120 anime-style prompt-image pairs, ensuring broad coverage of diverse user scenarios. To achieve comprehensive representation, we developed a fine-grained schema for data collection in both categories.
-
-<table border="0" style="width: 100%; text-align: center; margin-top: 10px; border-collapse: collapse; border-radius: 8px; overflow: hidden;">
-  <thead>
-    <tr>
-      <th style="width: 25%; padding: 10px;">vs. OSTopA</th>
-      <th style="width: 25%; padding: 10px;">vs. OSTopB</th>
-      <th style="width: 25%; padding: 10px;">vs. CSTopC</th>
-      <th style="width: 25%; padding: 10px;">vs. CSTopD</th>
-    </tr>
-  </thead>
-  <tbody>
-    <tr><td>37-63-79</td><td>101-48-29</td><td>41-46-73</td><td>92-51-18</td></tr>
-    <tr><td>40-35-44</td><td>94-16-10</td><td>52-35-47</td><td>87-18-17</td></tr>
-    <tr><td>46-92-39</td><td>43-71-64</td><td>45-65-50</td><td>36-77-47</td></tr>
-    <tr><td>42-61-18</td><td>50-35-35</td><td>29-62-43</td><td>37-63-23</td></tr>
-    <tr><td>52-57-49</td><td>71-40-66</td><td>58-33-69</td><td>67-33-60</td></tr>
-    <tr><td>75-17-28</td><td>67-30-24</td><td>78-17-39</td><td>68-41-14</td></tr>
-    <tr>
-      <td colspan="4" style="padding: 10px; font-weight: bold;">Total Score</td>
-    </tr>
-    <tr>
-      <td>292-325-277</td>
-      <td>426-240-228</td>
-      <td>303-258-321</td>
-      <td>387-283-179</td>
-    </tr>
-  </tbody>
-</table>
-
-[VBench](https://arxiv.org/html/2411.13503v1) is a comprehensive benchmark suite that deconstructs “video generation quality” into specific, hierarchical, and disentangled dimensions, each with tailored prompts and evaluation methods. We use the VBench-I2V benchmark to assess the performance of Step-Video-TI2V alongside other TI2V models.
-
-<table border="0" style="width: 100%; text-align: center; margin-top: 1px;">
-  <tr>
-    <th style="width: 20%;">Scores</th>
-    <th style="width: 20%;">Step-Video-TI2V (motion=10)</th>
-    <th style="width: 20%;">Step-Video-TI2V (motion=5)</th>
-    <th style="width: 20%;">OSTopA</th>
-    <th style="width: 20%;">OSTopB</th>
-  </tr>
-  <tr>
-    <td><strong>Total Score</strong></td>
-    <td><strong>87.98</strong></td>
-    <td>87.80</td>
-    <td>87.49</td>
-    <td>86.77</td>
-  </tr>
-  <tr>
-    <td><strong>I2V Score</strong></td>
-    <td>95.11</td>
-    <td><strong>95.50</strong></td>
-    <td>94.63</td>
-    <td>93.25</td>
-  </tr>
-  <tr>
-    <td>Video-Text Camera Motion</td>
-    <td>48.15</td>
-    <td><strong>49.22</strong></td>
-    <td>29.58</td>
-    <td>46.45</td>
-  </tr>
-  <tr>
-    <td>Video-Image Subject Consistency</td>
-    <td>97.44</td>
-    <td><strong>97.85</strong></td>
-    <td>97.73</td>
-    <td>95.88</td>
-  </tr>
-  <tr>
-    <td>Video-Image Background Consistency</td>
-    <td>98.45</td>
-    <td>98.63</td>
-    <td><strong>98.83</strong></td>
-    <td>96.47</td>
-  </tr>
-  <tr>
-    <td><strong>Quality Score</strong></td>
-    <td><strong>80.86</strong></td>
-    <td>80.11</td>
-    <td>80.36</td>
-    <td>80.28</td>
-  </tr>
-  <tr>
-    <td>Subject Consistency</td>
-    <td>95.62</td>
-    <td>96.02</td>
-    <td>94.52</td>
-    <td><strong>96.28</strong></td>
-  </tr>
-  <tr>
-    <td>Background Consistency</td>
-    <td>96.92</td>
-    <td>97.06</td>
-    <td>96.47</td>
-    <td><strong>97.38</strong></td>
-  </tr>
-  <tr>
-    <td>Motion Smoothness</td>
-    <td>99.08</td>
-    <td><strong>99.24</strong></td>
-    <td>98.09</td>
-    <td>99.10</td>
-  </tr>
-  <tr>
-    <td>Dynamic Degree</td>
-    <td>48.78</td>
-    <td>36.58</td>
-    <td><strong>53.41</strong></td>
-    <td>38.13</td>
-  </tr>
-  <tr>
-    <td>Aesthetic Quality</td>
-    <td>61.74</td>
-    <td><strong>62.29</strong></td>
-    <td>61.04</td>
-    <td>61.82</td>
-  </tr>
-  <tr>
-    <td>Imaging Quality</td>
-    <td>70.17</td>
-    <td>70.43</td>
-    <td><strong>71.12</strong></td>
-    <td>70.82</td>
-  </tr>
-</table>
-
-<p style="text-align: center;"><strong>Table 3: Comparison with two open-source TI2V models using VBench-I2V.</strong></p>
+
+The following table shows the requirements for running the Step-Video-T2V model (batch size = 1, w/o cfg distillation) to generate videos:
+
+| GPUs | Height × Width × Frames | Peak GPU Memory | Time (50 steps) |
+|------|-------------------------|-----------------|-----------------|
+| 1 | 768px × 768px × 102f | 76.42 GB | 1061s |
+| 1 | 544px × 992px × 102f | 75.49 GB | 929s |
+| 4 | 768px × 768px × 102f | 64.63 GB | 288s |
+| 4 | 544px × 992px × 102f | 64.34 GB | 251s |
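
A quick sanity check on the scaling these rows imply: going from one GPU to four cuts the 50-step time by roughly 3.7× at both resolutions.

```bash
# Speedup implied by the table above (1 GPU vs. 4 GPUs, 50 steps):
awk 'BEGIN { printf "768x768x102f: %.2fx\n544x992x102f: %.2fx\n", 1061/288, 929/251 }'
```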