---
license: other
license_link: https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE
language:
  - en
tags:
  - cogvideox
  - video-generation
  - thudm
  - text-to-video
inference: false
---

# CogVideoX-5B

<p style="text-align: center;">
  <div align="center">
  <img src=https://github.com/THUDM/CogVideo/raw/main/resources/logo.svg width="50%"/>
  </div>
  <p align="center">
  <a href="https://huggingface.co/THUDM/CogVideoX-5b/blob/main/README_zh.md">📄 中文阅读</a> | 
  <a href="https://huggingface.co/spaces/THUDM/CogVideoX-5B">🤗 Huggingface Space</a> |
  <a href="https://github.com/THUDM/CogVideo">🌐 Github </a> | 
  <a href="https://arxiv.org/pdf/2408.06072">📜 arxiv </a>
</p>

## Demo Show

## Model Introduction

CogVideoX is an open-source video generation model that shares the same origins as [清影](https://chatglm.cn/video).
The table below provides a list of the video generation models we currently offer, along with their basic information.

<table style="border-collapse: collapse; width: 100%;">
  <tr>
    <th style="text-align: center;">Model Name</th>
    <th style="text-align: center;">CogVideoX-2B</th>
    <th style="text-align: center;">CogVideoX-5B (Current Repository)</th>
  </tr>
  <tr>
    <td style="text-align: center;">Model Introduction</td>
    <td style="text-align: center;">An entry-level model with good compatibility. Low cost for running and secondary development.</td>
    <td style="text-align: center;">A larger model with higher video generation quality and better visual effects.</td>
  </tr>
  <tr>
    <td style="text-align: center;">Inference Precision</td>
    <td style="text-align: center;">FP16, FP32<br><b>BF16 is not supported</b> </td>
    <td style="text-align: center;">BF16, FP32<br><b>FP16 is not supported</b> </td>
  </tr>
  <tr>
    <td style="text-align: center;">Inference Speed<br>(Step = 50)</td>
    <td style="text-align: center;">FP16: ~90* s</td>
    <td style="text-align: center;">BF16: ~200* s</td>
  </tr>
  <tr>
    <td style="text-align: center;">Single GPU Memory Consumption</td>
    <td style="text-align: center;">18GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br><b>12GB* using diffusers</b><br></td>
    <td style="text-align: center;">26GB using <a href="https://github.com/THUDM/SwissArmyTransformer">SAT</a><br><b>21GB* using diffusers</b><br></td>
  </tr>
  <tr>
    <td style="text-align: center;">Multi-GPU Inference Memory Consumption</td>
    <td style="text-align: center;"><b>10GB* using diffusers</b><br></td>
    <td style="text-align: center;"><b>15GB* using diffusers</b><br></td>
  </tr>
  <tr>
    <td style="text-align: center;">Fine-Tuning Memory Consumption (Per GPU)</td>
    <td style="text-align: center;">47 GB (bs=1, LORA)<br>61 GB (bs=2, LORA)<br>62GB (bs=1, SFT)</td>
    <td style="text-align: center;">63 GB (bs=1, LORA)<br>80 GB (bs=2, LORA)<br>75GB (bs=1, SFT)<br></td>
  </tr>
  <tr>
    <td style="text-align: center;">Prompt Language</td>
    <td colspan="2" style="text-align: center;">English*</td>
  </tr>
  <tr>
    <td style="text-align: center;">Maximum Prompt Length</td>
    <td colspan="2" style="text-align: center;">226 Tokens</td>
  </tr>
  <tr>
    <td style="text-align: center;">Video Length</td>
    <td colspan="2" style="text-align: center;">6 seconds</td>
  </tr>
  <tr>
    <td style="text-align: center;">Frame Rate</td>
    <td colspan="2" style="text-align: center;">8 frames per second</td>
  </tr>
  <tr>
    <td style="text-align: center;">Video Resolution</td>
    <td colspan="2" style="text-align: center;">720 x 480, does not support other resolutions (including fine-tuning)</td>
  </tr>
  <tr>
    <td style="text-align: center;">Positional Encoding</td>
    <td style="text-align: center;">3d_sincos_pos_embed</td>
    <td style="text-align: center;">3d_rope_pos_embed<br></td>
  </tr>
</table>

**Data Explanation**

+ When testing with the diffusers library, `enable_model_cpu_offload()` and `pipe.vae.enable_tiling()` were
  enabled. This configuration was tested only on **NVIDIA A100 / H100** devices, but it should generally work on all
  **NVIDIA Ampere architecture** devices and above. Disabling these optimizations significantly increases memory usage,
  with peak usage approximately 3 times the values shown in the table.
+ For multi-GPU inference, `enable_model_cpu_offload()` must be disabled.
+ Inference speed tests used the memory optimizations above. Without them, inference is around 10% faster.
+ The model supports only English input. Prompts in other languages can be translated into English when they are
  refined with a large language model.

+ **Note**: Use [SAT](https://github.com/THUDM/SwissArmyTransformer) for inference and fine-tuning of SAT version
  models. Feel free to visit our GitHub for more information.
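
The precision and memory figures above can be captured in a small lookup for choosing a dtype and budgeting GPU memory. This is only a sketch: the dictionary keys, `recommended_dtype`, and `estimated_peak_gb` are hypothetical names, the numbers are copied from the table, and the factor of 3 is the approximate multiplier stated in the notes rather than a measured value.

```python
# Figures copied from the table above (diffusers, single GPU, with CPU offload and VAE tiling).
# All names here are illustrative and not part of diffusers or CogVideoX.
MODEL_SPECS = {
    "CogVideoX-2B": {"dtype": "float16", "diffusers_gb": 12},
    "CogVideoX-5B": {"dtype": "bfloat16", "diffusers_gb": 21},
}

# Approximate multiplier from the notes above when the optimizations are disabled.
UNOPTIMIZED_FACTOR = 3


def recommended_dtype(model_name: str) -> str:
    """Return the name of the half-precision dtype the table lists for the model."""
    return MODEL_SPECS[model_name]["dtype"]


def estimated_peak_gb(model_name: str, optimized: bool = True) -> int:
    """Rough single-GPU memory estimate in GB, with or without the optimizations."""
    base = MODEL_SPECS[model_name]["diffusers_gb"]
    return base if optimized else base * UNOPTIMIZED_FACTOR


if __name__ == "__main__":
    for name in MODEL_SPECS:
        print(name, recommended_dtype(name), estimated_peak_gb(name, optimized=False))
```

The dtype strings match attribute names in PyTorch, so `getattr(torch, recommended_dtype(name))` could be passed as `torch_dtype` when loading a pipeline.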

## Quick Start 🤗

This model can be deployed with the Hugging Face diffusers library by following these steps.

**We recommend that you visit our [GitHub](https://github.com/THUDM/CogVideo) and check out the relevant prompt
optimizations and conversions to get a better experience.**

1. Install the required dependencies

```shell
pip install --upgrade opencv-python transformers accelerate diffusers
```

2. Run the code

```python
import gc
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16
)

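# Move submodules to the GPU only while they are in use to reduce peak GPU memory.
pipe.enable_model_cpu_offload()

# Free cached memory and reset CUDA memory statistics before generation.
gc.collect()
torch.cuda.empty_cache()
torch.cuda.reset_accumulated_memory_stats()
torch.cuda.reset_peak_memory_stats()

# Decode the video in tiles to lower VAE memory usage.
pipe.vae.enable_tiling()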

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```

If the generated video appears “all green” and cannot be viewed in the default macOS player, this is expected (it is
caused by how OpenCV saves the video). Simply use a different player to view the video.
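
The `num_frames=49` and `fps=8` values in the script above line up with the 6 second, 8 frames per second output listed in the table. A minimal sketch of that arithmetic follows, assuming the clip consists of one initial frame plus `seconds * fps` further frames; the helper names are hypothetical and not part of diffusers.

```python
def frames_for_clip(seconds: int, fps: int) -> int:
    """Frame count for a clip, assuming one initial frame plus seconds * fps frames."""
    return seconds * fps + 1


def clip_duration(num_frames: int, fps: int) -> float:
    """Duration in seconds under the same assumption (the first frame marks t = 0)."""
    return (num_frames - 1) / fps


if __name__ == "__main__":
    print(frames_for_clip(6, 8))   # 49
    print(clip_duration(49, 8))    # 6.0
```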

## Explore the Model

Welcome to our [GitHub](https://github.com/THUDM/CogVideo), where you will find:

1. More detailed technical information and code explanations.
2. Prompt optimization and conversion.
3. Inference and fine-tuning of SAT version models, and even pre-releases.
4. Project update logs and more opportunities to interact.
5. The CogVideoX toolchain to help you make better use of the model.

## Model License

This model is released under the [CogVideoX LICENSE](LICENSE).

## Citation

```
@article{yang2024cogvideox,
  title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
  author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
  journal={arXiv preprint arXiv:2408.06072},
  year={2024}
}
```