---
license: other
license_link: https://huggingface.co/THUDM/CogVideoX-5b/blob/main/LICENSE
language:
- en
tags:
- cogvideox
- video-generation
- thudm
- text-to-video
inference: false
---
# CogVideoX-5B
| Model Name | CogVideoX-2B | CogVideoX-5B (Current Repository) |
|---|---|---|
| Model Introduction | An entry-level model with good compatibility. Low cost for running and secondary development. | A larger model with higher video generation quality and better visual effects. |
| Inference Precision | FP16, FP32 (BF16 not supported) | BF16, FP32 (FP16 not supported) |
| Inference Speed (Step = 50) | FP16: ~90* s | BF16: ~200* s |
| Single GPU Memory Consumption | 18 GB using SAT / 12 GB* using diffusers | 26 GB using SAT / 21 GB* using diffusers |
| Multi-GPU Inference Memory Consumption | 10 GB* using diffusers | 15 GB* using diffusers |
| Fine-Tuning Memory Consumption (Per GPU) | 47 GB (bs=1, LoRA) / 61 GB (bs=2, LoRA) / 62 GB (bs=1, SFT) | 63 GB (bs=1, LoRA) / 80 GB (bs=2, LoRA) / 75 GB (bs=1, SFT) |
| Prompt Language | English* | English* |
| Maximum Prompt Length | 226 Tokens | 226 Tokens |
| Video Length | 6 seconds | 6 seconds |
| Frame Rate | 8 frames per second | 8 frames per second |
| Video Resolution | 720 x 480, does not support other resolutions (including fine-tuning) | 720 x 480, does not support other resolutions (including fine-tuning) |
| Positional Encoding | 3d_sincos_pos_embed | 3d_rope_pos_embed |
**Data Explanation**
+ When testing with the diffusers library, the `enable_model_cpu_offload()` and `pipe.vae.enable_tiling()` options were
  enabled. This configuration has not been tested on devices other than **NVIDIA A100 / H100**, but it should generally
  work on all GPUs of the **NVIDIA Ampere architecture** or newer. Disabling these optimizations significantly increases
  memory usage, with peak usage approximately 3 times the values shown in the table (see the sketch after this list).
+ For multi-GPU inference, `enable_model_cpu_offload()` must be disabled.
+ Inference speed tests used the above memory optimization options. Without these optimizations, inference speed
increases by around 10%.
+ The model supports English prompts only. Prompts in other languages should be translated to English, for example with
  a large language model, before use.
+ **Note**: Use [SAT](https://github.com/THUDM/SwissArmyTransformer) for inference and fine-tuning of the SAT version of
  the model. Feel free to visit our GitHub for more information.
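
As an illustration of the first point above, the sketch below loads the pipeline entirely onto a single GPU, with CPU offloading and VAE tiling disabled. This is only a minimal sketch, assuming a GPU with enough memory for the higher peak usage (roughly 3x the table values, on the order of an 80 GB A100 / H100); the prompt is a shortened placeholder.

```python
import torch
from diffusers import CogVideoXPipeline

# The 5B model is intended for BF16 (see the table above); check GPU support first.
assert torch.cuda.is_bf16_supported(), "This GPU does not support bfloat16."

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16,
).to("cuda")  # keep the whole pipeline on the GPU instead of enable_model_cpu_offload()

# No pipe.vae.enable_tiling() here: peak memory is higher, inference is somewhat faster.
video = pipe(
    prompt="A panda playing a tiny guitar in a bamboo forest.",  # placeholder prompt
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
).frames[0]
```

The ~10% speed difference noted above comes largely from avoiding the CPU-GPU transfers that model offloading performs during inference.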
## Quick Start 🤗
This model supports deployment using the Hugging Face diffusers library. You can deploy it by following the steps below.
**We recommend that you visit our [GitHub](https://github.com/THUDM/CogVideo) and check out the relevant prompt
optimizations and conversions to get a better experience.**
1. Install the required dependencies
```shell
pip install --upgrade opencv-python transformers accelerate diffusers
```
2. Run the code
```python
import gc

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16
)

# Offload model components to the CPU when idle to reduce GPU memory usage.
pipe.enable_model_cpu_offload()

# Clear cached memory and reset CUDA memory counters before generation.
gc.collect()
torch.cuda.empty_cache()
torch.cuda.reset_accumulated_memory_stats()
torch.cuda.reset_peak_memory_stats()

# Decode the video in tiles to further reduce peak VAE memory usage.
pipe.vae.enable_tiling()

video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,  # 6-second video at 8 frames per second
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```
If the generated video appears "all green" and cannot be viewed in the default macOS player, this is a known issue
(caused by how OpenCV saves the video). Simply use a different player to view the video.
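
Because the script above resets the CUDA memory counters before generation, you can optionally report peak GPU memory afterwards. A minimal sketch, to be run after the pipeline call:

```python
# Peak GPU memory allocated during generation, in GiB.
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak GPU memory: {peak_gib:.2f} GiB")
```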
## Explore the Model
Welcome to our [GitHub](https://github.com/THUDM/CogVideo), where you will find:
1. More detailed technical explanations and code.
2. Prompt optimization and conversion.
3. Inference and fine-tuning of the SAT version of the model, including pre-release versions.
4. Project update logs and more opportunities for interaction.
5. The CogVideoX toolchain to help you make better use of the model.
## Model License
This model is released under the [CogVideoX LICENSE](LICENSE).
## Citation
```
@article{yang2024cogvideox,
title={CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer},
author={Yang, Zhuoyi and Teng, Jiayan and Zheng, Wendi and Ding, Ming and Huang, Shiyu and Xu, Jiazheng and Yang, Yuanming and Hong, Wenyi and Zhang, Xiaohan and Feng, Guanyu and others},
journal={arXiv preprint arXiv:2408.06072},
year={2024}
}
```