HunyuanVideo
HunyuanVideo by Tencent.
Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at this https URL.
Make sure to check out the Schedulers guide to learn how to explore the tradeoff between scheduler speed and quality, and see the reuse components across pipelines section to learn how to efficiently load the same components into multiple pipelines.
Recommendations for inference:
- Both text encoders should be in torch.float16.
- Transformer should be in torch.bfloat16.
- VAE should be in torch.float16.
- num_frames should be of the form 4 * k + 1, for example 49 or 129.
- For smaller resolution videos, try lower values of shift (between 2.0 and 5.0) in the Scheduler. For larger resolution videos, try higher values (between 7.0 and 12.0). The default value is 7.0 for HunyuanVideo.
- For more information about supported resolutions and other details, please refer to the original repository here.
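The frame-count rule and the shift recommendation above can be applied before calling the pipeline. The following is a minimal sketch: the helper names is_valid_num_frames, round_down_num_frames and with_shift are hypothetical and not part of diffusers. with_shift assumes the pipeline's scheduler is a FlowMatchEulerDiscreteScheduler and rebuilds it from its existing config with a different shift.

```python
def is_valid_num_frames(num_frames: int) -> bool:
    """Return True if num_frames has the form 4 * k + 1 for some k >= 0."""
    return num_frames >= 1 and (num_frames - 1) % 4 == 0


def round_down_num_frames(num_frames: int) -> int:
    """Round down to the nearest value of the form 4 * k + 1 (minimum 1)."""
    if num_frames < 1:
        return 1
    return 4 * ((num_frames - 1) // 4) + 1


def with_shift(pipeline, shift: float):
    """Rebuild the pipeline's scheduler with a different shift value.

    Assumption: pipeline.scheduler is a FlowMatchEulerDiscreteScheduler.
    The import is local so the helpers above work without diffusers installed.
    """
    from diffusers import FlowMatchEulerDiscreteScheduler

    pipeline.scheduler = FlowMatchEulerDiscreteScheduler.from_config(
        pipeline.scheduler.config, shift=shift
    )
    return pipeline
```

For example, a requested length of 60 frames would be rounded down to 57, while 49 and 129 pass through unchanged.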
Available models
The following models are available for the HunyuanVideoPipeline pipeline:
| Model name | Description |
|---|---|
| hunyuanvideo-community/HunyuanVideo | Official HunyuanVideo (guidance-distilled). Performs well across multiple resolutions and frame counts. Performs best with guidance_scale=6.0, true_cfg_scale=1.0 and without a negative prompt. |
| https://huggingface.co/Skywork/SkyReels-V1-Hunyuan-T2V | Skywork’s custom finetune of HunyuanVideo (de-distilled). Performs best at 97x544x960 resolution with guidance_scale=1.0, true_cfg_scale=6.0 and a negative prompt. |
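The recommended settings for the two text-to-video checkpoints can be collected in a small lookup so that call arguments stay consistent. This is only a sketch: RECOMMENDED_SETTINGS and build_call_kwargs are hypothetical names, and reading 97x544x960 as num_frames x height x width is an assumption.

```python
from typing import Any, Dict, Optional

# Values copied from the model descriptions above.
# Assumption: "97x544x960" means num_frames=97, height=544, width=960.
RECOMMENDED_SETTINGS: Dict[str, Dict[str, Any]] = {
    "hunyuanvideo-community/HunyuanVideo": {
        "guidance_scale": 6.0,
        "true_cfg_scale": 1.0,
        "wants_negative_prompt": False,
    },
    "Skywork/SkyReels-V1-Hunyuan-T2V": {
        "guidance_scale": 1.0,
        "true_cfg_scale": 6.0,
        "num_frames": 97,
        "height": 544,
        "width": 960,
        "wants_negative_prompt": True,
    },
}


def build_call_kwargs(
    model_id: str, prompt: str, negative_prompt: Optional[str] = None
) -> Dict[str, Any]:
    """Assemble keyword arguments for the pipeline call from the lookup."""
    settings = RECOMMENDED_SETTINGS[model_id]
    kwargs: Dict[str, Any] = {"prompt": prompt}
    for key in ("guidance_scale", "true_cfg_scale", "num_frames", "height", "width"):
        if key in settings:
            kwargs[key] = settings[key]
    if settings["wants_negative_prompt"]:
        if negative_prompt is None:
            raise ValueError(f"{model_id} is recommended with a negative prompt")
        kwargs["negative_prompt"] = negative_prompt
    return kwargs
```

The resulting dictionary could then be unpacked into the pipeline call, for example pipe(**build_call_kwargs(model_id, prompt)).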
The following models are available for the image-to-video pipeline:
| Model name | Description |
|---|---|
| Skywork/SkyReels-V1-Hunyuan-I2V | Skywork’s custom finetune of HunyuanVideo (de-distilled). Performs best at 97x544x960 resolution with guidance_scale=1.0, true_cfg_scale=6.0 and a negative prompt. |
| hunyuanvideo-community/HunyuanVideo-I2V | Tencent’s official HunyuanVideo I2V model. Performs best at resolutions of 480, 720, 960, 1280. A higher shift value when initializing the scheduler is recommended (good values are between 7 and 20). |
Quantization
Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.
Refer to the Quantization overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized HunyuanVideoPipeline for inference with bitsandbytes.
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
from diffusers.utils import export_to_video

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = HunyuanVideoTransformer3DModel.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)

pipeline = HunyuanVideoPipeline.from_pretrained(
    "hunyuanvideo-community/HunyuanVideo",
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "A cat walks on the grass, realistic style."
video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
export_to_video(video, "cat.mp4", fps=15)
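As a rough illustration of why lower precision reduces memory, the arithmetic below estimates weight storage alone. It is a back-of-the-envelope sketch, not a measurement: weight_memory_gb is a hypothetical helper, the 13 billion figure is taken from the abstract's "over 13 billion parameters", and activations, the text encoders, the VAE and any quantization overhead are ignored.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate storage for model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: roughly 13e9 parameters, per the abstract.
NUM_PARAMS = 13e9

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16 bits per weight
int8_gb = weight_memory_gb(NUM_PARAMS, 8)   # 8 bits per weight
```

Under these assumptions the weights need about 26 GB at 16 bits and about 13 GB at 8 bits.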
HunyuanVideoPipeline
class diffusers.HunyuanVideoPipeline
(
    text_encoder: LlamaModel
    tokenizer: LlamaTokenizerFast
    transformer: HunyuanVideoTransformer3DModel
    vae: AutoencoderKLHunyuanVideo
    scheduler: FlowMatchEulerDiscreteScheduler
    text_encoder_2: CLIPTextModel
    tokenizer_2: CLIPTokenizer
)
Parameters
- text_encoder (LlamaModel) — Llava Llama3-8B.
- tokenizer (LlamaTokenizer) — Tokenizer from Llava Llama3-8B.
- transformer (HunyuanVideoTransformer3DModel) — Conditional Transformer to denoise the encoded image latents.
- scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with transformer to denoise the encoded image latents.
- vae (AutoencoderKLHunyuanVideo) — Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
- text_encoder_2 (CLIPTextModel) — CLIP, specifically the clip-vit-large-patch14 variant.
- tokenizer_2 (CLIPTokenizer) — Tokenizer of class CLIPTokenizer.
Pipeline for text-to-video generation using HunyuanVideo.
This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
__call__
(
    prompt: typing.Union[str, typing.List[str]] = None
    prompt_2: typing.Union[str, typing.List[str]] = None
    negative_prompt: typing.Union[str, typing.List[str]] = None
    negative_prompt_2: typing.Union[str, typing.List[str]] = None
    height: int = 720
    width: int = 1280
    num_frames: int = 129
    num_inference_steps: int = 50
    sigmas: typing.List[float] = None
    true_cfg_scale: float = 1.0
    guidance_scale: float = 6.0
    num_videos_per_prompt: typing.Optional[int] = 1
    generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None
    latents: typing.Optional[torch.Tensor] = None
    prompt_embeds: typing.Optional[torch.Tensor] = None
    pooled_prompt_embeds: typing.Optional[torch.Tensor] = None
    prompt_attention_mask: typing.Optional[torch.Tensor] = None
    negative_prompt_embeds: typing.Optional[torch.Tensor] = None
    negative_pooled_prompt_embeds: typing.Optional[torch.Tensor] = None
    negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None
    output_type: typing.Optional[str] = 'pil'
    return_dict: bool = True
    attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None
    callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None
    callback_on_step_end_tensor_inputs: typing.List[str] = ['latents']
    prompt_template: typing.Dict[str, typing.Any] = {'template': '<|start_header_id|>system<|end_header_id|>\n\nDescribe the video by detailing the following aspects: 1. The main content and theme of the video.2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects.3. Actions, events, behaviors temporal relationships, physical movement changes of the objects.4. background environment, light, style and atmosphere.5. camera angles, movements, and transitions used in the video:<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n{}<|eot_id|>', 'crop_start': 95}
    max_sequence_length: int = 256
) → ~HunyuanVideoPipelineOutput or tuple
Parameters
- prompt (str or List[str], optional) — The prompt or prompts to guide the video generation. If not defined, one has to pass prompt_embeds instead.
- prompt_2 (str or List[str], optional) — The prompt or prompts to be sent to tokenizer_2 and text_encoder_2. If not defined, prompt is used instead.
- negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the video generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if true_cfg_scale is not greater than 1).
- negative_prompt_2 (str or List[str], optional) — The prompt or prompts not to guide the video generation, to be sent to tokenizer_2 and text_encoder_2. If not defined, negative_prompt is used for all text encoders.
- height (int, defaults to 720) — The height in pixels of the generated video.
- width (int, defaults to 1280) — The width in pixels of the generated video.
- num_frames (int, defaults to 129) — The number of frames in the generated video.
- num_inference_steps (int, defaults to 50) — The number of denoising steps. More denoising steps usually lead to a higher quality video at the expense of slower inference.
- sigmas (List[float], optional) — Custom sigmas to use for the denoising process with schedulers which support a sigmas argument in their set_timesteps method. If not defined, the default behavior when num_inference_steps is passed will be used.
- true_cfg_scale (float, optional, defaults to 1.0) — When greater than 1.0 and a negative_prompt is provided, enables true classifier-free guidance.
- guidance_scale (float, defaults to 6.0) — Guidance scale as defined in Classifier-Free Diffusion Guidance. guidance_scale is defined as w of equation 2 of the Imagen paper. Guidance scale is enabled by setting guidance_scale > 1. A higher guidance scale encourages the model to generate videos that are closely linked to the text prompt, usually at the expense of lower quality. Note that the official HunyuanVideo model is CFG-distilled, which means that traditional guidance between unconditional and conditional latents is not applied.
- num_videos_per_prompt (int, optional, defaults to 1) — The number of videos to generate per prompt.
- generator (torch.Generator or List[torch.Generator], optional) — A torch.Generator to make generation deterministic.
- latents (torch.Tensor, optional) — Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random generator.
- prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the prompt input argument.
- pooled_prompt_embeds (torch.FloatTensor, optional) — Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled text embeddings will be generated from the prompt input argument.
- negative_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from the negative_prompt input argument.
- negative_pooled_prompt_embeds (torch.FloatTensor, optional) — Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, pooled negative_prompt_embeds will be generated from the negative_prompt input argument.
- output_type (str, optional, defaults to "pil") — The output format of the generated video. Choose between PIL.Image or np.array.
- return_dict (bool, optional, defaults to True) — Whether or not to return a HunyuanVideoPipelineOutput instead of a plain tuple.
- attention_kwargs (dict, optional) — A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
- clip_skip (int, optional) — Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
- callback_on_step_end (Callable, PipelineCallback, MultiPipelineCallbacks, optional) — A function or a subclass of PipelineCallback or MultiPipelineCallbacks that is called at the end of each denoising step during inference, with the following arguments: callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict). callback_kwargs will include a list of all tensors as specified by callback_on_step_end_tensor_inputs.
- callback_on_step_end_tensor_inputs (List, optional) — The list of tensor inputs for the callback_on_step_end function. The tensors specified in the list will be passed as the callback_kwargs argument. You will only be able to include variables listed in the ._callback_tensor_inputs attribute of your pipeline class.
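The condition described for true_cfg_scale, and the usual way classifier-free guidance mixes two predictions, can be sketched in plain Python. This shows the standard formulation under stated assumptions and is not the pipeline's internal code: uses_true_cfg and combine_cfg are hypothetical names, and plain lists of floats stand in for tensors.

```python
from typing import List, Optional


def uses_true_cfg(true_cfg_scale: float, negative_prompt: Optional[str]) -> bool:
    """True CFG is active only with a scale above 1.0 and a negative prompt."""
    return true_cfg_scale > 1.0 and negative_prompt is not None


def combine_cfg(cond: List[float], uncond: List[float], scale: float) -> List[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]
```

With scale set to 1.0 the combination reduces to the conditional prediction, which is why a negative prompt has no effect unless true_cfg_scale is greater than 1.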
Returns
~HunyuanVideoPipelineOutput or tuple
If return_dict is True, HunyuanVideoPipelineOutput is returned, otherwise a tuple is returned whose first element holds the generated video frames.
The call function to the pipeline for generation.
Examples:
>>> import torch
>>> from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
>>> from diffusers.utils import export_to_video

>>> model_id = "hunyuanvideo-community/HunyuanVideo"
>>> transformer = HunyuanVideoTransformer3DModel.from_pretrained(
...     model_id, subfolder="transformer", torch_dtype=torch.bfloat16
... )
>>> pipe = HunyuanVideoPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.float16)
>>> pipe.vae.enable_tiling()
>>> pipe.to("cuda")

>>> output = pipe(
...     prompt="A cat walks on the grass, realistic",
...     height=320,
...     width=512,
...     num_frames=61,
...     num_inference_steps=30,
... ).frames[0]
>>> export_to_video(output, "output.mp4", fps=15)
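A function passed as callback_on_step_end receives the arguments listed in the parameter description: the pipeline, the step index, the timestep and callback_kwargs. The sketch below records each step. log_step and step_log are hypothetical names, and returning callback_kwargs unchanged is an assumption based on the usual diffusers callback convention.

```python
from typing import Any, Dict, List

# Collects the step indices seen during denoising.
step_log: List[int] = []


def log_step(
    pipeline: Any, step: int, timestep: int, callback_kwargs: Dict[str, Any]
) -> Dict[str, Any]:
    """Record the step index and hand the tensors back unchanged."""
    step_log.append(step)
    return callback_kwargs


# Hypothetical usage with a loaded pipeline:
# pipe(prompt="...", callback_on_step_end=log_step,
#      callback_on_step_end_tensor_inputs=["latents"])
```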
disable_vae_slicing
( )
Disable sliced VAE decoding. If enable_vae_slicing was previously enabled, this method will go back to computing decoding in one step.
disable_vae_tiling
( )
Disable tiled VAE decoding. If enable_vae_tiling was previously enabled, this method will go back to computing decoding in one step.
enable_vae_slicing
( )
Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
enable_vae_tiling
( )
Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow processing larger images.
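The ideas behind slicing and tiling can be illustrated without any tensors. This is a conceptual sketch only, not the VAE's implementation: decode_in_slices and tile_spans are hypothetical helpers, and the tile size and overlap values are arbitrary.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def decode_in_slices(batch: Sequence[T], decode_one: Callable[[T], R]) -> List[R]:
    """Slicing: decode one sample at a time instead of the whole batch at once."""
    return [decode_one(sample) for sample in batch]


def tile_spans(size: int, tile: int, overlap: int) -> List[Tuple[int, int]]:
    """Tiling: return (start, end) spans of length <= tile that cover [0, size)."""
    if size <= 0 or tile <= 0 or not 0 <= overlap < tile:
        raise ValueError("require size > 0, tile > 0 and 0 <= overlap < tile")
    stride = tile - overlap
    spans: List[Tuple[int, int]] = []
    start = 0
    while True:
        end = min(start + tile, size)
        spans.append((start, end))
        if end >= size:
            break
        start += stride
    return spans
```

Each span would be processed on its own, so peak memory depends on the tile size rather than on the full frame. Overlapping regions leave room for blending at the seams.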
HunyuanVideoPipelineOutput
class diffusers.pipelines.hunyuan_video.pipeline_output.HunyuanVideoPipelineOutput
( frames: Tensor )
Parameters
- frames (torch.Tensor, np.ndarray, or List[List[PIL.Image.Image]]) — List of video outputs. It can be a nested list of length batch_size, with each sub-list containing denoised PIL image sequences of length num_frames. It can also be a NumPy array or Torch tensor of shape (batch_size, num_frames, channels, height, width).
Output class for HunyuanVideo pipelines.
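The nested-list layout of frames can be pictured with a stand-in object. FakeVideoOutput, make_fake_output and duration_seconds are hypothetical and mirror only the frames field. Strings stand in for PIL images.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeVideoOutput:
    """Hypothetical stand-in that mirrors only the `frames` field."""

    frames: List[List[str]]


def make_fake_output(batch_size: int, num_frames: int) -> FakeVideoOutput:
    """Build a nested list: one sub-list of frame placeholders per video."""
    return FakeVideoOutput(
        frames=[
            [f"video{b}_frame{f}" for f in range(num_frames)]
            for b in range(batch_size)
        ]
    )


def duration_seconds(num_frames: int, fps: int) -> float:
    """Playback length of a clip exported at the given frame rate."""
    return num_frames / fps


output = make_fake_output(batch_size=2, num_frames=61)
first_video = output.frames[0]  # the frames of the first video in the batch
```

Indexing frames[0] selects the first video in the batch, which is why the examples pass .frames[0] to export_to_video. At fps=15, a 61-frame clip lasts a little over four seconds.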
