jbilcke-hf (HF staff) committed on
Commit
c8cb798
·
1 Parent(s): 98d3630

working on the new Preview tab

docs/diffusers/Load schedulers and models in Diffusers.md ADDED
@@ -0,0 +1,199 @@

Load schedulers and models
==========================

![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)

![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)

Diffusion pipelines are a collection of interchangeable schedulers and models that can be mixed and matched to tailor a pipeline to a specific use case. The scheduler encapsulates the entire denoising process, such as the number of denoising steps and the algorithm for finding the denoised sample. A scheduler is not parameterized or trained, so it doesn’t take very much memory. The model is usually only concerned with the forward pass of going from a noisy input to a less noisy sample.

This guide will show you how to load schedulers and models to customize a pipeline. You’ll use the [stable-diffusion-v1-5/stable-diffusion-v1-5](https://hf.co/stable-diffusion-v1-5/stable-diffusion-v1-5) checkpoint throughout this guide, so let’s load it first.

```py
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
```

You can see what scheduler this pipeline uses with the `pipeline.scheduler` attribute.

```py
pipeline.scheduler
PNDMScheduler {
  "_class_name": "PNDMScheduler",
  "_diffusers_version": "0.21.4",
  "beta_end": 0.012,
  "beta_schedule": "scaled_linear",
  "beta_start": 0.00085,
  "clip_sample": false,
  "num_train_timesteps": 1000,
  "set_alpha_to_one": false,
  "skip_prk_steps": true,
  "steps_offset": 1,
  "timestep_spacing": "leading",
  "trained_betas": null
}
```

Load a scheduler
----------------

Schedulers are defined by a configuration file that can be used by a variety of schedulers. Load a scheduler with the [SchedulerMixin.from_pretrained()](/docs/diffusers/main/en/api/schedulers/overview#diffusers.SchedulerMixin.from_pretrained) method, and specify the `subfolder` parameter to load the configuration file from the correct subfolder of the pipeline repository.

For example, to load the [DDIMScheduler](/docs/diffusers/main/en/api/schedulers/ddim#diffusers.DDIMScheduler):

```py
from diffusers import DDIMScheduler, DiffusionPipeline

ddim = DDIMScheduler.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", subfolder="scheduler")
```

Then you can pass the newly loaded scheduler to the pipeline.

```py
pipeline = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", scheduler=ddim, torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
```

Compare schedulers
------------------

Schedulers have their own unique strengths and weaknesses, making it difficult to quantitatively compare which scheduler works best for a pipeline. You typically have to make a trade-off between denoising speed and denoising quality. We recommend trying out different schedulers to find one that works best for your use case. Check the `pipeline.scheduler.compatibles` attribute to see which schedulers are compatible with a pipeline.
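
For instance, you can list the compatible scheduler classes for the pipeline loaded above (a minimal sketch; `pipeline` is the Stable Diffusion pipeline from the previous section):

```py
# Print the scheduler classes that can be swapped into this pipeline
for scheduler_class in pipeline.scheduler.compatibles:
    print(scheduler_class.__name__)
```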

Let’s compare the [LMSDiscreteScheduler](/docs/diffusers/main/en/api/schedulers/lms_discrete#diffusers.LMSDiscreteScheduler), [EulerDiscreteScheduler](/docs/diffusers/main/en/api/schedulers/euler#diffusers.EulerDiscreteScheduler), [EulerAncestralDiscreteScheduler](/docs/diffusers/main/en/api/schedulers/euler_ancestral#diffusers.EulerAncestralDiscreteScheduler), and the [DPMSolverMultistepScheduler](/docs/diffusers/main/en/api/schedulers/multistep_dpm_solver#diffusers.DPMSolverMultistepScheduler) on the following prompt and seed.

```py
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

prompt = "A photograph of an astronaut riding a horse on Mars, high resolution, high definition."
generator = torch.Generator(device="cuda").manual_seed(8)
```

To change the pipeline’s scheduler, use the [from_config()](/docs/diffusers/main/en/api/configuration#diffusers.ConfigMixin.from_config) method to load a different scheduler’s `pipeline.scheduler.config` into the pipeline. For example, the [LMSDiscreteScheduler](/docs/diffusers/main/en/api/schedulers/lms_discrete#diffusers.LMSDiscreteScheduler) typically generates higher quality images than the default scheduler.

```py
from diffusers import LMSDiscreteScheduler

pipeline.scheduler = LMSDiscreteScheduler.from_config(pipeline.scheduler.config)
image = pipeline(prompt, generator=generator).images[0]
image
```

![](https://huggingface.co/datasets/patrickvonplaten/images/resolve/main/diffusers_docs/astronaut_lms.png)

LMSDiscreteScheduler

![](https://huggingface.co/datasets/patrickvonplaten/images/resolve/main/diffusers_docs/astronaut_euler_discrete.png)

EulerDiscreteScheduler

![](https://huggingface.co/datasets/patrickvonplaten/images/resolve/main/diffusers_docs/astronaut_euler_ancestral.png)

EulerAncestralDiscreteScheduler

![](https://huggingface.co/datasets/patrickvonplaten/images/resolve/main/diffusers_docs/astronaut_dpm.png)

DPMSolverMultistepScheduler

Most images look very similar and are comparable in quality. Again, it often comes down to your specific use case, so a good approach is to run multiple different schedulers and compare the results.
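
If you prefer to script that comparison rather than swap schedulers by hand, a small loop like the sketch below (an illustrative example that reuses the `pipeline` and `prompt` defined above; the output filenames are arbitrary) loads each scheduler with `from_config()` and saves one image per scheduler with the same seed:

```py
import torch
from diffusers import (
    DPMSolverMultistepScheduler,
    EulerAncestralDiscreteScheduler,
    EulerDiscreteScheduler,
    LMSDiscreteScheduler,
)

schedulers = {
    "lms": LMSDiscreteScheduler,
    "euler": EulerDiscreteScheduler,
    "euler_ancestral": EulerAncestralDiscreteScheduler,
    "dpm_multistep": DPMSolverMultistepScheduler,
}

for name, scheduler_class in schedulers.items():
    # Only the scheduler changes; the config, prompt, and seed stay the same
    pipeline.scheduler = scheduler_class.from_config(pipeline.scheduler.config)
    generator = torch.Generator(device="cuda").manual_seed(8)
    image = pipeline(prompt, generator=generator).images[0]
    image.save(f"astronaut_{name}.png")
```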

### Flax schedulers

To compare Flax schedulers, you need to additionally load the scheduler state into the model parameters. For example, let’s change the default scheduler in [FlaxStableDiffusionPipeline](/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img#diffusers.FlaxStableDiffusionPipeline) to use the super fast `FlaxDPMSolverMultistepScheduler`.

The `FlaxLMSDiscreteScheduler` and `FlaxDDPMScheduler` are not compatible with the [FlaxStableDiffusionPipeline](/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img#diffusers.FlaxStableDiffusionPipeline) yet.

```py
import jax
import numpy as np
from flax.jax_utils import replicate
from flax.training.common_utils import shard
from diffusers import FlaxStableDiffusionPipeline, FlaxDPMSolverMultistepScheduler

scheduler, scheduler_state = FlaxDPMSolverMultistepScheduler.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    subfolder="scheduler"
)
pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    scheduler=scheduler,
    variant="bf16",
    dtype=jax.numpy.bfloat16,
)
params["scheduler"] = scheduler_state
```

Then you can take advantage of Flax’s compatibility with TPUs to generate a number of images in parallel. You’ll need to make a copy of the model parameters for each available device and then split the inputs across them to generate your desired number of images.

```py
# Generate 1 image per parallel device (8 on TPUv2-8 or TPUv3-8)
prompt = "A photograph of an astronaut riding a horse on Mars, high resolution, high definition."
num_samples = jax.device_count()
prompt_ids = pipeline.prepare_inputs([prompt] * num_samples)

prng_seed = jax.random.PRNGKey(0)
num_inference_steps = 25

# shard inputs and rng
params = replicate(params)
prng_seed = jax.random.split(prng_seed, jax.device_count())
prompt_ids = shard(prompt_ids)

images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:])))
```

Models
------

Models are loaded from the [ModelMixin.from_pretrained()](/docs/diffusers/main/en/api/models/overview#diffusers.ModelMixin.from_pretrained) method, which downloads and caches the latest version of the model weights and configurations. If the latest files are available in the local cache, [from_pretrained()](/docs/diffusers/main/en/api/models/overview#diffusers.ModelMixin.from_pretrained) reuses files in the cache instead of re-downloading them.

Models can be loaded from a subfolder with the `subfolder` argument. For example, the model weights for [stable-diffusion-v1-5/stable-diffusion-v1-5](https://hf.co/stable-diffusion-v1-5/stable-diffusion-v1-5) are stored in the [unet](https://hf.co/stable-diffusion-v1-5/stable-diffusion-v1-5/tree/main/unet) subfolder.

```py
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", subfolder="unet", use_safetensors=True)
```

They can also be directly loaded from a [repository](https://huggingface.co/google/ddpm-cifar10-32/tree/main).

```py
from diffusers import UNet2DModel

unet = UNet2DModel.from_pretrained("google/ddpm-cifar10-32", use_safetensors=True)
```

To load and save model variants, specify the `variant` argument in [ModelMixin.from_pretrained()](/docs/diffusers/main/en/api/models/overview#diffusers.ModelMixin.from_pretrained) and [ModelMixin.save_pretrained()](/docs/diffusers/main/en/api/models/overview#diffusers.ModelMixin.save_pretrained).

```py
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", subfolder="unet", variant="non_ema", use_safetensors=True
)
unet.save_pretrained("./local-unet", variant="non_ema")
```

docs/diffusers/Loading pipelines in Diffusers.md ADDED
@@ -0,0 +1,528 @@

Load pipelines
==============

![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)

![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)

Diffusion systems consist of multiple components like parameterized models and schedulers that interact in complex ways. That is why we designed the [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline) to wrap the complexity of the entire diffusion system into an easy-to-use API. At the same time, the [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline) is entirely customizable so you can modify each component to build a diffusion system for your use case.

This guide will show you how to load:

* pipelines from the Hub and locally
* different components into a pipeline
* multiple pipelines without increasing memory usage
* checkpoint variants such as different floating point types or non-exponential mean averaged (EMA) weights

Load a pipeline
---------------

Skip to the [DiffusionPipeline explained](#diffusionpipeline-explained) section if you’re interested in an explanation about how the [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline) class works.

There are two ways to load a pipeline for a task:

1. Load the generic [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline) class and allow it to automatically detect the correct pipeline class from the checkpoint.
2. Load a specific pipeline class for a specific task.

The [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline) class is a simple and generic way to load the latest trending diffusion model from the [Hub](https://huggingface.co/models?library=diffusers&sort=trending). It uses the [from_pretrained()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pretrained) method to automatically detect the correct pipeline class for a task from the checkpoint, downloads and caches all the required configuration and weight files, and returns a pipeline ready for inference.

```py
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True)
```

This same checkpoint can also be used for an image-to-image task. The [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline) class can handle any task as long as you provide the appropriate inputs. For example, for an image-to-image task, you need to pass an initial image to the pipeline.

```py
from diffusers import DiffusionPipeline
from diffusers.utils import load_image

pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True)

init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png")
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipeline(prompt, image=init_image).images[0]
```

Use the Space embedded in the online version of this page to gauge a pipeline’s memory requirements before you download and load it to see if it runs on your hardware.
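
As a rough local alternative to that Space (a sketch, not the Space itself), you can sum the parameter sizes of a pipeline once it is loaded to see how much memory its weights occupy:

```py
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True)

def pipeline_weight_size_gb(pipe):
    # Add up the parameter bytes of every torch.nn.Module component in the pipeline
    total_bytes = 0
    for name, component in pipe.components.items():
        if isinstance(component, torch.nn.Module):
            total_bytes += sum(p.numel() * p.element_size() for p in component.parameters())
    return total_bytes / 1024**3

print(f"Approximate weight size: {pipeline_weight_size_gb(pipeline):.2f} GB")
```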

### Local pipeline

To load a pipeline locally, use [git-lfs](https://git-lfs.github.com/) to manually download a checkpoint to your local disk.

```bash
git-lfs install
git clone https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5
```

This creates a local folder, ./stable-diffusion-v1-5, on your disk and you should pass its path to [from_pretrained()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pretrained).

```py
from diffusers import DiffusionPipeline

stable_diffusion = DiffusionPipeline.from_pretrained("./stable-diffusion-v1-5", use_safetensors=True)
```

The [from_pretrained()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pretrained) method won’t download files from the Hub when it detects a local path, but this also means it won’t download and cache the latest changes to a checkpoint.

Customize a pipeline
--------------------

You can customize a pipeline by loading different components into it. This is important because you can:

* change to a scheduler with faster generation speed or higher generation quality depending on your needs (check the `pipeline.scheduler.compatibles` attribute on your pipeline to see compatible schedulers)
* change a default pipeline component to a newer and better performing one

For example, let’s customize the default [stabilityai/stable-diffusion-xl-base-1.0](https://hf.co/stabilityai/stable-diffusion-xl-base-1.0) checkpoint with:

* The [HeunDiscreteScheduler](/docs/diffusers/main/en/api/schedulers/heun#diffusers.HeunDiscreteScheduler) to generate higher quality images at the expense of slower generation speed. You must pass the `subfolder="scheduler"` parameter in [from_pretrained()](/docs/diffusers/main/en/api/schedulers/overview#diffusers.SchedulerMixin.from_pretrained) to load the scheduler configuration from the correct [subfolder](https://hf.co/stabilityai/stable-diffusion-xl-base-1.0/tree/main/scheduler) of the pipeline repository.
* A more stable VAE that runs in fp16.

```py
from diffusers import StableDiffusionXLPipeline, HeunDiscreteScheduler, AutoencoderKL
import torch

scheduler = HeunDiscreteScheduler.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", subfolder="scheduler")
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
```

Now pass the new scheduler and VAE to the [StableDiffusionXLPipeline](/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline).

```py
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    scheduler=scheduler,
    vae=vae,
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True
).to("cuda")
```

Reuse a pipeline
----------------

When you load multiple pipelines that share the same model components, it makes sense to reuse the shared components instead of reloading everything into memory again, especially if your hardware is memory-constrained. For example:

1. You generated an image with the [StableDiffusionPipeline](/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline) but you want to improve its quality with the [StableDiffusionSAGPipeline](/docs/diffusers/main/en/api/pipelines/self_attention_guidance#diffusers.StableDiffusionSAGPipeline). Both of these pipelines share the same pretrained model, so it’d be a waste of memory to load the same model twice.
2. You want to add a model component, like a [`MotionAdapter`](../api/pipelines/animatediff#animatediffpipeline), to [AnimateDiffPipeline](/docs/diffusers/main/en/api/pipelines/animatediff#diffusers.AnimateDiffPipeline) which was instantiated from an existing [StableDiffusionPipeline](/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline). Again, both pipelines share the same pretrained model, so it’d be a waste of memory to load an entirely new pipeline again.

With the [DiffusionPipeline.from_pipe()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pipe) API, you can switch between multiple pipelines to take advantage of their different features without increasing memory usage. It is similar to turning on and off a feature in your pipeline.

To switch between tasks (rather than features), use the [from_pipe()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pipe) method with the [AutoPipeline](../api/pipelines/auto_pipeline) class, which automatically identifies the pipeline class based on the task (learn more in the [AutoPipeline](../tutorials/autopipeline) tutorial).
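
For example, a minimal sketch of switching from text-to-image to image-to-image with the AutoPipeline classes (assuming the same Stable Diffusion checkpoint used throughout this guide):

```py
import torch
from diffusers import AutoPipelineForText2Image, AutoPipelineForImage2Image

pipe_t2i = AutoPipelineForText2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

# Reuse the already-loaded components for an image-to-image pipeline instead of reloading them
pipe_i2i = AutoPipelineForImage2Image.from_pipe(pipe_t2i)
```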
117
+
118
+ Let’s start with a [StableDiffusionPipeline](/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline) and then reuse the loaded model components to create a [StableDiffusionSAGPipeline](/docs/diffusers/main/en/api/pipelines/self_attention_guidance#diffusers.StableDiffusionSAGPipeline) to increase generation quality. You’ll use the [StableDiffusionPipeline](/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline) with an [IP-Adapter](./ip_adapter) to generate a bear eating pizza.
119
+
120
+ Copied
121
+
122
+ from diffusers import DiffusionPipeline, StableDiffusionSAGPipeline
123
+ import torch
124
+ import gc
125
+ from diffusers.utils import load\_image
126
+ from accelerate.utils import compute\_module\_sizes
127
+
128
+ image = load\_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/load\_neg\_embed.png")
129
+
130
+ pipe\_sd = DiffusionPipeline.from\_pretrained("SG161222/Realistic\_Vision\_V6.0\_B1\_noVAE", torch\_dtype=torch.float16)
131
+ pipe\_sd.load\_ip\_adapter("h94/IP-Adapter", subfolder="models", weight\_name="ip-adapter\_sd15.bin")
132
+ pipe\_sd.set\_ip\_adapter\_scale(0.6)
133
+ pipe\_sd.to("cuda")
134
+
135
+ generator = torch.Generator(device="cpu").manual\_seed(33)
136
+ out\_sd = pipe\_sd(
137
+ prompt="bear eats pizza",
138
+ negative\_prompt="wrong white balance, dark, sketches,worst quality,low quality",
139
+ ip\_adapter\_image=image,
140
+ num\_inference\_steps=50,
141
+ generator=generator,
142
+ ).images\[0\]
143
+ out\_sd
144
+
145
+ ![](https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/from_pipe_out_sd_0.png)
146
+
147
+ For reference, you can check how much memory this process consumed.
148
+
149
+ Copied
150
+
151
+ def bytes\_to\_giga\_bytes(bytes):
152
+ return bytes / 1024 / 1024 / 1024
153
+ print(f"Max memory allocated: {bytes\_to\_giga\_bytes(torch.cuda.max\_memory\_allocated())} GB")
154
+ "Max memory allocated: 4.406213283538818 GB"
155
+
156
+ Now, reuse the same pipeline components from [StableDiffusionPipeline](/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline) in [StableDiffusionSAGPipeline](/docs/diffusers/main/en/api/pipelines/self_attention_guidance#diffusers.StableDiffusionSAGPipeline) with the [from\_pipe()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pipe) method.
157
+
158
+ Some pipeline methods may not function properly on new pipelines created with [from\_pipe()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pipe). For instance, the [enable\_model\_cpu\_offload()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.enable_model_cpu_offload) method installs hooks on the model components based on a unique offloading sequence for each pipeline. If the models are executed in a different order in the new pipeline, the CPU offloading may not work correctly.
159
+
160
+ To ensure everything works as expected, we recommend re-applying a pipeline method on a new pipeline created with [from\_pipe()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pipe).
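
For instance, if the source pipeline relied on CPU offloading, a reasonable pattern (a sketch, not taken from the original guide) is to call the method again on the pipeline returned by [from_pipe()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pipe):

```py
# Re-apply the offloading hooks on the new pipeline so they match its own execution order
new_pipe = StableDiffusionSAGPipeline.from_pipe(pipe_sd)
new_pipe.enable_model_cpu_offload()
```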

```py
pipe_sag = StableDiffusionSAGPipeline.from_pipe(
    pipe_sd
)

generator = torch.Generator(device="cpu").manual_seed(33)
out_sag = pipe_sag(
    prompt="bear eats pizza",
    negative_prompt="wrong white balance, dark, sketches,worst quality,low quality",
    ip_adapter_image=image,
    num_inference_steps=50,
    generator=generator,
    guidance_scale=1.0,
    sag_scale=0.75
).images[0]
out_sag
```

![](https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/from_pipe_out_sag_1.png)

If you check the memory usage, you’ll see it remains the same as before because [StableDiffusionPipeline](/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline) and [StableDiffusionSAGPipeline](/docs/diffusers/main/en/api/pipelines/self_attention_guidance#diffusers.StableDiffusionSAGPipeline) are sharing the same pipeline components. This allows you to use them interchangeably without any additional memory overhead.

```py
print(f"Max memory allocated: {bytes_to_giga_bytes(torch.cuda.max_memory_allocated())} GB")
"Max memory allocated: 4.406213283538818 GB"
```

Let’s animate the image with the [AnimateDiffPipeline](/docs/diffusers/main/en/api/pipelines/animatediff#diffusers.AnimateDiffPipeline) and also add a `MotionAdapter` module to the pipeline. For the [AnimateDiffPipeline](/docs/diffusers/main/en/api/pipelines/animatediff#diffusers.AnimateDiffPipeline), you need to unload the IP-Adapter first and reload it _after_ you’ve created your new pipeline (this only applies to the [AnimateDiffPipeline](/docs/diffusers/main/en/api/pipelines/animatediff#diffusers.AnimateDiffPipeline)).

```py
from diffusers import AnimateDiffPipeline, MotionAdapter, DDIMScheduler
from diffusers.utils import export_to_gif

pipe_sag.unload_ip_adapter()
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)

pipe_animate = AnimateDiffPipeline.from_pipe(pipe_sd, motion_adapter=adapter)
pipe_animate.scheduler = DDIMScheduler.from_config(pipe_animate.scheduler.config, beta_schedule="linear")
# load IP-Adapter and LoRA weights again
pipe_animate.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
pipe_animate.load_lora_weights("guoyww/animatediff-motion-lora-zoom-out", adapter_name="zoom-out")
pipe_animate.to("cuda")

generator = torch.Generator(device="cpu").manual_seed(33)
pipe_animate.set_adapters("zoom-out", adapter_weights=0.75)
out = pipe_animate(
    prompt="bear eats pizza",
    num_frames=16,
    num_inference_steps=50,
    ip_adapter_image=image,
    generator=generator,
).frames[0]
export_to_gif(out, "out_animate.gif")
```

![](https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/from_pipe_out_animate_3.gif)

The [AnimateDiffPipeline](/docs/diffusers/main/en/api/pipelines/animatediff#diffusers.AnimateDiffPipeline) is more memory-intensive and consumes 15GB of memory (see the [Memory usage of from_pipe](#memory-usage-of-from_pipe) section to learn what this means for your memory usage).

```py
print(f"Max memory allocated: {bytes_to_giga_bytes(torch.cuda.max_memory_allocated())} GB")
"Max memory allocated: 15.178664207458496 GB"
```

### Modify from_pipe components

Pipelines loaded with [from_pipe()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pipe) can be customized with different model components or methods. However, whenever you modify the _state_ of the model components, it affects all the other pipelines that share the same components. For example, if you call [unload_ip_adapter()](/docs/diffusers/main/en/api/loaders/ip_adapter#diffusers.loaders.IPAdapterMixin.unload_ip_adapter) on the [StableDiffusionSAGPipeline](/docs/diffusers/main/en/api/pipelines/self_attention_guidance#diffusers.StableDiffusionSAGPipeline), you won’t be able to use IP-Adapter with the [StableDiffusionPipeline](/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline) because it’s been removed from their shared components.

```py
pipe_sag.unload_ip_adapter()

generator = torch.Generator(device="cpu").manual_seed(33)
out_sd = pipe_sd(
    prompt="bear eats pizza",
    negative_prompt="wrong white balance, dark, sketches,worst quality,low quality",
    ip_adapter_image=image,
    num_inference_steps=50,
    generator=generator,
).images[0]
"AttributeError: 'NoneType' object has no attribute 'image_projection_layers'"
```

### Memory usage of from_pipe

The memory requirement of loading multiple pipelines with [from_pipe()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pipe) is determined by the pipeline with the highest memory usage, regardless of the number of pipelines you create.

| Pipeline | Memory usage (GB) |
|---|---|
| StableDiffusionPipeline | 4.400 |
| StableDiffusionSAGPipeline | 4.400 |
| AnimateDiffPipeline | 15.178 |

The [AnimateDiffPipeline](/docs/diffusers/main/en/api/pipelines/animatediff#diffusers.AnimateDiffPipeline) has the highest memory requirement, so the _total memory usage_ is based only on the [AnimateDiffPipeline](/docs/diffusers/main/en/api/pipelines/animatediff#diffusers.AnimateDiffPipeline). Your memory usage will not increase if you create additional pipelines as long as their memory requirements don’t exceed that of the [AnimateDiffPipeline](/docs/diffusers/main/en/api/pipelines/animatediff#diffusers.AnimateDiffPipeline). Each pipeline can be used interchangeably without any additional memory overhead.

Safety checker
--------------

Diffusers implements a [safety checker](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py) for Stable Diffusion models which can generate harmful content. The safety checker screens the generated output against known hardcoded not-safe-for-work (NSFW) content. If for whatever reason you’d like to disable the safety checker, pass `safety_checker=None` to the [from_pretrained()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pretrained) method.

```py
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", safety_checker=None, use_safetensors=True)
"""
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing `safety_checker=None`. Ensure that you abide by the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend keeping the safety filter enabled in all public-facing circumstances, disabling it only for use cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
"""
```

Checkpoint variants
-------------------

A checkpoint variant is usually a checkpoint whose weights are:

* Stored in a different floating point type, such as [torch.float16](https://pytorch.org/docs/stable/tensors.html#data-types), because it only requires half the bandwidth and storage to download. You can’t use this variant if you’re continuing training or using a CPU.
* Non-exponential mean averaged (EMA) weights which shouldn’t be used for inference. You should use this variant to continue finetuning a model.

When the checkpoints have identical model structures, but they were trained on different datasets and with a different training setup, they should be stored in separate repositories. For example, [stabilityai/stable-diffusion-2](https://hf.co/stabilityai/stable-diffusion-2) and [stabilityai/stable-diffusion-2-1](https://hf.co/stabilityai/stable-diffusion-2-1) are stored in separate repositories.

Otherwise, a variant is **identical** to the original checkpoint. They have exactly the same serialization format (like [safetensors](./using_safetensors)), model structure, and their weights have identical tensor shapes.

| **checkpoint type** | **weight name** | **argument for loading weights** |
|---|---|---|
| original | diffusion_pytorch_model.safetensors | |
| floating point | diffusion_pytorch_model.fp16.safetensors | `variant`, `torch_dtype` |
| non-EMA | diffusion_pytorch_model.non_ema.safetensors | `variant` |

There are two important arguments for loading variants:

* `torch_dtype` specifies the floating point precision of the loaded checkpoint. For example, if you want to save bandwidth by loading a fp16 variant, you should set `variant="fp16"` and `torch_dtype=torch.float16` to _convert the weights_ to fp16. Otherwise, the fp16 weights are converted to the default fp32 precision.

  If you only set `torch_dtype=torch.float16`, the default fp32 weights are downloaded first and then converted to fp16.

* `variant` specifies which files should be loaded from the repository. For example, if you want to load a non-EMA variant of a UNet from [stable-diffusion-v1-5/stable-diffusion-v1-5](https://hf.co/stable-diffusion-v1-5/stable-diffusion-v1-5/tree/main/unet), set `variant="non_ema"` to download the `non_ema` file.

To load a fp16 variant, for example:

```py
from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", variant="fp16", torch_dtype=torch.float16, use_safetensors=True
)
```

Use the `variant` parameter in the [DiffusionPipeline.save_pretrained()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.save_pretrained) method to save a checkpoint as a different floating point type or as a non-EMA variant. You should try to save a variant to the same folder as the original checkpoint, so you have the option of loading both from the same folder. For example, to save the fp16 variant:

```py
from diffusers import DiffusionPipeline

pipeline.save_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", variant="fp16")
```

If you don’t save the variant to an existing folder, you must specify the `variant` argument otherwise it’ll throw an `Exception` because it can’t find the original checkpoint.

```py
# 👎 this won't work
pipeline = DiffusionPipeline.from_pretrained(
    "./stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
)
# 👍 this works
pipeline = DiffusionPipeline.from_pretrained(
    "./stable-diffusion-v1-5", variant="fp16", torch_dtype=torch.float16, use_safetensors=True
)
```

DiffusionPipeline explained
---------------------------

As a class method, [DiffusionPipeline.from_pretrained()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pretrained) is responsible for two things:

* Download the latest version of the folder structure required for inference and cache it. If the latest folder structure is available in the local cache, [DiffusionPipeline.from_pretrained()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pretrained) reuses the cache and won’t redownload the files.
* Load the cached weights into the correct pipeline [class](../api/pipelines/overview#diffusers-summary) - retrieved from the `model_index.json` file - and return an instance of it.

The pipelines’ underlying folder structure corresponds directly with their class instances. For example, the [StableDiffusionPipeline](/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline) corresponds to the folder structure in [`stable-diffusion-v1-5/stable-diffusion-v1-5`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5).

```py
from diffusers import DiffusionPipeline

repo_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
pipeline = DiffusionPipeline.from_pretrained(repo_id, use_safetensors=True)
print(pipeline)
```

You’ll see pipeline is an instance of [StableDiffusionPipeline](/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline), which consists of seven components:

* `"feature_extractor"`: a [CLIPImageProcessor](https://huggingface.co/docs/transformers/main/en/model_doc/clip#transformers.CLIPImageProcessor) from 🤗 Transformers.
* `"safety_checker"`: a [component](https://github.com/huggingface/diffusers/blob/e55687e1e15407f60f32242027b7bb8170e58266/src/diffusers/pipelines/stable_diffusion/safety_checker.py#L32) for screening against harmful content.
* `"scheduler"`: an instance of [PNDMScheduler](/docs/diffusers/main/en/api/schedulers/pndm#diffusers.PNDMScheduler).
* `"text_encoder"`: a [CLIPTextModel](https://huggingface.co/docs/transformers/main/en/model_doc/clip#transformers.CLIPTextModel) from 🤗 Transformers.
* `"tokenizer"`: a [CLIPTokenizer](https://huggingface.co/docs/transformers/main/en/model_doc/clip#transformers.CLIPTokenizer) from 🤗 Transformers.
* `"unet"`: an instance of [UNet2DConditionModel](/docs/diffusers/main/en/api/models/unet2d-cond#diffusers.UNet2DConditionModel).
* `"vae"`: an instance of [AutoencoderKL](/docs/diffusers/main/en/api/models/autoencoderkl#diffusers.AutoencoderKL).

```
StableDiffusionPipeline {
  "feature_extractor": [
    "transformers",
    "CLIPImageProcessor"
  ],
  "safety_checker": [
    "stable_diffusion",
    "StableDiffusionSafetyChecker"
  ],
  "scheduler": [
    "diffusers",
    "PNDMScheduler"
  ],
  "text_encoder": [
    "transformers",
    "CLIPTextModel"
  ],
  "tokenizer": [
    "transformers",
    "CLIPTokenizer"
  ],
  "unet": [
    "diffusers",
    "UNet2DConditionModel"
  ],
  "vae": [
    "diffusers",
    "AutoencoderKL"
  ]
}
```

Compare the components of the pipeline instance to the [`stable-diffusion-v1-5/stable-diffusion-v1-5`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/tree/main) folder structure, and you’ll see there is a separate folder for each of the components in the repository:

```
.
├── feature_extractor
│   └── preprocessor_config.json
├── model_index.json
├── safety_checker
│   ├── config.json
│   ├── model.fp16.safetensors
│   ├── model.safetensors
│   ├── pytorch_model.bin
│   └── pytorch_model.fp16.bin
├── scheduler
│   └── scheduler_config.json
├── text_encoder
│   ├── config.json
│   ├── model.fp16.safetensors
│   ├── model.safetensors
│   ├── pytorch_model.bin
│   └── pytorch_model.fp16.bin
├── tokenizer
│   ├── merges.txt
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   └── vocab.json
├── unet
│   ├── config.json
│   ├── diffusion_pytorch_model.bin
│   ├── diffusion_pytorch_model.fp16.bin
│   ├── diffusion_pytorch_model.fp16.safetensors
│   ├── diffusion_pytorch_model.non_ema.bin
│   ├── diffusion_pytorch_model.non_ema.safetensors
│   └── diffusion_pytorch_model.safetensors
└── vae
    ├── config.json
    ├── diffusion_pytorch_model.bin
    ├── diffusion_pytorch_model.fp16.bin
    ├── diffusion_pytorch_model.fp16.safetensors
    └── diffusion_pytorch_model.safetensors
```

You can access each of the components of the pipeline as an attribute to view its configuration:

```py
pipeline.tokenizer
CLIPTokenizer(
    name_or_path="/root/.cache/huggingface/hub/models--runwayml--stable-diffusion-v1-5/snapshots/39593d5650112b4cc580433f6b0435385882d819/tokenizer",
    vocab_size=49408,
    model_max_length=77,
    is_fast=False,
    padding_side="right",
    truncation_side="right",
    special_tokens={
        "bos_token": AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True),
        "eos_token": AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True),
        "unk_token": AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True),
        "pad_token": "<|endoftext|>",
    },
    clean_up_tokenization_spaces=True
)
```

Every pipeline expects a [`model_index.json`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/blob/main/model_index.json) file that tells the [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline):

* which pipeline class to load from `_class_name`
* which version of 🧨 Diffusers was used to create the model in `_diffusers_version`
* what components from which library are stored in the subfolders (`name` corresponds to the component and subfolder name, `library` corresponds to the name of the library to load the class from, and `class` corresponds to the class name)

```json
{
  "_class_name": "StableDiffusionPipeline",
  "_diffusers_version": "0.6.0",
  "feature_extractor": [
    "transformers",
    "CLIPImageProcessor"
  ],
  "safety_checker": [
    "stable_diffusion",
    "StableDiffusionSafetyChecker"
  ],
  "scheduler": [
    "diffusers",
    "PNDMScheduler"
  ],
  "text_encoder": [
    "transformers",
    "CLIPTextModel"
  ],
  "tokenizer": [
    "transformers",
    "CLIPTokenizer"
  ],
  "unet": [
    "diffusers",
    "UNet2DConditionModel"
  ],
  "vae": [
    "diffusers",
    "AutoencoderKL"
  ]
}
```

docs/diffusers/Using Diffusers for CogVideoX.md ADDED
@@ -0,0 +1,683 @@

CogVideoX
=========

![LoRA](https://img.shields.io/badge/LoRA-d8b4fe?style=flat)

[CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://arxiv.org/abs/2408.06072) from Tsinghua University & ZhipuAI, by Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, Jie Tang.

The abstract from the paper is:

_We introduce CogVideoX, a large-scale diffusion transformer model designed for generating videos based on text prompts. To efficiently model video data, we propose to leverage a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions. To improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. By employing a progressive training technique, CogVideoX is adept at producing coherent, long-duration videos characterized by significant motion. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method. It significantly helps enhance the performance of CogVideoX, improving both generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across both multiple machine metrics and human evaluations. The model weight of CogVideoX-2B is publicly available at [https://github.com/THUDM/CogVideo](https://github.com/THUDM/CogVideo)._

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The original codebase can be found [here](https://huggingface.co/THUDM). The original weights can be found under [hf.co/THUDM](https://huggingface.co/THUDM).

There are three official CogVideoX checkpoints for text-to-video and video-to-video.

| checkpoints | recommended inference dtype |
|---|---|
| [`THUDM/CogVideoX-2b`](https://huggingface.co/THUDM/CogVideoX-2b) | torch.float16 |
| [`THUDM/CogVideoX-5b`](https://huggingface.co/THUDM/CogVideoX-5b) | torch.bfloat16 |
| [`THUDM/CogVideoX1.5-5b`](https://huggingface.co/THUDM/CogVideoX1.5-5b) | torch.bfloat16 |

There are two official CogVideoX checkpoints available for image-to-video.

| checkpoints | recommended inference dtype |
|---|---|
| [`THUDM/CogVideoX-5b-I2V`](https://huggingface.co/THUDM/CogVideoX-5b-I2V) | torch.bfloat16 |
| [`THUDM/CogVideoX-1.5-5b-I2V`](https://huggingface.co/THUDM/CogVideoX-1.5-5b-I2V) | torch.bfloat16 |

For the CogVideoX 1.5 series:

* Text-to-video (T2V) works best at a resolution of 1360x768 because it was trained with that specific resolution.
* Image-to-video (I2V) works for multiple resolutions. The width can vary from 768 to 1360, but the height must be 768. The height/width must be divisible by 16.
* Both T2V and I2V models support generation with 81 and 161 frames and work best at these values. Exporting videos at 16 FPS is recommended (a short sketch applying these settings follows this list).
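
As a concrete illustration of those recommendations, the sketch below (an illustrative example, not an official snippet; the input image path is a placeholder) runs the CogVideoX 1.5 image-to-video checkpoint listed above at a height of 768, a width of 1360, and 81 frames, then exports the result at 16 FPS:

```py
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-1.5-5b-I2V", torch_dtype=torch.bfloat16
).to("cuda")

image = load_image("path/to/your/image.png")  # placeholder: use your own conditioning image
prompt = "A scenic landscape slowly coming to life with drifting clouds"

video = pipe(
    prompt=prompt,
    image=image,
    height=768,     # CogVideoX 1.5 I2V expects a height of 768
    width=1360,     # width can range from 768 to 1360 and must be divisible by 16
    num_frames=81,  # one of the recommended frame counts
).frames[0]
export_to_video(video, "output.mp4", fps=16)
```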

There are two official CogVideoX checkpoints that support pose controllable generation (by the [Alibaba-PAI](https://huggingface.co/alibaba-pai) team).

| checkpoints | recommended inference dtype |
|---|---|
| [`alibaba-pai/CogVideoX-Fun-V1.1-2b-Pose`](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-2b-Pose) | torch.bfloat16 |
| [`alibaba-pai/CogVideoX-Fun-V1.1-5b-Pose`](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-5b-Pose) | torch.bfloat16 |

Inference
---------

Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.

First, load the pipeline:

```py
import torch
from diffusers import CogVideoXPipeline, CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b").to("cuda")  # or "THUDM/CogVideoX-2b"
```

If you are using the image-to-video pipeline, load it as follows:

```py
pipe = CogVideoXImageToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b-I2V").to("cuda")
```

Then change the memory layout of the pipeline’s `transformer` component to `torch.channels_last`:

```py
pipe.transformer.to(memory_format=torch.channels_last)
```

Compile the components and run inference:

```py
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

# CogVideoX works well with long and well-described prompts
prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
```

The [T2V benchmark](https://gist.github.com/a-r-r-o-w/5183d75e452a368fd17448fcc810bd3f) results on an 80GB A100 machine are:

```
Without torch.compile(): Average inference time: 96.89 seconds.
With torch.compile(): Average inference time: 76.27 seconds.
```

### Memory optimization

CogVideoX-2b requires about 19 GB of GPU memory to decode 49 frames (6 seconds of video at 8 FPS) with output resolution 720x480 (W x H), which makes it not possible to run on consumer GPUs or free-tier T4 Colab. The following memory optimizations could be used to reduce the memory footprint (a combined sketch follows the list). For replication, you can refer to [this](https://gist.github.com/a-r-r-o-w/3959a03f15be5c9bd1fe545b09dfcc93) script.

* `pipe.enable_model_cpu_offload()`:
  * Without enabling cpu offloading, memory usage is `33 GB`
  * With enabling cpu offloading, memory usage is `19 GB`
* `pipe.enable_sequential_cpu_offload()`:
  * Similar to `enable_model_cpu_offload` but can significantly reduce memory usage at the cost of slow inference
  * When enabled, memory usage is under `4 GB`
* `pipe.vae.enable_tiling()`:
  * With enabling cpu offloading and tiling, memory usage is `11 GB`
* `pipe.vae.enable_slicing()`
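
Putting those options together, a minimal sketch (illustrative only; choose the offloading strategy that matches your hardware) looks like this:

```py
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)

# Offload submodules to the CPU when they are not in use; swap in
# enable_sequential_cpu_offload() for an even smaller footprint at the cost of much slower inference
pipe.enable_model_cpu_offload()

# Decode the latents in tiles and slices to reduce peak VAE memory
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

prompt = "A panda playing an acoustic guitar in a serene bamboo forest"
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "panda.mp4", fps=8)
```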

Quantization
------------

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.

Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [CogVideoXPipeline](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.CogVideoXPipeline) for inference with bitsandbytes.

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, CogVideoXTransformer3DModel, CogVideoXPipeline
from diffusers.utils import export_to_video
from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel

quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "THUDM/CogVideoX-2b",
    subfolder="text_encoder",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-2b",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    text_encoder=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting."
video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "ship.mp4", fps=8)
```
166
+
167
+ [](#diffusers.CogVideoXPipeline)CogVideoXPipeline
168
+ -------------------------------------------------
169
+
170
+ ### class diffusers.CogVideoXPipeline
171
+
172
+ [](#diffusers.CogVideoXPipeline)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox.py#L147)
173
+
174
+ ( tokenizer: T5Tokenizertext\_encoder: T5EncoderModelvae: AutoencoderKLCogVideoXtransformer: CogVideoXTransformer3DModelscheduler: typing.Union\[diffusers.schedulers.scheduling\_ddim\_cogvideox.CogVideoXDDIMScheduler, diffusers.schedulers.scheduling\_dpm\_cogvideox.CogVideoXDPMScheduler\] )
175
+
176
+ Parameters
177
+
178
+ * [](#diffusers.CogVideoXPipeline.vae)**vae** ([AutoencoderKL](/docs/diffusers/main/en/api/models/autoencoderkl#diffusers.AutoencoderKL)) β€” Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
179
+ * [](#diffusers.CogVideoXPipeline.text_encoder)**text\_encoder** (`T5EncoderModel`) β€” Frozen text-encoder. CogVideoX uses [T5](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5EncoderModel); specifically the [t5-v1\_1-xxl](https://huggingface.co/PixArt-alpha/PixArt-alpha/tree/main/t5-v1_1-xxl) variant.
180
+ * [](#diffusers.CogVideoXPipeline.tokenizer)**tokenizer** (`T5Tokenizer`) β€” Tokenizer of class [T5Tokenizer](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Tokenizer).
181
+ * [](#diffusers.CogVideoXPipeline.transformer)**transformer** ([CogVideoXTransformer3DModel](/docs/diffusers/main/en/api/models/cogvideox_transformer3d#diffusers.CogVideoXTransformer3DModel)) β€” A text conditioned `CogVideoXTransformer3DModel` to denoise the encoded video latents.
182
+ * [](#diffusers.CogVideoXPipeline.scheduler)**scheduler** ([SchedulerMixin](/docs/diffusers/main/en/api/schedulers/overview#diffusers.SchedulerMixin)) β€” A scheduler to be used in combination with `transformer` to denoise the encoded video latents.
183
+
184
+ Pipeline for text-to-video generation using CogVideoX.
185
+
186
+ This model inherits from [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
187
+
188
+ #### \_\_call\_\_
189
+
190
+ [](#diffusers.CogVideoXPipeline.__call__)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox.py#L505)
191
+
192
+ ( prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, negative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, height: typing.Optional\[int\] = None, width: typing.Optional\[int\] = None, num\_frames: typing.Optional\[int\] = None, num\_inference\_steps: int = 50, timesteps: typing.Optional\[typing.List\[int\]\] = None, guidance\_scale: float = 6, use\_dynamic\_cfg: bool = False, num\_videos\_per\_prompt: int = 1, eta: float = 0.0, generator: typing.Union\[torch.\_C.Generator, typing.List\[torch.\_C.Generator\], NoneType\] = None, latents: typing.Optional\[torch.FloatTensor\] = None, prompt\_embeds: typing.Optional\[torch.FloatTensor\] = None, negative\_prompt\_embeds: typing.Optional\[torch.FloatTensor\] = None, output\_type: str = 'pil', return\_dict: bool = True, attention\_kwargs: typing.Optional\[typing.Dict\[str, typing.Any\]\] = None, callback\_on\_step\_end: typing.Union\[typing.Callable\[\[int, int, typing.Dict\], NoneType\], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType\] = None, callback\_on\_step\_end\_tensor\_inputs: typing.List\[str\] = \['latents'\], max\_sequence\_length: int = 226 ) → [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) or `tuple`
193
+
194
+
196
+ Parameters
197
+
198
+ * [](#diffusers.CogVideoXPipeline.__call__.prompt)**prompt** (`str` or `List[str]`, _optional_) — The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds` instead.
199
+ * [](#diffusers.CogVideoXPipeline.__call__.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
200
+ * [](#diffusers.CogVideoXPipeline.__call__.height)**height** (`int`, _optional_, defaults to self.transformer.config.sample\_height \* self.vae\_scale\_factor\_spatial) β€” The height in pixels of the generated image. This is set to 480 by default for the best results.
201
+ * [](#diffusers.CogVideoXPipeline.__call__.width)**width** (`int`, _optional_, defaults to self.transformer.config.sample\_width \* self.vae\_scale\_factor\_spatial) — The width in pixels of the generated image. This is set to 720 by default for the best results.
202
+ * [](#diffusers.CogVideoXPipeline.__call__.num_frames)**num\_frames** (`int`, defaults to `48`) β€” Number of frames to generate. Must be divisible by self.vae\_scale\_factor\_temporal. Generated video will contain 1 extra frame because CogVideoX is conditioned with (num\_seconds \* fps + 1) frames where num\_seconds is 6 and fps is 8. However, since videos can be saved at any fps, the only condition that needs to be satisfied is that of divisibility mentioned above.
203
+ * [](#diffusers.CogVideoXPipeline.__call__.num_inference_steps)**num\_inference\_steps** (`int`, _optional_, defaults to 50) β€” The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
204
+ * [](#diffusers.CogVideoXPipeline.__call__.timesteps)**timesteps** (`List[int]`, _optional_) β€” Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed will be used. Must be in descending order.
205
+ * [](#diffusers.CogVideoXPipeline.__call__.guidance_scale)**guidance\_scale** (`float`, _optional_, defaults to 6) — Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). `guidance_scale` is defined as `w` of equation 2 of the [Imagen Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages generating images that are closely linked to the text `prompt`, usually at the expense of lower image quality.
206
+ * [](#diffusers.CogVideoXPipeline.__call__.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” The number of videos to generate per prompt.
207
+ * [](#diffusers.CogVideoXPipeline.__call__.generator)**generator** (`torch.Generator` or `List[torch.Generator]`, _optional_) β€” One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation deterministic.
208
+ * [](#diffusers.CogVideoXPipeline.__call__.latents)**latents** (`torch.FloatTensor`, _optional_) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random `generator`.
209
+ * [](#diffusers.CogVideoXPipeline.__call__.prompt_embeds)**prompt\_embeds** (`torch.FloatTensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
210
+ * [](#diffusers.CogVideoXPipeline.__call__.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.FloatTensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
211
+ * [](#diffusers.CogVideoXPipeline.__call__.output_type)**output\_type** (`str`, _optional_, defaults to `"pil"`) — The output format of the generated image. Choose between [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
212
+ * [](#diffusers.CogVideoXPipeline.__call__.return_dict)**return\_dict** (`bool`, _optional_, defaults to `True`) — Whether or not to return a [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) instead of a plain tuple.
213
+ * [](#diffusers.CogVideoXPipeline.__call__.attention_kwargs)**attention\_kwargs** (`dict`, _optional_) β€” A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under `self.processor` in [diffusers.models.attention\_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
214
+ * [](#diffusers.CogVideoXPipeline.__call__.callback_on_step_end)**callback\_on\_step\_end** (`Callable`, _optional_) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
215
+ * [](#diffusers.CogVideoXPipeline.__call__.callback_on_step_end_tensor_inputs)**callback\_on\_step\_end\_tensor\_inputs** (`List`, _optional_) β€” The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
216
+ * [](#diffusers.CogVideoXPipeline.__call__.max_sequence_length)**max\_sequence\_length** (`int`, defaults to `226`) β€” Maximum sequence length in encoded prompt. Must be consistent with `self.transformer.config.max_text_seq_length` otherwise may lead to poor results.
217
+
218
+ Returns
219
+
220
+
222
+ [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) or `tuple`
223
+
224
+
226
+ [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) if `return_dict` is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated images.
227
+
228
+ Function invoked when calling the pipeline for generation.
229
+
230
+ [](#diffusers.CogVideoXPipeline.__call__.example)
231
+
232
+ Examples:
233
+
234
+ Copied
235
+
236
+ \>>> import torch
237
+ \>>> from diffusers import CogVideoXPipeline
238
+ \>>> from diffusers.utils import export\_to\_video
239
+
240
+ \>>> \# Models: "THUDM/CogVideoX-2b" or "THUDM/CogVideoX-5b"
241
+ \>>> pipe = CogVideoXPipeline.from\_pretrained("THUDM/CogVideoX-2b", torch\_dtype=torch.float16).to("cuda")
242
+ \>>> prompt = (
243
+ ... "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
244
+ ... "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
245
+ ... "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
246
+ ... "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
247
+ ... "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
248
+ ... "atmosphere of this unique musical performance."
249
+ ... )
250
+ \>>> video = pipe(prompt=prompt, guidance\_scale=6, num\_inference\_steps=50).frames\[0\]
251
+ \>>> export\_to\_video(video, "output.mp4", fps=8)
252
+
253
+ #### encode\_prompt
254
+
255
+ [](#diffusers.CogVideoXPipeline.encode_prompt)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox.py#L244)
256
+
257
+ ( prompt: typing.Union\[str, typing.List\[str\]\], negative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, do\_classifier\_free\_guidance: bool = True, num\_videos\_per\_prompt: int = 1, prompt\_embeds: typing.Optional\[torch.Tensor\] = None, negative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = None, max\_sequence\_length: int = 226, device: typing.Optional\[torch.device\] = None, dtype: typing.Optional\[torch.dtype\] = None )
258
+
259
+ Parameters
260
+
261
+ * [](#diffusers.CogVideoXPipeline.encode_prompt.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” prompt to be encoded
262
+ * [](#diffusers.CogVideoXPipeline.encode_prompt.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
263
+ * [](#diffusers.CogVideoXPipeline.encode_prompt.do_classifier_free_guidance)**do\_classifier\_free\_guidance** (`bool`, _optional_, defaults to `True`) β€” Whether to use classifier free guidance or not.
264
+ * [](#diffusers.CogVideoXPipeline.encode_prompt.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) — Number of videos that should be generated per prompt.
265
+ * [](#diffusers.CogVideoXPipeline.encode_prompt.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
266
+ * [](#diffusers.CogVideoXPipeline.encode_prompt.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
267
+ * [](#diffusers.CogVideoXPipeline.encode_prompt.device)**device** β€” (`torch.device`, _optional_): torch device
268
+ * [](#diffusers.CogVideoXPipeline.encode_prompt.dtype)**dtype** β€” (`torch.dtype`, _optional_): torch dtype
269
+
270
+ Encodes the prompt into text encoder hidden states.
271
+
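+ As a practical sketch, the embeddings can be pre-computed once and reused across several calls by passing them through `prompt_embeds` and `negative_prompt_embeds`. This assumes `encode_prompt()` returns the positive and negative embeddings as a tuple, as in other Diffusers pipelines; the prompt text is only illustrative.
+
+ Copied
+
+ import torch
+ from diffusers import CogVideoXPipeline
+
+ pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16).to("cuda")
+
+ # Encode the prompt once; classifier-free guidance needs both embeddings.
+ prompt_embeds, negative_prompt_embeds = pipe.encode_prompt(
+     prompt="A panda playing a tiny guitar in a bamboo forest",
+     negative_prompt="",
+     do_classifier_free_guidance=True,
+     num_videos_per_prompt=1,
+ )
+
+ # Reuse the cached embeddings for multiple generations with different seeds.
+ for seed in (0, 1):
+     video = pipe(
+         prompt_embeds=prompt_embeds,
+         negative_prompt_embeds=negative_prompt_embeds,
+         generator=torch.Generator(device="cuda").manual_seed(seed),
+     ).frames[0]
+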
272
+ #### fuse\_qkv\_projections
273
+
274
+ [](#diffusers.CogVideoXPipeline.fuse_qkv_projections)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox.py#L428)
275
+
276
+ ( )
277
+
278
+ Enables fused QKV projections.
279
+
280
+ #### unfuse\_qkv\_projections
281
+
282
+ [](#diffusers.CogVideoXPipeline.unfuse_qkv_projections)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox.py#L433)
283
+
284
+ ( )
285
+
286
+ Disable QKV projection fusion if enabled.
287
+
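+ A minimal sketch of toggling fusion around a generation is shown below. Whether fused projections actually speed things up depends on the attention backend and hardware, so treat it as an option to benchmark rather than a guaranteed optimization; the prompt is only illustrative.
+
+ Copied
+
+ import torch
+ from diffusers import CogVideoXPipeline
+
+ pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16).to("cuda")
+
+ # Fuse the query/key/value projections before running inference ...
+ pipe.fuse_qkv_projections()
+ video = pipe(prompt="A paper boat drifting down a rainy street", num_inference_steps=50).frames[0]
+
+ # ... and restore the original, unfused projections afterwards.
+ pipe.unfuse_qkv_projections()
+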
288
+ [](#diffusers.CogVideoXImageToVideoPipeline)CogVideoXImageToVideoPipeline
289
+ -------------------------------------------------------------------------
290
+
291
+ ### class diffusers.CogVideoXImageToVideoPipeline
292
+
293
+ [](#diffusers.CogVideoXImageToVideoPipeline)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py#L164)
294
+
295
+ ( tokenizer: T5Tokenizer, text\_encoder: T5EncoderModel, vae: AutoencoderKLCogVideoX, transformer: CogVideoXTransformer3DModel, scheduler: typing.Union\[diffusers.schedulers.scheduling\_ddim\_cogvideox.CogVideoXDDIMScheduler, diffusers.schedulers.scheduling\_dpm\_cogvideox.CogVideoXDPMScheduler\] )
296
+
297
+ Parameters
298
+
299
+ * [](#diffusers.CogVideoXImageToVideoPipeline.vae)**vae** ([AutoencoderKL](/docs/diffusers/main/en/api/models/autoencoderkl#diffusers.AutoencoderKL)) β€” Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
300
+ * [](#diffusers.CogVideoXImageToVideoPipeline.text_encoder)**text\_encoder** (`T5EncoderModel`) β€” Frozen text-encoder. CogVideoX uses [T5](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5EncoderModel); specifically the [t5-v1\_1-xxl](https://huggingface.co/PixArt-alpha/PixArt-alpha/tree/main/t5-v1_1-xxl) variant.
301
+ * [](#diffusers.CogVideoXImageToVideoPipeline.tokenizer)**tokenizer** (`T5Tokenizer`) β€” Tokenizer of class [T5Tokenizer](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Tokenizer).
302
+ * [](#diffusers.CogVideoXImageToVideoPipeline.transformer)**transformer** ([CogVideoXTransformer3DModel](/docs/diffusers/main/en/api/models/cogvideox_transformer3d#diffusers.CogVideoXTransformer3DModel)) β€” A text conditioned `CogVideoXTransformer3DModel` to denoise the encoded video latents.
303
+ * [](#diffusers.CogVideoXImageToVideoPipeline.scheduler)**scheduler** ([SchedulerMixin](/docs/diffusers/main/en/api/schedulers/overview#diffusers.SchedulerMixin)) β€” A scheduler to be used in combination with `transformer` to denoise the encoded video latents.
304
+
305
+ Pipeline for image-to-video generation using CogVideoX.
306
+
307
+ This model inherits from [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
308
+
309
+ #### \_\_call\_\_
310
+
311
+ [](#diffusers.CogVideoXImageToVideoPipeline.__call__)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py#L602)
312
+
313
+ ( image: typing.Union\[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List\[PIL.Image.Image\], typing.List\[numpy.ndarray\], typing.List\[torch.Tensor\]\], prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, negative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, height: typing.Optional\[int\] = None, width: typing.Optional\[int\] = None, num\_frames: int = 49, num\_inference\_steps: int = 50, timesteps: typing.Optional\[typing.List\[int\]\] = None, guidance\_scale: float = 6, use\_dynamic\_cfg: bool = False, num\_videos\_per\_prompt: int = 1, eta: float = 0.0, generator: typing.Union\[torch.\_C.Generator, typing.List\[torch.\_C.Generator\], NoneType\] = None, latents: typing.Optional\[torch.FloatTensor\] = None, prompt\_embeds: typing.Optional\[torch.FloatTensor\] = None, negative\_prompt\_embeds: typing.Optional\[torch.FloatTensor\] = None, output\_type: str = 'pil', return\_dict: bool = True, attention\_kwargs: typing.Optional\[typing.Dict\[str, typing.Any\]\] = None, callback\_on\_step\_end: typing.Union\[typing.Callable\[\[int, int, typing.Dict\], NoneType\], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType\] = None, callback\_on\_step\_end\_tensor\_inputs: typing.List\[str\] = \['latents'\], max\_sequence\_length: int = 226 ) → [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) or `tuple`
314
+
315
+
317
+ Parameters
318
+
319
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.image)**image** (`PipelineImageInput`) β€” The input image to condition the generation on. Must be an image, a list of images or a `torch.Tensor`.
320
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.prompt)**prompt** (`str` or `List[str]`, _optional_) — The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds` instead.
321
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
322
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.height)**height** (`int`, _optional_, defaults to self.transformer.config.sample\_height \* self.vae\_scale\_factor\_spatial) β€” The height in pixels of the generated image. This is set to 480 by default for the best results.
323
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.width)**width** (`int`, _optional_, defaults to self.transformer.config.sample\_width \* self.vae\_scale\_factor\_spatial) — The width in pixels of the generated image. This is set to 720 by default for the best results.
324
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.num_frames)**num\_frames** (`int`, defaults to `49`) — Number of frames to generate. Must be divisible by self.vae\_scale\_factor\_temporal. Generated video will contain 1 extra frame because CogVideoX is conditioned with (num\_seconds \* fps + 1) frames where num\_seconds is 6 and fps is 8. However, since videos can be saved at any fps, the only condition that needs to be satisfied is that of divisibility mentioned above.
325
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.num_inference_steps)**num\_inference\_steps** (`int`, _optional_, defaults to 50) β€” The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
326
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.timesteps)**timesteps** (`List[int]`, _optional_) β€” Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed will be used. Must be in descending order.
327
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.guidance_scale)**guidance\_scale** (`float`, _optional_, defaults to 6) — Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). `guidance_scale` is defined as `w` of equation 2 of the [Imagen Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages generating images that are closely linked to the text `prompt`, usually at the expense of lower image quality.
328
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” The number of videos to generate per prompt.
329
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.generator)**generator** (`torch.Generator` or `List[torch.Generator]`, _optional_) β€” One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation deterministic.
330
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.latents)**latents** (`torch.FloatTensor`, _optional_) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random `generator`.
331
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.prompt_embeds)**prompt\_embeds** (`torch.FloatTensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
332
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.FloatTensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
333
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.output_type)**output\_type** (`str`, _optional_, defaults to `"pil"`) — The output format of the generated image. Choose between [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
334
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.return_dict)**return\_dict** (`bool`, _optional_, defaults to `True`) — Whether or not to return a [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) instead of a plain tuple.
335
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.attention_kwargs)**attention\_kwargs** (`dict`, _optional_) β€” A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under `self.processor` in [diffusers.models.attention\_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
336
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.callback_on_step_end)**callback\_on\_step\_end** (`Callable`, _optional_) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
337
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.callback_on_step_end_tensor_inputs)**callback\_on\_step\_end\_tensor\_inputs** (`List`, _optional_) β€” The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
338
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.max_sequence_length)**max\_sequence\_length** (`int`, defaults to `226`) β€” Maximum sequence length in encoded prompt. Must be consistent with `self.transformer.config.max_text_seq_length` otherwise may lead to poor results.
339
+
340
+ Returns
341
+
342
+
344
+ [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) or `tuple`
345
+
346
+
348
+ [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) if `return_dict` is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated images.
349
+
350
+ Function invoked when calling the pipeline for generation.
351
+
352
+ [](#diffusers.CogVideoXImageToVideoPipeline.__call__.example)
353
+
354
+ Examples:
355
+
356
+ Copied
357
+
358
+ \>>> import torch
359
+ \>>> from diffusers import CogVideoXImageToVideoPipeline
360
+ \>>> from diffusers.utils import export\_to\_video, load\_image
361
+
362
+ \>>> pipe = CogVideoXImageToVideoPipeline.from\_pretrained("THUDM/CogVideoX-5b-I2V", torch\_dtype=torch.bfloat16)
363
+ \>>> pipe.to("cuda")
364
+
365
+ \>>> prompt = "An astronaut hatching from an egg, on the surface of the moon, the darkness and depth of space realised in the background. High quality, ultrarealistic detail and breath-taking movie-like camera shot."
366
+ \>>> image = load\_image(
367
+ ... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
368
+ ... )
369
+ \>>> video = pipe(image, prompt, use\_dynamic\_cfg=True)
370
+ \>>> export\_to\_video(video.frames\[0\], "output.mp4", fps=8)
371
+
372
+ #### encode\_prompt
373
+
374
+ [](#diffusers.CogVideoXImageToVideoPipeline.encode_prompt)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py#L267)
375
+
376
+ ( prompt: typing.Union\[str, typing.List\[str\]\], negative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, do\_classifier\_free\_guidance: bool = True, num\_videos\_per\_prompt: int = 1, prompt\_embeds: typing.Optional\[torch.Tensor\] = None, negative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = None, max\_sequence\_length: int = 226, device: typing.Optional\[torch.device\] = None, dtype: typing.Optional\[torch.dtype\] = None )
377
+
378
+ Parameters
379
+
380
+ * [](#diffusers.CogVideoXImageToVideoPipeline.encode_prompt.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” prompt to be encoded
381
+ * [](#diffusers.CogVideoXImageToVideoPipeline.encode_prompt.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
382
+ * [](#diffusers.CogVideoXImageToVideoPipeline.encode_prompt.do_classifier_free_guidance)**do\_classifier\_free\_guidance** (`bool`, _optional_, defaults to `True`) β€” Whether to use classifier free guidance or not.
383
+ * [](#diffusers.CogVideoXImageToVideoPipeline.encode_prompt.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) — Number of videos that should be generated per prompt.
384
+ * [](#diffusers.CogVideoXImageToVideoPipeline.encode_prompt.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
385
+ * [](#diffusers.CogVideoXImageToVideoPipeline.encode_prompt.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
386
+ * [](#diffusers.CogVideoXImageToVideoPipeline.encode_prompt.device)**device** β€” (`torch.device`, _optional_): torch device
387
+ * [](#diffusers.CogVideoXImageToVideoPipeline.encode_prompt.dtype)**dtype** β€” (`torch.dtype`, _optional_): torch dtype
388
+
389
+ Encodes the prompt into text encoder hidden states.
390
+
391
+ #### fuse\_qkv\_projections
392
+
393
+ [](#diffusers.CogVideoXImageToVideoPipeline.fuse_qkv_projections)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py#L523)
394
+
395
+ ( )
396
+
397
+ Enables fused QKV projections.
398
+
399
+ #### unfuse\_qkv\_projections
400
+
401
+ [](#diffusers.CogVideoXImageToVideoPipeline.unfuse_qkv_projections)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py#L529)
402
+
403
+ ( )
404
+
405
+ Disable QKV projection fusion if enabled.
406
+
407
+ [](#diffusers.CogVideoXVideoToVideoPipeline)CogVideoXVideoToVideoPipeline
408
+ -------------------------------------------------------------------------
409
+
410
+ ### class diffusers.CogVideoXVideoToVideoPipeline
411
+
412
+ [](#diffusers.CogVideoXVideoToVideoPipeline)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_video2video.py#L169)
413
+
414
+ ( tokenizer: T5Tokenizer, text\_encoder: T5EncoderModel, vae: AutoencoderKLCogVideoX, transformer: CogVideoXTransformer3DModel, scheduler: typing.Union\[diffusers.schedulers.scheduling\_ddim\_cogvideox.CogVideoXDDIMScheduler, diffusers.schedulers.scheduling\_dpm\_cogvideox.CogVideoXDPMScheduler\] )
415
+
416
+ Parameters
417
+
418
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.vae)**vae** ([AutoencoderKL](/docs/diffusers/main/en/api/models/autoencoderkl#diffusers.AutoencoderKL)) β€” Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
419
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.text_encoder)**text\_encoder** (`T5EncoderModel`) β€” Frozen text-encoder. CogVideoX uses [T5](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5EncoderModel); specifically the [t5-v1\_1-xxl](https://huggingface.co/PixArt-alpha/PixArt-alpha/tree/main/t5-v1_1-xxl) variant.
420
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.tokenizer)**tokenizer** (`T5Tokenizer`) β€” Tokenizer of class [T5Tokenizer](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Tokenizer).
421
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.transformer)**transformer** ([CogVideoXTransformer3DModel](/docs/diffusers/main/en/api/models/cogvideox_transformer3d#diffusers.CogVideoXTransformer3DModel)) β€” A text conditioned `CogVideoXTransformer3DModel` to denoise the encoded video latents.
422
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.scheduler)**scheduler** ([SchedulerMixin](/docs/diffusers/main/en/api/schedulers/overview#diffusers.SchedulerMixin)) β€” A scheduler to be used in combination with `transformer` to denoise the encoded video latents.
423
+
424
+ Pipeline for video-to-video generation using CogVideoX.
425
+
426
+ This model inherits from [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
427
+
428
+ #### \_\_call\_\_
429
+
430
+ [](#diffusers.CogVideoXVideoToVideoPipeline.__call__)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_video2video.py#L575)
431
+
432
+ ( video: typing.List\[PIL.Image.Image\] = None, prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, negative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, height: typing.Optional\[int\] = None, width: typing.Optional\[int\] = None, num\_inference\_steps: int = 50, timesteps: typing.Optional\[typing.List\[int\]\] = None, strength: float = 0.8, guidance\_scale: float = 6, use\_dynamic\_cfg: bool = False, num\_videos\_per\_prompt: int = 1, eta: float = 0.0, generator: typing.Union\[torch.\_C.Generator, typing.List\[torch.\_C.Generator\], NoneType\] = None, latents: typing.Optional\[torch.FloatTensor\] = None, prompt\_embeds: typing.Optional\[torch.FloatTensor\] = None, negative\_prompt\_embeds: typing.Optional\[torch.FloatTensor\] = None, output\_type: str = 'pil', return\_dict: bool = True, attention\_kwargs: typing.Optional\[typing.Dict\[str, typing.Any\]\] = None, callback\_on\_step\_end: typing.Union\[typing.Callable\[\[int, int, typing.Dict\], NoneType\], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType\] = None, callback\_on\_step\_end\_tensor\_inputs: typing.List\[str\] = \['latents'\], max\_sequence\_length: int = 226 ) → [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) or `tuple`
433
+
434
+
436
+ Parameters
437
+
438
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.video)**video** (`List[PIL.Image.Image]`) β€” The input video to condition the generation on. Must be a list of images/frames of the video.
439
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.prompt)**prompt** (`str` or `List[str]`, _optional_) — The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds` instead.
440
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
441
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.height)**height** (`int`, _optional_, defaults to self.transformer.config.sample\_height \* self.vae\_scale\_factor\_spatial) β€” The height in pixels of the generated image. This is set to 480 by default for the best results.
442
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.width)**width** (`int`, _optional_, defaults to self.transformer.config.sample\_width \* self.vae\_scale\_factor\_spatial) — The width in pixels of the generated image. This is set to 720 by default for the best results.
443
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.num_inference_steps)**num\_inference\_steps** (`int`, _optional_, defaults to 50) β€” The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
444
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.timesteps)**timesteps** (`List[int]`, _optional_) β€” Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed will be used. Must be in descending order.
445
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.strength)**strength** (`float`, _optional_, defaults to 0.8) β€” Higher strength leads to more differences between original video and generated video.
446
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.guidance_scale)**guidance\_scale** (`float`, _optional_, defaults to 6) — Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). `guidance_scale` is defined as `w` of equation 2 of the [Imagen Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages generating images that are closely linked to the text `prompt`, usually at the expense of lower image quality.
447
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” The number of videos to generate per prompt.
448
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.generator)**generator** (`torch.Generator` or `List[torch.Generator]`, _optional_) β€” One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation deterministic.
449
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.latents)**latents** (`torch.FloatTensor`, _optional_) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random `generator`.
450
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.prompt_embeds)**prompt\_embeds** (`torch.FloatTensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
451
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.FloatTensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
452
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.output_type)**output\_type** (`str`, _optional_, defaults to `"pil"`) — The output format of the generated image. Choose between [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
453
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.return_dict)**return\_dict** (`bool`, _optional_, defaults to `True`) — Whether or not to return a [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) instead of a plain tuple.
454
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.attention_kwargs)**attention\_kwargs** (`dict`, _optional_) β€” A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under `self.processor` in [diffusers.models.attention\_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
455
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.callback_on_step_end)**callback\_on\_step\_end** (`Callable`, _optional_) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
456
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.callback_on_step_end_tensor_inputs)**callback\_on\_step\_end\_tensor\_inputs** (`List`, _optional_) β€” The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
457
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.max_sequence_length)**max\_sequence\_length** (`int`, defaults to `226`) β€” Maximum sequence length in encoded prompt. Must be consistent with `self.transformer.config.max_text_seq_length` otherwise may lead to poor results.
458
+
459
+ Returns
460
+
461
+
463
+ [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) or `tuple`
464
+
465
+
467
+ [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) if `return_dict` is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated images.
468
+
469
+ Function invoked when calling the pipeline for generation.
470
+
471
+ [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.example)
472
+
473
+ Examples:
474
+
475
+ Copied
476
+
477
+ \>>> import torch
478
+ \>>> from diffusers import CogVideoXDPMScheduler, CogVideoXVideoToVideoPipeline
479
+ \>>> from diffusers.utils import export\_to\_video, load\_video
480
+
481
+ \>>> \# Models: "THUDM/CogVideoX-2b" or "THUDM/CogVideoX-5b"
482
+ \>>> pipe = CogVideoXVideoToVideoPipeline.from\_pretrained("THUDM/CogVideoX-5b", torch\_dtype=torch.bfloat16)
483
+ \>>> pipe.to("cuda")
484
+ \>>> pipe.scheduler = CogVideoXDPMScheduler.from\_config(pipe.scheduler.config)
485
+
486
+ \>>> input\_video = load\_video(
487
+ ... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/hiker.mp4"
488
+ ... )
489
+ \>>> prompt = (
490
+ ... "An astronaut stands triumphantly at the peak of a towering mountain. Panorama of rugged peaks and "
491
+ ... "valleys. Very futuristic vibe and animated aesthetic. Highlights of purple and golden colors in "
492
+ ... "the scene. The sky is looks like an animated/cartoonish dream of galaxies, nebulae, stars, planets, "
493
+ ... "moons, but the remainder of the scene is mostly realistic."
494
+ ... )
495
+
496
+ \>>> video = pipe(
497
+ ... video=input\_video, prompt=prompt, strength=0.8, guidance\_scale=6, num\_inference\_steps=50
498
+ ... ).frames\[0\]
499
+ \>>> export\_to\_video(video, "output.mp4", fps=8)
500
+
501
+ #### encode\_prompt
502
+
503
+ [](#diffusers.CogVideoXVideoToVideoPipeline.encode_prompt)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_video2video.py#L269)
504
+
505
+ ( prompt: typing.Union\[str, typing.List\[str\]\], negative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, do\_classifier\_free\_guidance: bool = True, num\_videos\_per\_prompt: int = 1, prompt\_embeds: typing.Optional\[torch.Tensor\] = None, negative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = None, max\_sequence\_length: int = 226, device: typing.Optional\[torch.device\] = None, dtype: typing.Optional\[torch.dtype\] = None )
506
+
507
+ Parameters
508
+
509
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.encode_prompt.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” prompt to be encoded
510
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.encode_prompt.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
511
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.encode_prompt.do_classifier_free_guidance)**do\_classifier\_free\_guidance** (`bool`, _optional_, defaults to `True`) β€” Whether to use classifier free guidance or not.
512
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.encode_prompt.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) — Number of videos that should be generated per prompt.
513
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.encode_prompt.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
514
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.encode_prompt.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
515
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.encode_prompt.device)**device** β€” (`torch.device`, _optional_): torch device
516
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.encode_prompt.dtype)**dtype** β€” (`torch.dtype`, _optional_): torch dtype
517
+
518
+ Encodes the prompt into text encoder hidden states.
519
+
520
+ #### fuse\_qkv\_projections
521
+
522
+ [](#diffusers.CogVideoXVideoToVideoPipeline.fuse_qkv_projections)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_video2video.py#L496)
523
+
524
+ ( )
525
+
526
+ Enables fused QKV projections.
527
+
528
+ #### unfuse\_qkv\_projections
529
+
530
+ [](#diffusers.CogVideoXVideoToVideoPipeline.unfuse_qkv_projections)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_video2video.py#L502)
531
+
532
+ ( )
533
+
534
+ Disable QKV projection fusion if enabled.
535
+
536
+ [](#diffusers.CogVideoXFunControlPipeline)CogVideoXFunControlPipeline
537
+ ---------------------------------------------------------------------
538
+
539
+ ### class diffusers.CogVideoXFunControlPipeline
540
+
541
+ [](#diffusers.CogVideoXFunControlPipeline)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_fun_control.py#L154)
542
+
543
+ ( tokenizer: T5Tokenizer, text\_encoder: T5EncoderModel, vae: AutoencoderKLCogVideoX, transformer: CogVideoXTransformer3DModel, scheduler: KarrasDiffusionSchedulers )
544
+
545
+ Parameters
546
+
547
+ * [](#diffusers.CogVideoXFunControlPipeline.vae)**vae** ([AutoencoderKL](/docs/diffusers/main/en/api/models/autoencoderkl#diffusers.AutoencoderKL)) β€” Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
548
+ * [](#diffusers.CogVideoXFunControlPipeline.text_encoder)**text\_encoder** (`T5EncoderModel`) β€” Frozen text-encoder. CogVideoX uses [T5](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5EncoderModel); specifically the [t5-v1\_1-xxl](https://huggingface.co/PixArt-alpha/PixArt-alpha/tree/main/t5-v1_1-xxl) variant.
549
+ * [](#diffusers.CogVideoXFunControlPipeline.tokenizer)**tokenizer** (`T5Tokenizer`) β€” Tokenizer of class [T5Tokenizer](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Tokenizer).
550
+ * [](#diffusers.CogVideoXFunControlPipeline.transformer)**transformer** ([CogVideoXTransformer3DModel](/docs/diffusers/main/en/api/models/cogvideox_transformer3d#diffusers.CogVideoXTransformer3DModel)) β€” A text conditioned `CogVideoXTransformer3DModel` to denoise the encoded video latents.
551
+ * [](#diffusers.CogVideoXFunControlPipeline.scheduler)**scheduler** ([SchedulerMixin](/docs/diffusers/main/en/api/schedulers/overview#diffusers.SchedulerMixin)) β€” A scheduler to be used in combination with `transformer` to denoise the encoded video latents.
552
+
553
+ Pipeline for controlled text-to-video generation using CogVideoX Fun.
554
+
555
+ This model inherits from [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
556
+
557
+ #### \_\_call\_\_
558
+
559
+ [](#diffusers.CogVideoXFunControlPipeline.__call__)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_fun_control.py#L551)
560
+
561
+ ( prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, negative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, control\_video: typing.Optional\[typing.List\[PIL.Image.Image\]\] = None, height: typing.Optional\[int\] = None, width: typing.Optional\[int\] = None, num\_inference\_steps: int = 50, timesteps: typing.Optional\[typing.List\[int\]\] = None, guidance\_scale: float = 6, use\_dynamic\_cfg: bool = False, num\_videos\_per\_prompt: int = 1, eta: float = 0.0, generator: typing.Union\[torch.\_C.Generator, typing.List\[torch.\_C.Generator\], NoneType\] = None, latents: typing.Optional\[torch.Tensor\] = None, control\_video\_latents: typing.Optional\[torch.Tensor\] = None, prompt\_embeds: typing.Optional\[torch.Tensor\] = None, negative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = None, output\_type: str = 'pil', return\_dict: bool = True, attention\_kwargs: typing.Optional\[typing.Dict\[str, typing.Any\]\] = None, callback\_on\_step\_end: typing.Union\[typing.Callable\[\[int, int, typing.Dict\], NoneType\], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType\] = None, callback\_on\_step\_end\_tensor\_inputs: typing.List\[str\] = \['latents'\], max\_sequence\_length: int = 226 ) → [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) or `tuple`
562
+
563
+
565
+ Parameters
566
+
567
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.prompt)**prompt** (`str` or `List[str]`, _optional_) — The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds` instead.
568
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
569
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.control_video)**control\_video** (`List[PIL.Image.Image]`) β€” The control video to condition the generation on. Must be a list of images/frames of the video. If not provided, `control_video_latents` must be provided.
570
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.height)**height** (`int`, _optional_, defaults to self.transformer.config.sample\_height \* self.vae\_scale\_factor\_spatial) β€” The height in pixels of the generated image. This is set to 480 by default for the best results.
571
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.width)**width** (`int`, _optional_, defaults to self.transformer.config.sample\_width \* self.vae\_scale\_factor\_spatial) — The width in pixels of the generated image. This is set to 720 by default for the best results.
572
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.num_inference_steps)**num\_inference\_steps** (`int`, _optional_, defaults to 50) β€” The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
573
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.timesteps)**timesteps** (`List[int]`, _optional_) β€” Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed will be used. Must be in descending order.
574
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.guidance_scale)**guidance\_scale** (`float`, _optional_, defaults to 6.0) — Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). `guidance_scale` is defined as `w` of equation 2 of the [Imagen Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages generating images that are closely linked to the text `prompt`, usually at the expense of lower image quality.
575
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” The number of videos to generate per prompt.
576
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.generator)**generator** (`torch.Generator` or `List[torch.Generator]`, _optional_) β€” One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation deterministic.
577
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.latents)**latents** (`torch.Tensor`, _optional_) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random `generator`.
578
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.control_video_latents)**control\_video\_latents** (`torch.Tensor`, _optional_) β€” Pre-generated control latents, sampled from a Gaussian distribution, to be used as inputs for controlled video generation. If not provided, `control_video` must be provided.
579
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
580
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
581
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.output_type)**output\_type** (`str`, _optional_, defaults to `"pil"`) — The output format of the generated image. Choose between [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
582
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.return_dict)**return\_dict** (`bool`, _optional_, defaults to `True`) — Whether or not to return a [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) instead of a plain tuple.
583
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.attention_kwargs)**attention\_kwargs** (`dict`, _optional_) β€” A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under `self.processor` in [diffusers.models.attention\_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
584
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.callback_on_step_end)**callback\_on\_step\_end** (`Callable`, _optional_) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
585
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.callback_on_step_end_tensor_inputs)**callback\_on\_step\_end\_tensor\_inputs** (`List`, _optional_) β€” The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
586
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.max_sequence_length)**max\_sequence\_length** (`int`, defaults to `226`) β€” Maximum sequence length in encoded prompt. Must be consistent with `self.transformer.config.max_text_seq_length` otherwise may lead to poor results.
587
+
588
+ Returns
589
+
590
+
592
+ [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) or `tuple`
593
+
594
+
596
+ [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) if `return_dict` is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated images.
597
+
598
+ Function invoked when calling the pipeline for generation.
599
+
600
+ [](#diffusers.CogVideoXFunControlPipeline.__call__.example)
601
+
602
+ Examples:
603
+
604
+ Copied
605
+
606
+ \>>> import torch
607
+ \>>> from diffusers import CogVideoXFunControlPipeline, DDIMScheduler
608
+ \>>> from diffusers.utils import export\_to\_video, load\_video
609
+
610
+ \>>> pipe = CogVideoXFunControlPipeline.from\_pretrained(
611
+ ... "alibaba-pai/CogVideoX-Fun-V1.1-5b-Pose", torch\_dtype=torch.bfloat16
612
+ ... )
613
+ \>>> pipe.scheduler = DDIMScheduler.from\_config(pipe.scheduler.config)
614
+ \>>> pipe.to("cuda")
615
+
616
+ \>>> control\_video = load\_video(
617
+ ... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/hiker.mp4"
618
+ ... )
619
+ \>>> prompt = (
620
+ ... "An astronaut stands triumphantly at the peak of a towering mountain. Panorama of rugged peaks and "
621
+ ... "valleys. Very futuristic vibe and animated aesthetic. Highlights of purple and golden colors in "
622
+ ... "the scene. The sky is looks like an animated/cartoonish dream of galaxies, nebulae, stars, planets, "
623
+ ... "moons, but the remainder of the scene is mostly realistic."
624
+ ... )
625
+
626
+ \>>> video = pipe(prompt=prompt, control\_video=control\_video).frames\[0\]
627
+ \>>> export\_to\_video(video, "output.mp4", fps=8)
628
+
629
+ #### encode\_prompt
630
+
631
+ [](#diffusers.CogVideoXFunControlPipeline.encode_prompt)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_fun_control.py#L253)
632
+
633
+ ( prompt: typing.Union\[str, typing.List\[str\]\]negative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = Nonedo\_classifier\_free\_guidance: bool = Truenum\_videos\_per\_prompt: int = 1prompt\_embeds: typing.Optional\[torch.Tensor\] = Nonenegative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = Nonemax\_sequence\_length: int = 226device: typing.Optional\[torch.device\] = Nonedtype: typing.Optional\[torch.dtype\] = None )
634
+
635
+ Parameters
636
+
637
+ * [](#diffusers.CogVideoXFunControlPipeline.encode_prompt.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” prompt to be encoded
638
+ * [](#diffusers.CogVideoXFunControlPipeline.encode_prompt.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
639
+ * [](#diffusers.CogVideoXFunControlPipeline.encode_prompt.do_classifier_free_guidance)**do\_classifier\_free\_guidance** (`bool`, _optional_, defaults to `True`) β€” Whether to use classifier free guidance or not.
640
+ * [](#diffusers.CogVideoXFunControlPipeline.encode_prompt.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” Number of videos that should be generated per prompt.
641
+ * [](#diffusers.CogVideoXFunControlPipeline.encode_prompt.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
642
+ * [](#diffusers.CogVideoXFunControlPipeline.encode_prompt.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
643
+ * [](#diffusers.CogVideoXFunControlPipeline.encode_prompt.device)**device** β€” (`torch.device`, _optional_): torch device
644
+ * [](#diffusers.CogVideoXFunControlPipeline.encode_prompt.dtype)**dtype** β€” (`torch.dtype`, _optional_): torch dtype
645
+
646
+ Encodes the prompt into text encoder hidden states.
647
+
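+ For illustration, here is a minimal sketch of precomputing text embeddings with `encode_prompt()` and passing them back to the pipeline call. It assumes the method returns the positive and negative embeddings in that order, and the prompt strings are only placeholders.
+ 
+ Copied
+ 
+ import torch
+ from diffusers import CogVideoXFunControlPipeline
+ 
+ pipe = CogVideoXFunControlPipeline.from_pretrained(
+     "alibaba-pai/CogVideoX-Fun-V1.1-5b-Pose", torch_dtype=torch.bfloat16
+ ).to("cuda")
+ 
+ # Precompute the embeddings once (assumed return order: positive, negative).
+ prompt_embeds, negative_prompt_embeds = pipe.encode_prompt(
+     prompt="An astronaut hiking over rugged mountain peaks.",
+     negative_prompt="blurry, low quality",
+     do_classifier_free_guidance=True,
+     num_videos_per_prompt=1,
+ )
+ 
+ # Reuse the embeddings across calls instead of re-encoding the prompt,
+ # together with a control video as in the example above:
+ # video = pipe(
+ #     prompt_embeds=prompt_embeds,
+ #     negative_prompt_embeds=negative_prompt_embeds,
+ #     control_video=control_video,
+ # ).frames[0]
+ 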
648
+ #### fuse\_qkv\_projections
649
+
650
+ [](#diffusers.CogVideoXFunControlPipeline.fuse_qkv_projections)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_fun_control.py#L473)
651
+
652
+ ( )
653
+
654
+ Enables fused QKV projections.
655
+
656
+ #### unfuse\_qkv\_projections
657
+
658
+ [](#diffusers.CogVideoXFunControlPipeline.unfuse_qkv_projections)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_fun_control.py#L478)
659
+
660
+ ( )
661
+
662
+ Disable QKV projection fusion if enabled.
663
+
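+ As a minimal sketch of how these two methods can wrap inference (reusing the checkpoint and control video from the example above; the prompt string is only a placeholder), fusion is enabled before sampling and disabled afterwards:
+ 
+ Copied
+ 
+ import torch
+ from diffusers import CogVideoXFunControlPipeline
+ from diffusers.utils import export_to_video, load_video
+ 
+ pipe = CogVideoXFunControlPipeline.from_pretrained(
+     "alibaba-pai/CogVideoX-Fun-V1.1-5b-Pose", torch_dtype=torch.bfloat16
+ ).to("cuda")
+ control_video = load_video(
+     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/hiker.mp4"
+ )
+ 
+ # Fuse the attention QKV projections before running inference ...
+ pipe.fuse_qkv_projections()
+ video = pipe(prompt="An astronaut hiking over rugged mountain peaks.", control_video=control_video).frames[0]
+ # ... and unfuse them again if the pipeline will be reused without fusion.
+ pipe.unfuse_qkv_projections()
+ 
+ export_to_video(video, "output.mp4", fps=8)
+ 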
664
+ [](#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput)CogVideoXPipelineOutput
665
+ ------------------------------------------------------------------------------------------------
666
+
667
+ ### class diffusers.pipelines.cogvideo.pipeline\_output.CogVideoXPipelineOutput
668
+
669
+ [](#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_output.py#L8)
670
+
671
+ ( frames: Tensor )
672
+
673
+ Parameters
674
+
675
+ * [](#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput.frames)**frames** (`torch.Tensor`, `np.ndarray`, or List\[List\[PIL.Image.Image\]\]) β€” List of video outputs. It can be a nested list of length `batch_size`, with each sub-list containing denoised PIL image sequences of length `num_frames`. It can also be a NumPy array or Torch tensor of shape `(batch_size, num_frames, channels, height, width)`.
676
+
677
+ Output class for CogVideo pipelines.
678
+
679
docs/diffusers/Using Diffusers for HunyuanVideo.md ADDED
@@ -0,0 +1,232 @@
1
+ [](#hunyuanvideo)HunyuanVideo
2
+ =============================
3
+
4
+ ![LoRA](https://img.shields.io/badge/LoRA-d8b4fe?style=flat)
5
+
6
+ [HunyuanVideo](https://www.arxiv.org/abs/2412.03603) by Tencent.
7
+
8
+ _Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at [this https URL](https://github.com/tencent/HunyuanVideo)._
9
+
10
+ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
11
+
12
+ Recommendations for inference:
13
+
14
+ * Both text encoders should be in `torch.float16`.
15
+ * Transformer should be in `torch.bfloat16`.
16
+ * VAE should be in `torch.float16`.
17
+ * `num_frames` should be of the form `4 * k + 1`, for example `49` or `129`.
18
+ * For smaller resolution videos, try lower values of `shift` (between `2.0` and `5.0`) in the [Scheduler](https://huggingface.co/docs/diffusers/main/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler.shift). For larger resolution videos, try higher values (between `7.0` and `12.0`). The default value is `7.0` for HunyuanVideo; one way to apply these recommendations is shown in the sketch after this list.
19
+ * For more information about supported resolutions and other details, please refer to the original repository [here](https://github.com/Tencent/HunyuanVideo/).
20
+
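+ A minimal sketch of one way to apply these recommendations; the checkpoint is the community mirror used elsewhere on this page, and the `shift=5.0` value is only an illustrative choice for a smaller resolution:
+ 
+ Copied
+ 
+ import torch
+ from diffusers import (
+     FlowMatchEulerDiscreteScheduler,
+     HunyuanVideoPipeline,
+     HunyuanVideoTransformer3DModel,
+ )
+ from diffusers.utils import export_to_video
+ 
+ model_id = "hunyuanvideo-community/HunyuanVideo"
+ 
+ # Transformer in bfloat16, remaining components (text encoders, VAE) in float16.
+ transformer = HunyuanVideoTransformer3DModel.from_pretrained(
+     model_id, subfolder="transformer", torch_dtype=torch.bfloat16
+ )
+ pipe = HunyuanVideoPipeline.from_pretrained(
+     model_id, transformer=transformer, torch_dtype=torch.float16
+ )
+ 
+ # Lower shift for a smaller resolution (illustrative value; the default is 7.0).
+ pipe.scheduler = FlowMatchEulerDiscreteScheduler.from_config(pipe.scheduler.config, shift=5.0)
+ 
+ pipe.vae.enable_tiling()
+ pipe.to("cuda")
+ 
+ # num_frames follows the 4 * k + 1 rule (k = 12 gives 49 frames).
+ video = pipe(
+     prompt="A cat walks on the grass, realistic style.",
+     height=320,
+     width=512,
+     num_frames=49,
+     num_inference_steps=30,
+ ).frames[0]
+ export_to_video(video, "cat.mp4", fps=15)
+ 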
21
+ [](#available-models)Available models
22
+ -------------------------------------
23
+
24
+ The following models are available for the [`HunyuanVideoPipeline`](text-to-video) pipeline:
25
+
26
+ | Model name | Description |
+ | --- | --- |
+ | [`hunyuanvideo-community/HunyuanVideo`](https://huggingface.co/hunyuanvideo-community/HunyuanVideo) | Official HunyuanVideo (guidance-distilled). Performs best at multiple resolutions and frame counts, with `guidance_scale=6.0`, `true_cfg_scale=1.0` and without a negative prompt. |
+ | [`Skywork/SkyReels-V1-Hunyuan-T2V`](https://huggingface.co/Skywork/SkyReels-V1-Hunyuan-T2V) | Skywork's custom finetune of HunyuanVideo (de-distilled). Performs best at `97x544x960` resolution with `guidance_scale=1.0`, `true_cfg_scale=6.0` and a negative prompt. |
37
+
38
+ The following models are available for the image-to-video pipeline:
39
+
40
+ | Model name | Description |
+ | --- | --- |
+ | [`Skywork/SkyReels-V1-Hunyuan-I2V`](https://huggingface.co/Skywork/SkyReels-V1-Hunyuan-I2V) | Skywork's custom finetune of HunyuanVideo (de-distilled). Performs best at `97x544x960` resolution with `guidance_scale=1.0`, `true_cfg_scale=6.0` and a negative prompt. |
+ | [`hunyuanvideo-community/HunyuanVideo-I2V`](https://huggingface.co/hunyuanvideo-community/HunyuanVideo-I2V) | Tencent's official HunyuanVideo I2V model. Performs best at resolutions of 480, 720, 960, 1280. A higher `shift` value when initializing the scheduler is recommended (good values are between 7 and 20). |
51
+
52
+ [](#quantization)Quantization
53
+ -----------------------------
54
+
55
+ Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.
56
+
57
+ Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [HunyuanVideoPipeline](/docs/diffusers/main/en/api/pipelines/hunyuan_video#diffusers.HunyuanVideoPipeline) for inference with bitsandbytes.
58
+
59
+ Copied
60
+
61
+ import torch
62
+ from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
63
+ from diffusers.utils import export\_to\_video
64
+
65
+ quant\_config = DiffusersBitsAndBytesConfig(load\_in\_8bit=True)
66
+ transformer\_8bit = HunyuanVideoTransformer3DModel.from\_pretrained(
67
+ "hunyuanvideo-community/HunyuanVideo",
68
+ subfolder="transformer",
69
+ quantization\_config=quant\_config,
70
+ torch\_dtype=torch.bfloat16,
71
+ )
72
+
73
+ pipeline = HunyuanVideoPipeline.from\_pretrained(
74
+ "hunyuanvideo-community/HunyuanVideo",
75
+ transformer=transformer\_8bit,
76
+ torch\_dtype=torch.float16,
77
+ device\_map="balanced",
78
+ )
79
+
80
+ prompt = "A cat walks on the grass, realistic style."
81
+ video = pipeline(prompt=prompt, num\_frames=61, num\_inference\_steps=30).frames\[0\]
82
+ export\_to\_video(video, "cat.mp4", fps=15)
83
+
84
+ [](#diffusers.HunyuanVideoPipeline)HunyuanVideoPipeline
85
+ -------------------------------------------------------
86
+
87
+ ### class diffusers.HunyuanVideoPipeline
88
+
89
+ [](#diffusers.HunyuanVideoPipeline)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video.py#L144)
90
+
91
+ ( text\_encoder: LlamaModeltokenizer: LlamaTokenizerFasttransformer: HunyuanVideoTransformer3DModelvae: AutoencoderKLHunyuanVideoscheduler: FlowMatchEulerDiscreteSchedulertext\_encoder\_2: CLIPTextModeltokenizer\_2: CLIPTokenizer )
92
+
93
+ Parameters
94
+
95
+ * [](#diffusers.HunyuanVideoPipeline.text_encoder)**text\_encoder** (`LlamaModel`) β€” [Llava Llama3-8B](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers).
96
+ * [](#diffusers.HunyuanVideoPipeline.tokenizer)**tokenizer** (`LlamaTokenizer`) β€” Tokenizer from [Llava Llama3-8B](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers).
97
+ * [](#diffusers.HunyuanVideoPipeline.transformer)**transformer** ([HunyuanVideoTransformer3DModel](/docs/diffusers/main/en/api/models/hunyuan_video_transformer_3d#diffusers.HunyuanVideoTransformer3DModel)) β€” Conditional Transformer to denoise the encoded image latents.
98
+ * [](#diffusers.HunyuanVideoPipeline.scheduler)**scheduler** ([FlowMatchEulerDiscreteScheduler](/docs/diffusers/main/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler)) β€” A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
99
+ * [](#diffusers.HunyuanVideoPipeline.vae)**vae** ([AutoencoderKLHunyuanVideo](/docs/diffusers/main/en/api/models/autoencoder_kl_hunyuan_video#diffusers.AutoencoderKLHunyuanVideo)) β€” Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
100
+ * [](#diffusers.HunyuanVideoPipeline.text_encoder_2)**text\_encoder\_2** (`CLIPTextModel`) β€” [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant.
101
+ * [](#diffusers.HunyuanVideoPipeline.tokenizer_2)**tokenizer\_2** (`CLIPTokenizer`) β€” Tokenizer of class [CLIPTokenizer](https://huggingface.co/docs/transformers/en/model_doc/clip#transformers.CLIPTokenizer).
102
+
103
+ Pipeline for text-to-video generation using HunyuanVideo.
104
+
105
+ This model inherits from [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
106
+
107
+ #### \_\_call\_\_
108
+
109
+ [](#diffusers.HunyuanVideoPipeline.__call__)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video.py#L467)
110
+
111
+ ( prompt: typing.Union\[str, typing.List\[str\]\] = Noneprompt\_2: typing.Union\[str, typing.List\[str\]\] = Nonenegative\_prompt: typing.Union\[str, typing.List\[str\]\] = Nonenegative\_prompt\_2: typing.Union\[str, typing.List\[str\]\] = Noneheight: int = 720width: int = 1280num\_frames: int = 129num\_inference\_steps: int = 50sigmas: typing.List\[float\] = Nonetrue\_cfg\_scale: float = 1.0guidance\_scale: float = 6.0num\_videos\_per\_prompt: typing.Optional\[int\] = 1generator: typing.Union\[torch.\_C.Generator, typing.List\[torch.\_C.Generator\], NoneType\] = Nonelatents: typing.Optional\[torch.Tensor\] = Noneprompt\_embeds: typing.Optional\[torch.Tensor\] = Nonepooled\_prompt\_embeds: typing.Optional\[torch.Tensor\] = Noneprompt\_attention\_mask: typing.Optional\[torch.Tensor\] = Nonenegative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = Nonenegative\_pooled\_prompt\_embeds: typing.Optional\[torch.Tensor\] = Nonenegative\_prompt\_attention\_mask: typing.Optional\[torch.Tensor\] = Noneoutput\_type: typing.Optional\[str\] = 'pil'return\_dict: bool = Trueattention\_kwargs: typing.Optional\[typing.Dict\[str, typing.Any\]\] = Nonecallback\_on\_step\_end: typing.Union\[typing.Callable\[\[int, int, typing.Dict\], NoneType\], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType\] = Nonecallback\_on\_step\_end\_tensor\_inputs: typing.List\[str\] = \['latents'\]prompt\_template: typing.Dict\[str, typing.Any\] = {'template': '<|start\_header\_id|>system<|end\_header\_id|>\\n\\nDescribe the video by detailing the following aspects: 1. The main content and theme of the video.2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects.3. Actions, events, behaviors temporal relationships, physical movement changes of the objects.4. background environment, light, style and atmosphere.5. camera angles, movements, and transitions used in the video:<|eot\_id|><|start\_header\_id|>user<|end\_header\_id|>\\n\\n{}<|eot\_id|>', 'crop\_start': 95}max\_sequence\_length: int = 256 ) β†’ export const metadata = 'undefined';`~HunyuanVideoPipelineOutput` or `tuple`
112
+
113
+
115
+ Parameters
116
+
117
+ * [](#diffusers.HunyuanVideoPipeline.__call__.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts to guide the video generation. If not defined, one has to pass `prompt_embeds` instead.
118
+ * [](#diffusers.HunyuanVideoPipeline.__call__.prompt_2)**prompt\_2** (`str` or `List[str]`, _optional_) β€” The prompt or prompts to be sent to `tokenizer_2` and `text_encoder_2`. If not defined, `prompt` will be used instead.
119
+ * [](#diffusers.HunyuanVideoPipeline.__call__.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `true_cfg_scale` is not greater than `1`).
120
+ * [](#diffusers.HunyuanVideoPipeline.__call__.negative_prompt_2)**negative\_prompt\_2** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation to be sent to `tokenizer_2` and `text_encoder_2`. If not defined, `negative_prompt` is used in all the text-encoders.
121
+ * [](#diffusers.HunyuanVideoPipeline.__call__.height)**height** (`int`, defaults to `720`) β€” The height in pixels of the generated image.
122
+ * [](#diffusers.HunyuanVideoPipeline.__call__.width)**width** (`int`, defaults to `1280`) β€” The width in pixels of the generated image.
123
+ * [](#diffusers.HunyuanVideoPipeline.__call__.num_frames)**num\_frames** (`int`, defaults to `129`) β€” The number of frames in the generated video.
124
+ * [](#diffusers.HunyuanVideoPipeline.__call__.num_inference_steps)**num\_inference\_steps** (`int`, defaults to `50`) β€” The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
125
+ * [](#diffusers.HunyuanVideoPipeline.__call__.sigmas)**sigmas** (`List[float]`, _optional_) β€” Custom sigmas to use for the denoising process with schedulers which support a `sigmas` argument in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed will be used.
126
+ * [](#diffusers.HunyuanVideoPipeline.__call__.true_cfg_scale)**true\_cfg\_scale** (`float`, _optional_, defaults to 1.0) β€” When greater than `1.0` and a `negative_prompt` is provided, true classifier-free guidance is enabled.
127
+ * [](#diffusers.HunyuanVideoPipeline.__call__.guidance_scale)**guidance\_scale** (`float`, defaults to `6.0`) β€” Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). `guidance_scale` is defined as `w` in equation 2 of the [Imagen Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are closely linked to the text `prompt`, usually at the expense of lower image quality. Note that the only available HunyuanVideo model is CFG-distilled, which means that traditional guidance between unconditional and conditional latents is not applied.
128
+ * [](#diffusers.HunyuanVideoPipeline.__call__.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” The number of images to generate per prompt.
129
+ * [](#diffusers.HunyuanVideoPipeline.__call__.generator)**generator** (`torch.Generator` or `List[torch.Generator]`, _optional_) β€” A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation deterministic.
130
+ * [](#diffusers.HunyuanVideoPipeline.__call__.latents)**latents** (`torch.Tensor`, _optional_) β€” Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random `generator`.
131
+ * [](#diffusers.HunyuanVideoPipeline.__call__.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the `prompt` input argument.
132
+ * [](#diffusers.HunyuanVideoPipeline.__call__.pooled_prompt_embeds)**pooled\_prompt\_embeds** (`torch.FloatTensor`, _optional_) β€” Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, pooled text embeddings will be generated from `prompt` input argument.
133
+ * [](#diffusers.HunyuanVideoPipeline.__call__.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.FloatTensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
134
+ * [](#diffusers.HunyuanVideoPipeline.__call__.negative_pooled_prompt_embeds)**negative\_pooled\_prompt\_embeds** (`torch.FloatTensor`, _optional_) β€” Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, pooled negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
135
+ * [](#diffusers.HunyuanVideoPipeline.__call__.output_type)**output\_type** (`str`, _optional_, defaults to `"pil"`) β€” The output format of the generated image. Choose between `PIL.Image` or `np.array`.
136
+ * [](#diffusers.HunyuanVideoPipeline.__call__.return_dict)**return\_dict** (`bool`, _optional_, defaults to `True`) β€” Whether or not to return a `HunyuanVideoPipelineOutput` instead of a plain tuple.
137
+ * [](#diffusers.HunyuanVideoPipeline.__call__.attention_kwargs)**attention\_kwargs** (`dict`, _optional_) β€” A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under `self.processor` in [diffusers.models.attention\_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
138
+ * [](#diffusers.HunyuanVideoPipeline.__call__.clip_skip)**clip\_skip** (`int`, _optional_) β€” Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
139
+ * [](#diffusers.HunyuanVideoPipeline.__call__.callback_on_step_end)**callback\_on\_step\_end** (`Callable`, `PipelineCallback`, `MultiPipelineCallbacks`, _optional_) β€” A function or a subclass of `PipelineCallback` or `MultiPipelineCallbacks` that is called at the end of each denoising step during inference with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
140
+ * [](#diffusers.HunyuanVideoPipeline.__call__.callback_on_step_end_tensor_inputs)**callback\_on\_step\_end\_tensor\_inputs** (`List`, _optional_) β€” The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
141
+
142
+ Returns
143
+
144
+
146
+ `~HunyuanVideoPipelineOutput` or `tuple`
147
+
148
+
150
+ If `return_dict` is `True`, `HunyuanVideoPipelineOutput` is returned, otherwise a `tuple` is returned where the first element is a list with the generated images and the second element is a list of `bool`s indicating whether the corresponding generated image contains β€œnot-safe-for-work” (nsfw) content.
151
+
152
+ The call function to the pipeline for generation.
153
+
154
+ [](#diffusers.HunyuanVideoPipeline.__call__.example)
155
+
156
+ Examples:
157
+
158
+ Copied
159
+
160
+ \>>> import torch
161
+ \>>> from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
162
+ \>>> from diffusers.utils import export\_to\_video
163
+
164
+ \>>> model\_id = "hunyuanvideo-community/HunyuanVideo"
165
+ \>>> transformer = HunyuanVideoTransformer3DModel.from\_pretrained(
166
+ ... model\_id, subfolder="transformer", torch\_dtype=torch.bfloat16
167
+ ... )
168
+ \>>> pipe = HunyuanVideoPipeline.from\_pretrained(model\_id, transformer=transformer, torch\_dtype=torch.float16)
169
+ \>>> pipe.vae.enable\_tiling()
170
+ \>>> pipe.to("cuda")
171
+
172
+ \>>> output = pipe(
173
+ ... prompt="A cat walks on the grass, realistic",
174
+ ... height=320,
175
+ ... width=512,
176
+ ... num\_frames=61,
177
+ ... num\_inference\_steps=30,
178
+ ... ).frames\[0\]
179
+ \>>> export\_to\_video(output, "output.mp4", fps=15)
180
+
181
+ #### disable\_vae\_slicing
182
+
183
+ [](#diffusers.HunyuanVideoPipeline.disable_vae_slicing)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video.py#L425)
184
+
185
+ ( )
186
+
187
+ Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to computing decoding in one step.
188
+
189
+ #### disable\_vae\_tiling
190
+
191
+ [](#diffusers.HunyuanVideoPipeline.disable_vae_tiling)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video.py#L440)
192
+
193
+ ( )
194
+
195
+ Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to computing decoding in one step.
196
+
197
+ #### enable\_vae\_slicing
198
+
199
+ [](#diffusers.HunyuanVideoPipeline.enable_vae_slicing)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video.py#L418)
200
+
201
+ ( )
202
+
203
+ Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
204
+
205
+ #### enable\_vae\_tiling
206
+
207
+ [](#diffusers.HunyuanVideoPipeline.enable_vae_tiling)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video.py#L432)
208
+
209
+ ( )
210
+
211
+ Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow processing larger images.
212
+
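+ For example, a rough sketch of toggling these decoder memory optimizations around a generation call (assuming `pipe` is an already loaded `HunyuanVideoPipeline`, as in the example above):
+ 
+ Copied
+ 
+ # Decode latents tile by tile and slice by slice to reduce peak memory.
+ pipe.enable_vae_tiling()
+ pipe.enable_vae_slicing()
+ 
+ video = pipe(
+     prompt="A cat walks on the grass, realistic",
+     height=320,
+     width=512,
+     num_frames=61,
+     num_inference_steps=30,
+ ).frames[0]
+ 
+ # Revert to single-pass decoding when memory is no longer a concern.
+ pipe.disable_vae_tiling()
+ pipe.disable_vae_slicing()
+ 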
213
+ [](#diffusers.pipelines.hunyuan_video.pipeline_output.HunyuanVideoPipelineOutput)HunyuanVideoPipelineOutput
214
+ -----------------------------------------------------------------------------------------------------------
215
+
216
+ ### class diffusers.pipelines.hunyuan\_video.pipeline\_output.HunyuanVideoPipelineOutput
217
+
218
+ [](#diffusers.pipelines.hunyuan_video.pipeline_output.HunyuanVideoPipelineOutput)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/hunyuan_video/pipeline_output.py#L8)
219
+
220
+ ( frames: Tensor )
221
+
222
+ Parameters
223
+
224
+ * [](#diffusers.pipelines.hunyuan_video.pipeline_output.HunyuanVideoPipelineOutput.frames)**frames** (`torch.Tensor`, `np.ndarray`, or List\[List\[PIL.Image.Image\]\]) β€” List of video outputs. It can be a nested list of length `batch_size`, with each sub-list containing denoised PIL image sequences of length `num_frames`. It can also be a NumPy array or Torch tensor of shape `(batch_size, num_frames, channels, height, width)`.
225
+
226
+ Output class for HunyuanVideo pipelines.
227
+
228
docs/diffusers/Using Diffusers for LTX Video.md ADDED
@@ -0,0 +1,421 @@
1
+ [](#ltx-video)LTX Video
2
+ =======================
3
+
4
+ ![LoRA](https://img.shields.io/badge/LoRA-d8b4fe?style=flat)
5
+
6
+ [LTX Video](https://huggingface.co/Lightricks/LTX-Video) is the first DiT-based video generation model capable of generating high-quality videos in real-time. It produces 24 FPS videos at a 768x512 resolution faster than they can be watched. Trained on a large-scale dataset of diverse videos, the model generates high-resolution videos with realistic and varied content. Models are provided for both text-to-video and image + text-to-video use cases.
7
+
8
+ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
9
+
10
+ Available models:
11
+
12
+ | Model name | Recommended dtype |
+ | --- | --- |
+ | [`LTX Video 0.9.0`](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltx-video-2b-v0.9.safetensors) | `torch.bfloat16` |
+ | [`LTX Video 0.9.1`](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltx-video-2b-v0.9.1.safetensors) | `torch.bfloat16` |
23
+
24
+ Note: The recommended dtype is for the transformer component. The VAE and text encoders can be either `torch.float32`, `torch.bfloat16` or `torch.float16` but the recommended dtype is `torch.bfloat16` as used in the original repository.
25
+
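+ For instance, a minimal sketch that loads the full pipeline in the recommended `torch.bfloat16`:
+ 
+ Copied
+ 
+ import torch
+ from diffusers import LTXPipeline
+ 
+ # All components are loaded in bfloat16; the transformer is where the dtype matters most.
+ pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
+ pipe.to("cuda")
+ 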
26
+ [](#loading-single-files)Loading Single Files
27
+ ---------------------------------------------
28
+
29
+ Loading the original LTX Video checkpoints is also possible with `~ModelMixin.from_single_file`. We recommend using `from_single_file` for the Lightricks series of models, as they plan to release multiple models in the future in the single file format.
30
+
31
+ Copied
32
+
33
+ import torch
34
+ from diffusers import AutoencoderKLLTXVideo, LTXImageToVideoPipeline, LTXVideoTransformer3DModel
35
+
36
+ \# \`single\_file\_url\` could also be https://huggingface.co/Lightricks/LTX-Video/ltx-video-2b-v0.9.1.safetensors
37
+ single\_file\_url = "https://huggingface.co/Lightricks/LTX-Video/ltx-video-2b-v0.9.safetensors"
38
+ transformer = LTXVideoTransformer3DModel.from\_single\_file(
39
+ single\_file\_url, torch\_dtype=torch.bfloat16
40
+ )
41
+ vae = AutoencoderKLLTXVideo.from\_single\_file(single\_file\_url, torch\_dtype=torch.bfloat16)
42
+ pipe = LTXImageToVideoPipeline.from\_pretrained(
43
+ "Lightricks/LTX-Video", transformer=transformer, vae=vae, torch\_dtype=torch.bfloat16
44
+ )
45
+
46
+ \# ... inference code ...
47
+
48
+ Alternatively, the pipeline can be used to load the weights with `~FromSingleFileMixin.from_single_file`.
49
+
50
+ Copied
51
+
52
+ import torch
53
+ from diffusers import LTXImageToVideoPipeline
54
+ from transformers import T5EncoderModel, T5Tokenizer
55
+
56
+ single\_file\_url = "https://huggingface.co/Lightricks/LTX-Video/ltx-video-2b-v0.9.safetensors"
57
+ text\_encoder = T5EncoderModel.from\_pretrained(
58
+ "Lightricks/LTX-Video", subfolder="text\_encoder", torch\_dtype=torch.bfloat16
59
+ )
60
+ tokenizer = T5Tokenizer.from\_pretrained(
61
+ "Lightricks/LTX-Video", subfolder="tokenizer", torch\_dtype=torch.bfloat16
62
+ )
63
+ pipe = LTXImageToVideoPipeline.from\_single\_file(
64
+ single\_file\_url, text\_encoder=text\_encoder, tokenizer=tokenizer, torch\_dtype=torch.bfloat16
65
+ )
66
+
67
+ Loading [LTX GGUF checkpoints](https://huggingface.co/city96/LTX-Video-gguf) is also supported:
68
+
69
+ Copied
70
+
71
+ import torch
72
+ from diffusers.utils import export\_to\_video
73
+ from diffusers import LTXPipeline, LTXVideoTransformer3DModel, GGUFQuantizationConfig
74
+
75
+ ckpt\_path = (
76
+ "https://huggingface.co/city96/LTX-Video-gguf/blob/main/ltx-video-2b-v0.9-Q3\_K\_S.gguf"
77
+ )
78
+ transformer = LTXVideoTransformer3DModel.from\_single\_file(
79
+ ckpt\_path,
80
+ quantization\_config=GGUFQuantizationConfig(compute\_dtype=torch.bfloat16),
81
+ torch\_dtype=torch.bfloat16,
82
+ )
83
+ pipe = LTXPipeline.from\_pretrained(
84
+ "Lightricks/LTX-Video",
85
+ transformer=transformer,
86
+ torch\_dtype=torch.bfloat16,
87
+ )
88
+ pipe.enable\_model\_cpu\_offload()
89
+
90
+ prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
91
+ negative\_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
92
+
93
+ video = pipe(
94
+ prompt=prompt,
95
+ negative\_prompt=negative\_prompt,
96
+ width=704,
97
+ height=480,
98
+ num\_frames=161,
99
+ num\_inference\_steps=50,
100
+ ).frames\[0\]
101
+ export\_to\_video(video, "output\_gguf\_ltx.mp4", fps=24)
102
+
103
+ Make sure to read the [documentation on GGUF](../../quantization/gguf) to learn more about our GGUF support.
104
+
105
+ Loading and running inference with [LTX Video 0.9.1](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltx-video-2b-v0.9.1.safetensors) weights.
106
+
107
+ Copied
108
+
109
+ import torch
110
+ from diffusers import LTXPipeline
111
+ from diffusers.utils import export\_to\_video
112
+
113
+ pipe = LTXPipeline.from\_pretrained("a-r-r-o-w/LTX-Video-0.9.1-diffusers", torch\_dtype=torch.bfloat16)
114
+ pipe.to("cuda")
115
+
116
+ prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
117
+ negative\_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
118
+
119
+ video = pipe(
120
+ prompt=prompt,
121
+ negative\_prompt=negative\_prompt,
122
+ width=768,
123
+ height=512,
124
+ num\_frames=161,
125
+ decode\_timestep=0.03,
126
+ decode\_noise\_scale=0.025,
127
+ num\_inference\_steps=50,
128
+ ).frames\[0\]
129
+ export\_to\_video(video, "output.mp4", fps=24)
130
+
131
+ Refer to [this section](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogvideox#memory-optimization) to learn more about optimizing memory consumption.
132
+
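+ As a rough sketch, model offloading is the most broadly applicable of those options and can be combined with the LTX Video 0.9.1 settings shown above; the short prompt below is only a placeholder:
+ 
+ Copied
+ 
+ import torch
+ from diffusers import LTXPipeline
+ from diffusers.utils import export_to_video
+ 
+ pipe = LTXPipeline.from_pretrained("a-r-r-o-w/LTX-Video-0.9.1-diffusers", torch_dtype=torch.bfloat16)
+ 
+ # Keep only the currently active component on the GPU.
+ pipe.enable_model_cpu_offload()
+ 
+ video = pipe(
+     prompt="A woman with long brown hair smiles at another woman with long blonde hair.",
+     negative_prompt="worst quality, inconsistent motion, blurry, jittery, distorted",
+     width=768,
+     height=512,
+     num_frames=161,
+     decode_timestep=0.03,
+     decode_noise_scale=0.025,
+     num_inference_steps=50,
+ ).frames[0]
+ export_to_video(video, "output.mp4", fps=24)
+ 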
133
+ [](#quantization)Quantization
134
+ -----------------------------
135
+
136
+ Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.
137
+
138
+ Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [LTXPipeline](/docs/diffusers/main/en/api/pipelines/ltx_video#diffusers.LTXPipeline) for inference with bitsandbytes.
139
+
140
+ Copied
141
+
142
+ import torch
143
+ from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, LTXVideoTransformer3DModel, LTXPipeline
144
+ from diffusers.utils import export\_to\_video
145
+ from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel
146
+
147
+ quant\_config = BitsAndBytesConfig(load\_in\_8bit=True)
148
+ text\_encoder\_8bit = T5EncoderModel.from\_pretrained(
149
+ "Lightricks/LTX-Video",
150
+ subfolder="text\_encoder",
151
+ quantization\_config=quant\_config,
152
+ torch\_dtype=torch.float16,
153
+ )
154
+
155
+ quant\_config = DiffusersBitsAndBytesConfig(load\_in\_8bit=True)
156
+ transformer\_8bit = LTXVideoTransformer3DModel.from\_pretrained(
157
+ "Lightricks/LTX-Video",
158
+ subfolder="transformer",
159
+ quantization\_config=quant\_config,
160
+ torch\_dtype=torch.float16,
161
+ )
162
+
163
+ pipeline = LTXPipeline.from\_pretrained(
164
+ "Lightricks/LTX-Video",
165
+ text\_encoder=text\_encoder\_8bit,
166
+ transformer=transformer\_8bit,
167
+ torch\_dtype=torch.float16,
168
+ device\_map="balanced",
169
+ )
170
+
171
+ prompt = "A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting."
172
+ video = pipeline(prompt=prompt, num\_frames=161, num\_inference\_steps=50).frames\[0\]
173
+ export\_to\_video(video, "ship.mp4", fps=24)
174
+
175
+ [](#diffusers.LTXPipeline)LTXPipeline
176
+ -------------------------------------
177
+
178
+ ### class diffusers.LTXPipeline
179
+
180
+ [](#diffusers.LTXPipeline)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ltx/pipeline_ltx.py#L143)
181
+
182
+ ( scheduler: FlowMatchEulerDiscreteSchedulervae: AutoencoderKLLTXVideotext\_encoder: T5EncoderModeltokenizer: T5TokenizerFasttransformer: LTXVideoTransformer3DModel )
183
+
184
+ Parameters
185
+
186
+ * [](#diffusers.LTXPipeline.transformer)**transformer** ([LTXVideoTransformer3DModel](/docs/diffusers/main/en/api/models/ltx_video_transformer3d#diffusers.LTXVideoTransformer3DModel)) β€” Conditional Transformer architecture to denoise the encoded video latents.
187
+ * [](#diffusers.LTXPipeline.scheduler)**scheduler** ([FlowMatchEulerDiscreteScheduler](/docs/diffusers/main/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler)) β€” A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
188
+ * [](#diffusers.LTXPipeline.vae)**vae** ([AutoencoderKLLTXVideo](/docs/diffusers/main/en/api/models/autoencoderkl_ltx_video#diffusers.AutoencoderKLLTXVideo)) β€” Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
189
+ * [](#diffusers.LTXPipeline.text_encoder)**text\_encoder** (`T5EncoderModel`) β€” [T5](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5EncoderModel), specifically the [google/t5-v1\_1-xxl](https://huggingface.co/google/t5-v1_1-xxl) variant.
190
+ * [](#diffusers.LTXPipeline.tokenizer)**tokenizer** (`CLIPTokenizer`) β€” Tokenizer of class [CLIPTokenizer](https://huggingface.co/docs/transformers/en/model_doc/clip#transformers.CLIPTokenizer).
191
+ * [](#diffusers.LTXPipeline.tokenizer)**tokenizer** (`T5TokenizerFast`) β€” Second Tokenizer of class [T5TokenizerFast](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5TokenizerFast).
192
+
193
+ Pipeline for text-to-video generation.
194
+
195
+ Reference: [https://github.com/Lightricks/LTX-Video](https://github.com/Lightricks/LTX-Video)
196
+
197
+ #### \_\_call\_\_
198
+
199
+ [](#diffusers.LTXPipeline.__call__)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ltx/pipeline_ltx.py#L500)
200
+
201
+ ( prompt: typing.Union\[str, typing.List\[str\]\] = Nonenegative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = Noneheight: int = 512width: int = 704num\_frames: int = 161frame\_rate: int = 25num\_inference\_steps: int = 50timesteps: typing.List\[int\] = Noneguidance\_scale: float = 3num\_videos\_per\_prompt: typing.Optional\[int\] = 1generator: typing.Union\[torch.\_C.Generator, typing.List\[torch.\_C.Generator\], NoneType\] = Nonelatents: typing.Optional\[torch.Tensor\] = Noneprompt\_embeds: typing.Optional\[torch.Tensor\] = Noneprompt\_attention\_mask: typing.Optional\[torch.Tensor\] = Nonenegative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = Nonenegative\_prompt\_attention\_mask: typing.Optional\[torch.Tensor\] = Nonedecode\_timestep: typing.Union\[float, typing.List\[float\]\] = 0.0decode\_noise\_scale: typing.Union\[float, typing.List\[float\], NoneType\] = Noneoutput\_type: typing.Optional\[str\] = 'pil'return\_dict: bool = Trueattention\_kwargs: typing.Optional\[typing.Dict\[str, typing.Any\]\] = Nonecallback\_on\_step\_end: typing.Optional\[typing.Callable\[\[int, int, typing.Dict\], NoneType\]\] = Nonecallback\_on\_step\_end\_tensor\_inputs: typing.List\[str\] = \['latents'\]max\_sequence\_length: int = 128 ) β†’ export const metadata = 'undefined';`~pipelines.ltx.LTXPipelineOutput` or `tuple`
202
+
203
+
205
+ Parameters
206
+
207
+ * [](#diffusers.LTXPipeline.__call__.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts to guide the video generation. If not defined, one has to pass `prompt_embeds` instead.
208
+ * [](#diffusers.LTXPipeline.__call__.height)**height** (`int`, defaults to `512`) β€” The height in pixels of the generated video. The default value is chosen for the best results.
209
+ * [](#diffusers.LTXPipeline.__call__.width)**width** (`int`, defaults to `704`) β€” The width in pixels of the generated video. The default value is chosen for the best results.
210
+ * [](#diffusers.LTXPipeline.__call__.num_frames)**num\_frames** (`int`, defaults to `161`) β€” The number of video frames to generate
211
+ * [](#diffusers.LTXPipeline.__call__.num_inference_steps)**num\_inference\_steps** (`int`, _optional_, defaults to 50) β€” The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
212
+ * [](#diffusers.LTXPipeline.__call__.timesteps)**timesteps** (`List[int]`, _optional_) β€” Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed will be used. Must be in descending order.
213
+ * [](#diffusers.LTXPipeline.__call__.guidance_scale)**guidance\_scale** (`float`, defaults to `3`) β€” Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). `guidance_scale` is defined as `w` in equation 2 of the [Imagen Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are closely linked to the text `prompt`, usually at the expense of lower image quality.
214
+ * [](#diffusers.LTXPipeline.__call__.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” The number of videos to generate per prompt.
215
+ * [](#diffusers.LTXPipeline.__call__.generator)**generator** (`torch.Generator` or `List[torch.Generator]`, _optional_) β€” One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation deterministic.
216
+ * [](#diffusers.LTXPipeline.__call__.latents)**latents** (`torch.Tensor`, _optional_) β€” Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random `generator`.
217
+ * [](#diffusers.LTXPipeline.__call__.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
218
+ * [](#diffusers.LTXPipeline.__call__.prompt_attention_mask)**prompt\_attention\_mask** (`torch.Tensor`, _optional_) β€” Pre-generated attention mask for text embeddings.
219
+ * [](#diffusers.LTXPipeline.__call__.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.FloatTensor`, _optional_) β€” Pre-generated negative text embeddings. For PixArt-Sigma this negative prompt should be "". If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
220
+ * [](#diffusers.LTXPipeline.__call__.negative_prompt_attention_mask)**negative\_prompt\_attention\_mask** (`torch.FloatTensor`, _optional_) β€” Pre-generated attention mask for negative text embeddings.
221
+ * [](#diffusers.LTXPipeline.__call__.decode_timestep)**decode\_timestep** (`float`, defaults to `0.0`) β€” The timestep at which generated video is decoded.
222
+ * [](#diffusers.LTXPipeline.__call__.decode_noise_scale)**decode\_noise\_scale** (`float`, defaults to `None`) β€” The interpolation factor between random noise and denoised latents at the decode timestep.
223
+ * [](#diffusers.LTXPipeline.__call__.output_type)**output\_type** (`str`, _optional_, defaults to `"pil"`) β€” The output format of the generated video. Choose between [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
224
+ * [](#diffusers.LTXPipeline.__call__.return_dict)**return\_dict** (`bool`, _optional_, defaults to `True`) β€” Whether or not to return a `~pipelines.ltx.LTXPipelineOutput` instead of a plain tuple.
225
+ * [](#diffusers.LTXPipeline.__call__.attention_kwargs)**attention\_kwargs** (`dict`, _optional_) β€” A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under `self.processor` in [diffusers.models.attention\_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
226
+ * [](#diffusers.LTXPipeline.__call__.callback_on_step_end)**callback\_on\_step\_end** (`Callable`, _optional_) β€” A function that is called at the end of each denoising step during inference. The function is called with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
227
+ * [](#diffusers.LTXPipeline.__call__.callback_on_step_end_tensor_inputs)**callback\_on\_step\_end\_tensor\_inputs** (`List`, _optional_) β€” The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
228
+ * [](#diffusers.LTXPipeline.__call__.max_sequence_length)**max\_sequence\_length** (`int` defaults to `128` ) β€” Maximum sequence length to use with the `prompt`.
229
+
230
+ Returns
231
+
232
+
234
+ `~pipelines.ltx.LTXPipelineOutput` or `tuple`
235
+
236
+
238
+ If `return_dict` is `True`, `~pipelines.ltx.LTXPipelineOutput` is returned, otherwise a `tuple` is returned where the first element is a list with the generated images.
239
+
240
+ Function invoked when calling the pipeline for generation.
241
+
242
+ [](#diffusers.LTXPipeline.__call__.example)
243
+
244
+ Examples:
245
+
246
+ Copied
247
+
248
+ \>>> import torch
249
+ \>>> from diffusers import LTXPipeline
250
+ \>>> from diffusers.utils import export\_to\_video
251
+
252
+ \>>> pipe = LTXPipeline.from\_pretrained("Lightricks/LTX-Video", torch\_dtype=torch.bfloat16)
253
+ \>>> pipe.to("cuda")
254
+
255
+ \>>> prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
256
+ \>>> negative\_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
257
+
258
+ \>>> video = pipe(
259
+ ... prompt=prompt,
260
+ ... negative\_prompt=negative\_prompt,
261
+ ... width=704,
262
+ ... height=480,
263
+ ... num\_frames=161,
264
+ ... num\_inference\_steps=50,
265
+ ... ).frames\[0\]
266
+ \>>> export\_to\_video(video, "output.mp4", fps=24)
267
+
268
+ #### encode\_prompt
269
+
270
+ [](#diffusers.LTXPipeline.encode_prompt)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ltx/pipeline_ltx.py#L256)
271
+
272
+ ( prompt: typing.Union\[str, typing.List\[str\]\]negative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = Nonedo\_classifier\_free\_guidance: bool = Truenum\_videos\_per\_prompt: int = 1prompt\_embeds: typing.Optional\[torch.Tensor\] = Nonenegative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = Noneprompt\_attention\_mask: typing.Optional\[torch.Tensor\] = Nonenegative\_prompt\_attention\_mask: typing.Optional\[torch.Tensor\] = Nonemax\_sequence\_length: int = 128device: typing.Optional\[torch.device\] = Nonedtype: typing.Optional\[torch.dtype\] = None )
273
+
274
+ Parameters
275
+
276
+ * [](#diffusers.LTXPipeline.encode_prompt.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” prompt to be encoded
277
+ * [](#diffusers.LTXPipeline.encode_prompt.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
278
+ * [](#diffusers.LTXPipeline.encode_prompt.do_classifier_free_guidance)**do\_classifier\_free\_guidance** (`bool`, _optional_, defaults to `True`) β€” Whether to use classifier free guidance or not.
279
+ * [](#diffusers.LTXPipeline.encode_prompt.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” Number of videos that should be generated per prompt.
280
+ * [](#diffusers.LTXPipeline.encode_prompt.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
281
+ * [](#diffusers.LTXPipeline.encode_prompt.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
282
+ * [](#diffusers.LTXPipeline.encode_prompt.device)**device** β€” (`torch.device`, _optional_): torch device
283
+ * [](#diffusers.LTXPipeline.encode_prompt.dtype)**dtype** β€” (`torch.dtype`, _optional_): torch dtype
284
+
285
+ Encodes the prompt into text encoder hidden states.
286
+
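+ A minimal sketch of precomputing embeddings with `encode_prompt()` and feeding them back into the pipeline; it assumes the method returns the prompt embeddings and attention masks for the positive and negative prompts, in that order, and the prompt text is only a placeholder:
+ 
+ Copied
+ 
+ import torch
+ from diffusers import LTXPipeline
+ 
+ pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16).to("cuda")
+ 
+ # Assumed return order: positive embeddings and mask, then negative embeddings and mask.
+ (
+     prompt_embeds,
+     prompt_attention_mask,
+     negative_prompt_embeds,
+     negative_prompt_attention_mask,
+ ) = pipe.encode_prompt(
+     prompt="A detailed wooden toy ship gliding over a plush blue carpet.",
+     negative_prompt="worst quality, inconsistent motion, blurry, jittery, distorted",
+ )
+ 
+ video = pipe(
+     prompt_embeds=prompt_embeds,
+     prompt_attention_mask=prompt_attention_mask,
+     negative_prompt_embeds=negative_prompt_embeds,
+     negative_prompt_attention_mask=negative_prompt_attention_mask,
+     width=704,
+     height=480,
+     num_frames=161,
+     num_inference_steps=50,
+ ).frames[0]
+ 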
287
+ [](#diffusers.LTXImageToVideoPipeline)LTXImageToVideoPipeline
288
+ -------------------------------------------------------------
289
+
290
+ ### class diffusers.LTXImageToVideoPipeline
291
+
292
+ [](#diffusers.LTXImageToVideoPipeline)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ltx/pipeline_ltx_image2video.py#L162)
293
+
294
+ ( scheduler: FlowMatchEulerDiscreteSchedulervae: AutoencoderKLLTXVideotext\_encoder: T5EncoderModeltokenizer: T5TokenizerFasttransformer: LTXVideoTransformer3DModel )
295
+
296
+ Parameters
297
+
298
+ * [](#diffusers.LTXImageToVideoPipeline.transformer)**transformer** ([LTXVideoTransformer3DModel](/docs/diffusers/main/en/api/models/ltx_video_transformer3d#diffusers.LTXVideoTransformer3DModel)) β€” Conditional Transformer architecture to denoise the encoded video latents.
299
+ * [](#diffusers.LTXImageToVideoPipeline.scheduler)**scheduler** ([FlowMatchEulerDiscreteScheduler](/docs/diffusers/main/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler)) β€” A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
300
+ * [](#diffusers.LTXImageToVideoPipeline.vae)**vae** ([AutoencoderKLLTXVideo](/docs/diffusers/main/en/api/models/autoencoderkl_ltx_video#diffusers.AutoencoderKLLTXVideo)) β€” Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
301
+ * [](#diffusers.LTXImageToVideoPipeline.text_encoder)**text\_encoder** (`T5EncoderModel`) β€” [T5](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5EncoderModel), specifically the [google/t5-v1\_1-xxl](https://huggingface.co/google/t5-v1_1-xxl) variant.
302
+ * [](#diffusers.LTXImageToVideoPipeline.tokenizer)**tokenizer** (`CLIPTokenizer`) β€” Tokenizer of class [CLIPTokenizer](https://huggingface.co/docs/transformers/en/model_doc/clip#transformers.CLIPTokenizer).
303
+ * [](#diffusers.LTXImageToVideoPipeline.tokenizer)**tokenizer** (`T5TokenizerFast`) β€” Second Tokenizer of class [T5TokenizerFast](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5TokenizerFast).
304
+
305
+ Pipeline for image-to-video generation.
306
+
307
+ Reference: [https://github.com/Lightricks/LTX-Video](https://github.com/Lightricks/LTX-Video)
308
+
309
+ #### \_\_call\_\_
310
+
311
+ [](#diffusers.LTXImageToVideoPipeline.__call__)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ltx/pipeline_ltx_image2video.py#L559)
312
+
313
+ ( image: typing.Union\[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List\[PIL.Image.Image\], typing.List\[numpy.ndarray\], typing.List\[torch.Tensor\]\] = Noneprompt: typing.Union\[str, typing.List\[str\]\] = Nonenegative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = Noneheight: int = 512width: int = 704num\_frames: int = 161frame\_rate: int = 25num\_inference\_steps: int = 50timesteps: typing.List\[int\] = Noneguidance\_scale: float = 3num\_videos\_per\_prompt: typing.Optional\[int\] = 1generator: typing.Union\[torch.\_C.Generator, typing.List\[torch.\_C.Generator\], NoneType\] = Nonelatents: typing.Optional\[torch.Tensor\] = Noneprompt\_embeds: typing.Optional\[torch.Tensor\] = Noneprompt\_attention\_mask: typing.Optional\[torch.Tensor\] = Nonenegative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = Nonenegative\_prompt\_attention\_mask: typing.Optional\[torch.Tensor\] = Nonedecode\_timestep: typing.Union\[float, typing.List\[float\]\] = 0.0decode\_noise\_scale: typing.Union\[float, typing.List\[float\], NoneType\] = Noneoutput\_type: typing.Optional\[str\] = 'pil'return\_dict: bool = Trueattention\_kwargs: typing.Optional\[typing.Dict\[str, typing.Any\]\] = Nonecallback\_on\_step\_end: typing.Optional\[typing.Callable\[\[int, int, typing.Dict\], NoneType\]\] = Nonecallback\_on\_step\_end\_tensor\_inputs: typing.List\[str\] = \['latents'\]max\_sequence\_length: int = 128 ) β†’ export const metadata = 'undefined';`~pipelines.ltx.LTXPipelineOutput` or `tuple`
314
+
315
+ Expand 23 parameters
316
+
317
+ Parameters
318
+
319
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.image)**image** (`PipelineImageInput`) β€” The input image to condition the generation on. Must be an image, a list of images or a `torch.Tensor`.
320
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds` instead.
321
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.height)**height** (`int`, defaults to `512`) β€” The height in pixels of the generated video.
322
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.width)**width** (`int`, defaults to `704`) β€” The width in pixels of the generated video.
323
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.num_frames)**num\_frames** (`int`, defaults to `161`) β€” The number of video frames to generate.
324
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.num_inference_steps)**num\_inference\_steps** (`int`, _optional_, defaults to 50) β€” The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
325
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.timesteps)**timesteps** (`List[int]`, _optional_) β€” Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed will be used. Must be in descending order.
326
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.guidance_scale)**guidance\_scale** (`float`, defaults to `3`) β€” Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). `guidance_scale` is defined as `w` of equation 2 of the [Imagen Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are closely linked to the text `prompt`, usually at the expense of lower image quality.
327
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” The number of videos to generate per prompt.
328
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.generator)**generator** (`torch.Generator` or `List[torch.Generator]`, _optional_) β€” One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation deterministic.
329
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.latents)**latents** (`torch.Tensor`, _optional_) β€” Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random `generator`.
330
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
331
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.prompt_attention_mask)**prompt\_attention\_mask** (`torch.Tensor`, _optional_) β€” Pre-generated attention mask for text embeddings.
332
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.FloatTensor`, _optional_) β€” Pre-generated negative text embeddings. If not provided, negative\_prompt\_embeds will be generated from the `negative_prompt` input argument.
333
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.negative_prompt_attention_mask)**negative\_prompt\_attention\_mask** (`torch.FloatTensor`, _optional_) β€” Pre-generated attention mask for negative text embeddings.
334
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.decode_timestep)**decode\_timestep** (`float`, defaults to `0.0`) β€” The timestep at which generated video is decoded.
335
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.decode_noise_scale)**decode\_noise\_scale** (`float`, defaults to `None`) β€” The interpolation factor between random noise and denoised latents at the decode timestep.
336
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.output_type)**output\_type** (`str`, _optional_, defaults to `"pil"`) β€” The output format of the generated image. Choose between [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
337
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.return_dict)**return\_dict** (`bool`, _optional_, defaults to `True`) β€” Whether or not to return a `~pipelines.ltx.LTXPipelineOutput` instead of a plain tuple.
338
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.attention_kwargs)**attention\_kwargs** (`dict`, _optional_) β€” A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under `self.processor` in [diffusers.models.attention\_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
339
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.callback_on_step_end)**callback\_on\_step\_end** (`Callable`, _optional_) β€” A function that is called at the end of each denoising step during inference. The function is called with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
340
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.callback_on_step_end_tensor_inputs)**callback\_on\_step\_end\_tensor\_inputs** (`List`, _optional_) β€” The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
341
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.max_sequence_length)**max\_sequence\_length** (`int`, defaults to `128`) β€” Maximum sequence length to use with the `prompt`.
342
+
343
+ Returns
344
+
345
+
346
+
347
+ `~pipelines.ltx.LTXPipelineOutput` or `tuple`
348
+
349
+
350
+
351
+ If `return_dict` is `True`, `~pipelines.ltx.LTXPipelineOutput` is returned, otherwise a `tuple` is returned where the first element is a list with the generated images.
352
+
353
+ Function invoked when calling the pipeline for generation.
354
+
355
+ [](#diffusers.LTXImageToVideoPipeline.__call__.example)
356
+
357
+ Examples:
358
+
359
+ Copied
360
+
361
+ \>>> import torch
362
+ \>>> from diffusers import LTXImageToVideoPipeline
363
+ \>>> from diffusers.utils import export\_to\_video, load\_image
364
+
365
+ \>>> pipe = LTXImageToVideoPipeline.from\_pretrained("Lightricks/LTX-Video", torch\_dtype=torch.bfloat16)
366
+ \>>> pipe.to("cuda")
367
+
368
+ \>>> image = load\_image(
369
+ ... "https://huggingface.co/datasets/a-r-r-o-w/tiny-meme-dataset-captioned/resolve/main/images/8.png"
370
+ ... )
371
+ \>>> prompt = "A young girl stands calmly in the foreground, looking directly at the camera, as a house fire rages in the background. Flames engulf the structure, with smoke billowing into the air. Firefighters in protective gear rush to the scene, a fire truck labeled '38' visible behind them. The girl's neutral expression contrasts sharply with the chaos of the fire, creating a poignant and emotionally charged scene."
372
+ \>>> negative\_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
373
+
374
+ \>>> video = pipe(
375
+ ... image=image,
376
+ ... prompt=prompt,
377
+ ... negative\_prompt=negative\_prompt,
378
+ ... width=704,
379
+ ... height=480,
380
+ ... num\_frames=161,
381
+ ... num\_inference\_steps=50,
382
+ ... ).frames\[0\]
383
+ \>>> export\_to\_video(video, "output.mp4", fps=24)
384
+
385
+ #### encode\_prompt
386
+
387
+ [](#diffusers.LTXImageToVideoPipeline.encode_prompt)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ltx/pipeline_ltx_image2video.py#L279)
388
+
389
+ ( prompt: typing.Union\[str, typing.List\[str\]\], negative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, do\_classifier\_free\_guidance: bool = True, num\_videos\_per\_prompt: int = 1, prompt\_embeds: typing.Optional\[torch.Tensor\] = None, negative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = None, prompt\_attention\_mask: typing.Optional\[torch.Tensor\] = None, negative\_prompt\_attention\_mask: typing.Optional\[torch.Tensor\] = None, max\_sequence\_length: int = 128, device: typing.Optional\[torch.device\] = None, dtype: typing.Optional\[torch.dtype\] = None )
390
+
391
+ Parameters
392
+
393
+ * [](#diffusers.LTXImageToVideoPipeline.encode_prompt.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” prompt to be encoded
394
+ * [](#diffusers.LTXImageToVideoPipeline.encode_prompt.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
395
+ * [](#diffusers.LTXImageToVideoPipeline.encode_prompt.do_classifier_free_guidance)**do\_classifier\_free\_guidance** (`bool`, _optional_, defaults to `True`) β€” Whether to use classifier free guidance or not.
396
+ * [](#diffusers.LTXImageToVideoPipeline.encode_prompt.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” Number of videos that should be generated per prompt.
397
+ * [](#diffusers.LTXImageToVideoPipeline.encode_prompt.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
398
+ * [](#diffusers.LTXImageToVideoPipeline.encode_prompt.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
399
+ * [](#diffusers.LTXImageToVideoPipeline.encode_prompt.device)**device** β€” (`torch.device`, _optional_): torch device
400
+ * [](#diffusers.LTXImageToVideoPipeline.encode_prompt.dtype)**dtype** β€” (`torch.dtype`, _optional_): torch dtype
401
+
402
+ Encodes the prompt into text encoder hidden states.
403
+
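+ The embeddings returned by `encode_prompt()` can be precomputed once and passed back into the pipeline call, for example to reuse the same prompt across several generations. The following is a minimal sketch (not part of the upstream reference), assuming the method returns `prompt_embeds, prompt_attention_mask, negative_prompt_embeds, negative_prompt_attention_mask` in that order and that `pipe` and `image` are defined as in the example above:
+
+ Copied
+
+ \>>> prompt\_embeds, prompt\_attention\_mask, negative\_prompt\_embeds, negative\_prompt\_attention\_mask = pipe.encode\_prompt(
+ ... prompt="A young girl stands calmly in the foreground, looking directly at the camera",
+ ... negative\_prompt="worst quality, inconsistent motion, blurry, jittery, distorted",
+ ... )
+ \>>> video = pipe(
+ ... image=image,
+ ... prompt\_embeds=prompt\_embeds,
+ ... prompt\_attention\_mask=prompt\_attention\_mask,
+ ... negative\_prompt\_embeds=negative\_prompt\_embeds,
+ ... negative\_prompt\_attention\_mask=negative\_prompt\_attention\_mask,
+ ... width=704,
+ ... height=480,
+ ... num\_frames=161,
+ ... ).frames\[0\]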
404
+ [](#diffusers.pipelines.ltx.pipeline_output.LTXPipelineOutput)LTXPipelineOutput
405
+ -------------------------------------------------------------------------------
406
+
407
+ ### class diffusers.pipelines.ltx.pipeline\_output.LTXPipelineOutput
408
+
409
+ [](#diffusers.pipelines.ltx.pipeline_output.LTXPipelineOutput)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ltx/pipeline_output.py#L8)
410
+
411
+ ( frames: Tensor )
412
+
413
+ Parameters
414
+
415
+ * [](#diffusers.pipelines.ltx.pipeline_output.LTXPipelineOutput.frames)**frames** (`torch.Tensor`, `np.ndarray`, or List\[List\[PIL.Image.Image\]\]) β€” List of video outputs. It can be a nested list of length `batch_size`, with each sub-list containing denoised PIL image sequences of length `num_frames`. It can also be a NumPy array or Torch tensor of shape `(batch_size, num_frames, channels, height, width)`.
416
+
417
+ Output class for LTX pipelines.
418
+
419
+
420
+
421
+
docs/diffusers/Using Diffusers for Wan.md ADDED
@@ -0,0 +1,307 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [](#wan)Wan
2
+ ===========
3
+
4
+ [Wan 2.1](https://github.com/Wan-Video/Wan2.1) by the Alibaba Wan Team.
5
+
6
+ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
7
+
8
+ Recommendations for inference:
9
+
10
+ * VAE in `torch.float32` for better decoding quality.
11
+ * `num_frames` should be of the form `4 * k + 1`, for example `49` or `81` (see the helper snippet after this list).
12
+ * For smaller resolution videos, try lower values of `shift` (between `2.0` and `5.0`) in the [Scheduler](https://huggingface.co/docs/diffusers/main/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler.shift). For larger resolution videos, try higher values (between `7.0` and `12.0`). The default value is `3.0` for Wan.
13
+
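+ For example, a small helper (hypothetical, not part of the library) can round a requested frame count down to the nearest `4 * k + 1` value:
+
+ Copied
+
+ def round\_num\_frames(requested: int) -> int:
+     \# Round down to the nearest 4 \* k + 1 value (e.g. 49, 81), as recommended above
+     return max(1, (requested - 1) // 4 \* 4 + 1)
+
+ round\_num\_frames(100)  \# 97
+ round\_num\_frames(81)  \# 81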
14
+ ### [](#using-a-custom-scheduler)Using a custom scheduler
15
+
16
+ Wan can be used with many different schedulers, each with their own benefits regarding speed and generation quality. By default, Wan uses the `UniPCMultistepScheduler(prediction_type="flow_prediction", use_flow_sigmas=True, flow_shift=3.0)` scheduler. You can use a different scheduler as follows:
17
+
18
+ Copied
19
+
20
+ from diffusers import FlowMatchEulerDiscreteScheduler, UniPCMultistepScheduler, WanPipeline
21
+
22
+ scheduler\_a = FlowMatchEulerDiscreteScheduler(shift=5.0)
23
+ scheduler\_b = UniPCMultistepScheduler(prediction\_type="flow\_prediction", use\_flow\_sigmas=True, flow\_shift=4.0)
24
+
25
+ pipe = WanPipeline.from\_pretrained("Wan-AI/Wan2.1-T2V-1.3B-Diffusers", scheduler=<CUSTOM\_SCHEDULER\_HERE>)
26
+
27
+ \# or,
28
+ pipe.scheduler = <CUSTOM\_SCHEDULER\_HERE>
29
+
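+ For example, to switch the loaded pipeline to the flow-matching Euler scheduler defined above:
+
+ Copied
+
+ pipe.scheduler = scheduler\_a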
30
+ ### [](#using-single-file-loading-with-wan)Using single file loading with Wan
31
+
32
+ The `WanTransformer3DModel` and `AutoencoderKLWan` models support loading checkpoints in their original format via the `from_single_file` loading method.
33
+
34
+ Copied
35
+
36
+ import torch
37
+ from diffusers import WanPipeline, WanTransformer3DModel
38
+
39
+ ckpt\_path = "https://huggingface.co/Comfy-Org/Wan\_2.1\_ComfyUI\_repackaged/blob/main/split\_files/diffusion\_models/wan2.1\_t2v\_1.3B\_bf16.safetensors"
40
+ transformer = WanTransformer3DModel.from\_single\_file(ckpt\_path, torch\_dtype=torch.bfloat16)
41
+
42
+ pipe = WanPipeline.from\_pretrained("Wan-AI/Wan2.1-T2V-1.3B-Diffusers", transformer=transformer)
43
+
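+ The VAE can be loaded the same way. Here is a sketch with placeholder paths (substitute real single-file checkpoints; keeping the VAE in `torch.float32` follows the recommendation above):
+
+ Copied
+
+ import torch
+ from diffusers import AutoencoderKLWan, WanPipeline, WanTransformer3DModel
+
+ \# Placeholder paths: point these at actual single-file checkpoints
+ vae = AutoencoderKLWan.from\_single\_file("<PATH\_OR\_URL\_TO\_VAE\_CHECKPOINT>", torch\_dtype=torch.float32)
+ transformer = WanTransformer3DModel.from\_single\_file("<PATH\_OR\_URL\_TO\_TRANSFORMER\_CHECKPOINT>", torch\_dtype=torch.bfloat16)
+
+ pipe = WanPipeline.from\_pretrained(
+     "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", vae=vae, transformer=transformer, torch\_dtype=torch.bfloat16
+ )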
44
+ [](#diffusers.WanPipeline)WanPipeline
45
+ -------------------------------------
46
+
47
+ ### class diffusers.WanPipeline
48
+
49
+ [](#diffusers.WanPipeline)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/wan/pipeline_wan.py#L93)
50
+
51
+ ( tokenizer: AutoTokenizer, text\_encoder: UMT5EncoderModel, transformer: WanTransformer3DModel, vae: AutoencoderKLWan, scheduler: FlowMatchEulerDiscreteScheduler )
52
+
53
+ Parameters
54
+
55
+ * [](#diffusers.WanPipeline.tokenizer)**tokenizer** (`T5Tokenizer`) β€” Tokenizer from [T5](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5Tokenizer), specifically the [google/umt5-xxl](https://huggingface.co/google/umt5-xxl) variant.
56
+ * [](#diffusers.WanPipeline.text_encoder)**text\_encoder** (`T5EncoderModel`) β€” [T5](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5EncoderModel), specifically the [google/umt5-xxl](https://huggingface.co/google/umt5-xxl) variant.
57
+ * [](#diffusers.WanPipeline.transformer)**transformer** ([WanTransformer3DModel](/docs/diffusers/main/en/api/models/wan_transformer_3d#diffusers.WanTransformer3DModel)) β€” Conditional Transformer to denoise the input latents.
58
+ * [](#diffusers.WanPipeline.scheduler)**scheduler** ([UniPCMultistepScheduler](/docs/diffusers/main/en/api/schedulers/unipc#diffusers.UniPCMultistepScheduler)) β€” A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
59
+ * [](#diffusers.WanPipeline.vae)**vae** ([AutoencoderKLWan](/docs/diffusers/main/en/api/models/autoencoder_kl_wan#diffusers.AutoencoderKLWan)) β€” Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
60
+
61
+ Pipeline for text-to-video generation using Wan.
62
+
63
+ This model inherits from [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
64
+
65
+ #### \_\_call\_\_
66
+
67
+ [](#diffusers.WanPipeline.__call__)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/wan/pipeline_wan.py#L359)
68
+
69
+ ( prompt: typing.Union\[str, typing.List\[str\]\] = None, negative\_prompt: typing.Union\[str, typing.List\[str\]\] = None, height: int = 480, width: int = 832, num\_frames: int = 81, num\_inference\_steps: int = 50, guidance\_scale: float = 5.0, num\_videos\_per\_prompt: typing.Optional\[int\] = 1, generator: typing.Union\[torch.\_C.Generator, typing.List\[torch.\_C.Generator\], NoneType\] = None, latents: typing.Optional\[torch.Tensor\] = None, prompt\_embeds: typing.Optional\[torch.Tensor\] = None, negative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = None, output\_type: typing.Optional\[str\] = 'np', return\_dict: bool = True, attention\_kwargs: typing.Optional\[typing.Dict\[str, typing.Any\]\] = None, callback\_on\_step\_end: typing.Union\[typing.Callable\[\[int, int, typing.Dict\], NoneType\], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType\] = None, callback\_on\_step\_end\_tensor\_inputs: typing.List\[str\] = \['latents'\], max\_sequence\_length: int = 512 ) → `~WanPipelineOutput` or `tuple`
70
+
71
+ Expand 16 parameters
72
+
73
+ Parameters
74
+
75
+ * [](#diffusers.WanPipeline.__call__.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds` instead.
76
+ * [](#diffusers.WanPipeline.__call__.height)**height** (`int`, defaults to `480`) β€” The height in pixels of the generated image.
77
+ * [](#diffusers.WanPipeline.__call__.width)**width** (`int`, defaults to `832`) β€” The width in pixels of the generated image.
78
+ * [](#diffusers.WanPipeline.__call__.num_frames)**num\_frames** (`int`, defaults to `81`) β€” The number of frames in the generated video.
79
+ * [](#diffusers.WanPipeline.__call__.num_inference_steps)**num\_inference\_steps** (`int`, defaults to `50`) β€” The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
80
+ * [](#diffusers.WanPipeline.__call__.guidance_scale)**guidance\_scale** (`float`, defaults to `5.0`) β€” Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). `guidance_scale` is defined as `w` of equation 2 of the [Imagen Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are closely linked to the text `prompt`, usually at the expense of lower image quality.
81
+ * [](#diffusers.WanPipeline.__call__.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” The number of videos to generate per prompt.
82
+ * [](#diffusers.WanPipeline.__call__.generator)**generator** (`torch.Generator` or `List[torch.Generator]`, _optional_) β€” A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation deterministic.
83
+ * [](#diffusers.WanPipeline.__call__.latents)**latents** (`torch.Tensor`, _optional_) β€” Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random `generator`.
84
+ * [](#diffusers.WanPipeline.__call__.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the `prompt` input argument.
85
+ * [](#diffusers.WanPipeline.__call__.output_type)**output\_type** (`str`, _optional_, defaults to `"np"`) β€” The output format of the generated image. Choose between `PIL.Image` or `np.array`.
86
+ * [](#diffusers.WanPipeline.__call__.return_dict)**return\_dict** (`bool`, _optional_, defaults to `True`) β€” Whether or not to return a `WanPipelineOutput` instead of a plain tuple.
87
+ * [](#diffusers.WanPipeline.__call__.attention_kwargs)**attention\_kwargs** (`dict`, _optional_) β€” A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under `self.processor` in [diffusers.models.attention\_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
88
+ * [](#diffusers.WanPipeline.__call__.callback_on_step_end)**callback\_on\_step\_end** (`Callable`, `PipelineCallback`, `MultiPipelineCallbacks`, _optional_) β€” A function or a subclass of `PipelineCallback` or `MultiPipelineCallbacks` that is called at the end of each denoising step during inference, with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
89
+ * [](#diffusers.WanPipeline.__call__.callback_on_step_end_tensor_inputs)**callback\_on\_step\_end\_tensor\_inputs** (`List`, _optional_) β€” The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
90
+ * [](#diffusers.WanPipeline.__call__.autocast_dtype)**autocast\_dtype** (`torch.dtype`, _optional_, defaults to `torch.bfloat16`) β€” The dtype to use for the torch.amp.autocast.
91
+
92
+ Returns
93
+
94
+
95
+
96
+ `~WanPipelineOutput` or `tuple`
97
+
98
+
99
+
100
+ If `return_dict` is `True`, `WanPipelineOutput` is returned, otherwise a `tuple` is returned where the first element is a list with the generated videos.
101
+
102
+ The call function to the pipeline for generation.
103
+
104
+ [](#diffusers.WanPipeline.__call__.example)
105
+
106
+ Examples:
107
+
108
+ Copied
109
+
110
+ \>>> import torch
111
+ \>>> from diffusers.utils import export\_to\_video
112
+ \>>> from diffusers import AutoencoderKLWan, WanPipeline
113
+ \>>> from diffusers.schedulers.scheduling\_unipc\_multistep import UniPCMultistepScheduler
114
+
115
+ \>>> \# Available models: Wan-AI/Wan2.1-T2V-14B-Diffusers, Wan-AI/Wan2.1-T2V-1.3B-Diffusers
116
+ \>>> model\_id = "Wan-AI/Wan2.1-T2V-14B-Diffusers"
117
+ \>>> vae = AutoencoderKLWan.from\_pretrained(model\_id, subfolder="vae", torch\_dtype=torch.float32)
118
+ \>>> pipe = WanPipeline.from\_pretrained(model\_id, vae=vae, torch\_dtype=torch.bfloat16)
119
+ \>>> flow\_shift = 5.0 \# 5.0 for 720P, 3.0 for 480P
120
+ \>>> pipe.scheduler = UniPCMultistepScheduler.from\_config(pipe.scheduler.config, flow\_shift=flow\_shift)
121
+ \>>> pipe.to("cuda")
122
+
123
+ \>>> prompt = "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window."
124
+ \>>> negative\_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
125
+
126
+ \>>> output = pipe(
127
+ ... prompt=prompt,
128
+ ... negative\_prompt=negative\_prompt,
129
+ ... height=720,
130
+ ... width=1280,
131
+ ... num\_frames=81,
132
+ ... guidance\_scale=5.0,
133
+ ... ).frames\[0\]
134
+ \>>> export\_to\_video(output, "output.mp4", fps=16)
135
+
136
+ #### encode\_prompt
137
+
138
+ [](#diffusers.WanPipeline.encode_prompt)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/wan/pipeline_wan.py#L181)
139
+
140
+ ( prompt: typing.Union\[str, typing.List\[str\]\], negative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, do\_classifier\_free\_guidance: bool = True, num\_videos\_per\_prompt: int = 1, prompt\_embeds: typing.Optional\[torch.Tensor\] = None, negative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = None, max\_sequence\_length: int = 226, device: typing.Optional\[torch.device\] = None, dtype: typing.Optional\[torch.dtype\] = None )
141
+
142
+ Parameters
143
+
144
+ * [](#diffusers.WanPipeline.encode_prompt.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” prompt to be encoded
145
+ * [](#diffusers.WanPipeline.encode_prompt.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
146
+ * [](#diffusers.WanPipeline.encode_prompt.do_classifier_free_guidance)**do\_classifier\_free\_guidance** (`bool`, _optional_, defaults to `True`) β€” Whether to use classifier free guidance or not.
147
+ * [](#diffusers.WanPipeline.encode_prompt.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” Number of videos that should be generated per prompt.
148
+ * [](#diffusers.WanPipeline.encode_prompt.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
149
+ * [](#diffusers.WanPipeline.encode_prompt.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
150
+ * [](#diffusers.WanPipeline.encode_prompt.device)**device** β€” (`torch.device`, _optional_): torch device
151
+ * [](#diffusers.WanPipeline.encode_prompt.dtype)**dtype** β€” (`torch.dtype`, _optional_): torch dtype
152
+
153
+ Encodes the prompt into text encoder hidden states.
154
+
155
+ [](#diffusers.WanImageToVideoPipeline)WanImageToVideoPipeline
156
+ -------------------------------------------------------------
157
+
158
+ ### class diffusers.WanImageToVideoPipeline
159
+
160
+ [](#diffusers.WanImageToVideoPipeline)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/wan/pipeline_wan_i2v.py#L124)
161
+
162
+ ( tokenizer: AutoTokenizer, text\_encoder: UMT5EncoderModel, image\_encoder: CLIPVisionModel, image\_processor: CLIPImageProcessor, transformer: WanTransformer3DModel, vae: AutoencoderKLWan, scheduler: FlowMatchEulerDiscreteScheduler )
163
+
164
+ Parameters
165
+
166
+ * [](#diffusers.WanImageToVideoPipeline.tokenizer)**tokenizer** (`T5Tokenizer`) β€” Tokenizer from [T5](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5Tokenizer), specifically the [google/umt5-xxl](https://huggingface.co/google/umt5-xxl) variant.
167
+ * [](#diffusers.WanImageToVideoPipeline.text_encoder)**text\_encoder** (`T5EncoderModel`) β€” [T5](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5EncoderModel), specifically the [google/umt5-xxl](https://huggingface.co/google/umt5-xxl) variant.
168
+ * [](#diffusers.WanImageToVideoPipeline.image_encoder)**image\_encoder** (`CLIPVisionModel`) β€” [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPVisionModel), specifically the [clip-vit-huge-patch14](https://github.com/mlfoundations/open_clip/blob/main/docs/PRETRAINED.md#vit-h14-xlm-roberta-large) variant.
169
+ * [](#diffusers.WanImageToVideoPipeline.transformer)**transformer** ([WanTransformer3DModel](/docs/diffusers/main/en/api/models/wan_transformer_3d#diffusers.WanTransformer3DModel)) β€” Conditional Transformer to denoise the input latents.
170
+ * [](#diffusers.WanImageToVideoPipeline.scheduler)**scheduler** ([UniPCMultistepScheduler](/docs/diffusers/main/en/api/schedulers/unipc#diffusers.UniPCMultistepScheduler)) β€” A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
171
+ * [](#diffusers.WanImageToVideoPipeline.vae)**vae** ([AutoencoderKLWan](/docs/diffusers/main/en/api/models/autoencoder_kl_wan#diffusers.AutoencoderKLWan)) β€” Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
172
+
173
+ Pipeline for image-to-video generation using Wan.
174
+
175
+ This model inherits from [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
176
+
177
+ #### \_\_call\_\_
178
+
179
+ [](#diffusers.WanImageToVideoPipeline.__call__)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/wan/pipeline_wan_i2v.py#L441)
180
+
181
+ ( image: typing.Union\[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List\[PIL.Image.Image\], typing.List\[numpy.ndarray\], typing.List\[torch.Tensor\]\], prompt: typing.Union\[str, typing.List\[str\]\] = None, negative\_prompt: typing.Union\[str, typing.List\[str\]\] = None, height: int = 480, width: int = 832, num\_frames: int = 81, num\_inference\_steps: int = 50, guidance\_scale: float = 5.0, num\_videos\_per\_prompt: typing.Optional\[int\] = 1, generator: typing.Union\[torch.\_C.Generator, typing.List\[torch.\_C.Generator\], NoneType\] = None, latents: typing.Optional\[torch.Tensor\] = None, prompt\_embeds: typing.Optional\[torch.Tensor\] = None, negative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = None, output\_type: typing.Optional\[str\] = 'np', return\_dict: bool = True, attention\_kwargs: typing.Optional\[typing.Dict\[str, typing.Any\]\] = None, callback\_on\_step\_end: typing.Union\[typing.Callable\[\[int, int, typing.Dict\], NoneType\], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType\] = None, callback\_on\_step\_end\_tensor\_inputs: typing.List\[str\] = \['latents'\], max\_sequence\_length: int = 512 ) → `~WanPipelineOutput` or `tuple`
182
+
183
+ Expand 20 parameters
184
+
185
+ Parameters
186
+
187
+ * [](#diffusers.WanImageToVideoPipeline.__call__.image)**image** (`PipelineImageInput`) β€” The input image to condition the generation on. Must be an image, a list of images or a `torch.Tensor`.
188
+ * [](#diffusers.WanImageToVideoPipeline.__call__.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds` instead.
189
+ * [](#diffusers.WanImageToVideoPipeline.__call__.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
190
+ * [](#diffusers.WanImageToVideoPipeline.__call__.height)**height** (`int`, defaults to `480`) β€” The height of the generated video.
191
+ * [](#diffusers.WanImageToVideoPipeline.__call__.width)**width** (`int`, defaults to `832`) β€” The width of the generated video.
192
+ * [](#diffusers.WanImageToVideoPipeline.__call__.num_frames)**num\_frames** (`int`, defaults to `81`) β€” The number of frames in the generated video.
193
+ * [](#diffusers.WanImageToVideoPipeline.__call__.num_inference_steps)**num\_inference\_steps** (`int`, defaults to `50`) β€” The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
194
+ * [](#diffusers.WanImageToVideoPipeline.__call__.guidance_scale)**guidance\_scale** (`float`, defaults to `5.0`) β€” Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). `guidance_scale` is defined as `w` of equation 2 of the [Imagen Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are closely linked to the text `prompt`, usually at the expense of lower image quality.
195
+ * [](#diffusers.WanImageToVideoPipeline.__call__.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” The number of videos to generate per prompt.
196
+ * [](#diffusers.WanImageToVideoPipeline.__call__.generator)**generator** (`torch.Generator` or `List[torch.Generator]`, _optional_) β€” A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation deterministic.
197
+ * [](#diffusers.WanImageToVideoPipeline.__call__.latents)**latents** (`torch.Tensor`, _optional_) β€” Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random `generator`.
198
+ * [](#diffusers.WanImageToVideoPipeline.__call__.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the `prompt` input argument.
199
+ * [](#diffusers.WanImageToVideoPipeline.__call__.output_type)**output\_type** (`str`, _optional_, defaults to `"np"`) β€” The output format of the generated image. Choose between `PIL.Image` or `np.array`.
200
+ * [](#diffusers.WanImageToVideoPipeline.__call__.return_dict)**return\_dict** (`bool`, _optional_, defaults to `True`) β€” Whether or not to return a `WanPipelineOutput` instead of a plain tuple.
201
+ * [](#diffusers.WanImageToVideoPipeline.__call__.attention_kwargs)**attention\_kwargs** (`dict`, _optional_) β€” A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under `self.processor` in [diffusers.models.attention\_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
202
+ * [](#diffusers.WanImageToVideoPipeline.__call__.callback_on_step_end)**callback\_on\_step\_end** (`Callable`, `PipelineCallback`, `MultiPipelineCallbacks`, _optional_) β€” A function or a subclass of `PipelineCallback` or `MultiPipelineCallbacks` that is called at the end of each denoising step during inference, with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
203
+ * [](#diffusers.WanImageToVideoPipeline.__call__.callback_on_step_end_tensor_inputs)**callback\_on\_step\_end\_tensor\_inputs** (`List`, _optional_) β€” The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
204
+ * [](#diffusers.WanImageToVideoPipeline.__call__.max_sequence_length)**max\_sequence\_length** (`int`, _optional_, defaults to `512`) β€” The maximum sequence length of the prompt.
205
+ * [](#diffusers.WanImageToVideoPipeline.__call__.shift)**shift** (`float`, _optional_, defaults to `5.0`) β€” The shift of the flow.
206
+ * [](#diffusers.WanImageToVideoPipeline.__call__.autocast_dtype)**autocast\_dtype** (`torch.dtype`, _optional_, defaults to `torch.bfloat16`) β€” The dtype to use for the torch.amp.autocast.
207
+
208
+ Returns
209
+
210
+
211
+
212
+ `~WanPipelineOutput` or `tuple`
213
+
214
+
215
+
216
+ If `return_dict` is `True`, `WanPipelineOutput` is returned, otherwise a `tuple` is returned where the first element is a list with the generated videos.
217
+
218
+ The call function to the pipeline for generation.
219
+
220
+ [](#diffusers.WanImageToVideoPipeline.__call__.example)
221
+
222
+ Examples:
223
+
224
+ Copied
225
+
226
+ \>>> import torch
227
+ \>>> import numpy as np
228
+ \>>> from diffusers import AutoencoderKLWan, WanImageToVideoPipeline
229
+ \>>> from diffusers.utils import export\_to\_video, load\_image
230
+ \>>> from transformers import CLIPVisionModel
231
+
232
+ \>>> \# Available models: Wan-AI/Wan2.1-I2V-14B-480P-Diffusers, Wan-AI/Wan2.1-I2V-14B-720P-Diffusers
233
+ \>>> model\_id = "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers"
234
+ \>>> image\_encoder = CLIPVisionModel.from\_pretrained(
235
+ ... model\_id, subfolder="image\_encoder", torch\_dtype=torch.float32
236
+ ... )
237
+ \>>> vae = AutoencoderKLWan.from\_pretrained(model\_id, subfolder="vae", torch\_dtype=torch.float32)
238
+ \>>> pipe = WanImageToVideoPipeline.from\_pretrained(
239
+ ... model\_id, vae=vae, image\_encoder=image\_encoder, torch\_dtype=torch.bfloat16
240
+ ... )
241
+ \>>> pipe.to("cuda")
242
+
243
+ \>>> image = load\_image(
244
+ ... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
245
+ ... )
246
+ \>>> max\_area = 480 \* 832
247
+ \>>> aspect\_ratio = image.height / image.width
248
+ \>>> mod\_value = pipe.vae\_scale\_factor\_spatial \* pipe.transformer.config.patch\_size\[1\]
249
+ \>>> height = round(np.sqrt(max\_area \* aspect\_ratio)) // mod\_value \* mod\_value
250
+ \>>> width = round(np.sqrt(max\_area / aspect\_ratio)) // mod\_value \* mod\_value
251
+ \>>> image = image.resize((width, height))
252
+ \>>> prompt = (
253
+ ... "An astronaut hatching from an egg, on the surface of the moon, the darkness and depth of space realised in "
254
+ ... "the background. High quality, ultrarealistic detail and breath-taking movie-like camera shot."
255
+ ... )
256
+ \>>> negative\_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
257
+
258
+ \>>> output = pipe(
259
+ ... image=image,
260
+ ... prompt=prompt,
261
+ ... negative\_prompt=negative\_prompt,
262
+ ... height=height,
263
+ ... width=width,
264
+ ... num\_frames=81,
265
+ ... guidance\_scale=5.0,
266
+ ... ).frames\[0\]
267
+ \>>> export\_to\_video(output, "output.mp4", fps=16)
268
+
269
+ #### encode\_prompt
270
+
271
+ [](#diffusers.WanImageToVideoPipeline.encode_prompt)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/wan/pipeline_wan_i2v.py#L228)
272
+
273
+ ( prompt: typing.Union\[str, typing.List\[str\]\], negative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, do\_classifier\_free\_guidance: bool = True, num\_videos\_per\_prompt: int = 1, prompt\_embeds: typing.Optional\[torch.Tensor\] = None, negative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = None, max\_sequence\_length: int = 226, device: typing.Optional\[torch.device\] = None, dtype: typing.Optional\[torch.dtype\] = None )
274
+
275
+ Parameters
276
+
277
+ * [](#diffusers.WanImageToVideoPipeline.encode_prompt.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” prompt to be encoded
278
+ * [](#diffusers.WanImageToVideoPipeline.encode_prompt.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
279
+ * [](#diffusers.WanImageToVideoPipeline.encode_prompt.do_classifier_free_guidance)**do\_classifier\_free\_guidance** (`bool`, _optional_, defaults to `True`) β€” Whether to use classifier free guidance or not.
280
+ * [](#diffusers.WanImageToVideoPipeline.encode_prompt.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” Number of videos that should be generated per prompt.
281
+ * [](#diffusers.WanImageToVideoPipeline.encode_prompt.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
282
+ * [](#diffusers.WanImageToVideoPipeline.encode_prompt.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
283
+ * [](#diffusers.WanImageToVideoPipeline.encode_prompt.device)**device** β€” (`torch.device`, _optional_): torch device
284
+ * [](#diffusers.WanImageToVideoPipeline.encode_prompt.dtype)**dtype** β€” (`torch.dtype`, _optional_): torch dtype
285
+
286
+ Encodes the prompt into text encoder hidden states.
287
+
288
+ [](#diffusers.pipelines.wan.pipeline_output.WanPipelineOutput)WanPipelineOutput
289
+ -------------------------------------------------------------------------------
290
+
291
+ ### class diffusers.pipelines.wan.pipeline\_output.WanPipelineOutput
292
+
293
+ [](#diffusers.pipelines.wan.pipeline_output.WanPipelineOutput)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/wan/pipeline_output.py#L8)
294
+
295
+ ( frames: Tensor )
296
+
297
+ Parameters
298
+
299
+ * [](#diffusers.pipelines.wan.pipeline_output.WanPipelineOutput.frames)**frames** (`torch.Tensor`, `np.ndarray`, or List\[List\[PIL.Image.Image\]\]) β€” List of video outputs. It can be a nested list of length `batch_size`, with each sub-list containing denoised PIL image sequences of length `num_frames`. It can also be a NumPy array or Torch tensor of shape `(batch_size, num_frames, channels, height, width)`.
300
+
301
+ Output class for Wan pipelines.
302
+
303
+
304
+
305
+
306
+
307
+
finetrainers/args.py CHANGED
@@ -447,7 +447,7 @@ class BaseArgs:
447
  }
448
 
449
  training_arguments = {
450
- "training_type": self.training_type,
451
  "seed": self.seed,
452
  "batch_size": self.batch_size,
453
  "train_steps": self.train_steps,
 
447
  }
448
 
449
  training_arguments = {
450
+ "training_type":self.training_type,
451
  "seed": self.seed,
452
  "batch_size": self.batch_size,
453
  "train_steps": self.train_steps,
vms/config.py CHANGED
@@ -497,7 +497,7 @@ class TrainingConfig:
497
  args.extend(["--flow_mode_scale", str(self.flow_mode_scale)])
498
 
499
  # Training arguments
500
- args.extend(["--training_type", self.training_type])
501
  args.extend(["--seed", str(self.seed)])
502
 
503
  # We don't use this, because mixed precision is handled by accelerate launch, not by the training script itself.
@@ -507,7 +507,7 @@ class TrainingConfig:
507
  args.extend(["--train_steps", str(self.train_steps)])
508
 
509
  # LoRA specific arguments
510
- if self.training_type == "lora":
511
  args.extend(["--rank", str(self.lora_rank)])
512
  args.extend(["--lora_alpha", str(self.lora_alpha)])
513
  args.extend(["--target_modules"] + self.target_modules)
 
497
  args.extend(["--flow_mode_scale", str(self.flow_mode_scale)])
498
 
499
  # Training arguments
500
+ args.extend(["--training_type",self.training_type])
501
  args.extend(["--seed", str(self.seed)])
502
 
503
  # We don't use this, because mixed precision is handled by accelerate launch, not by the training script itself.
 
507
  args.extend(["--train_steps", str(self.train_steps)])
508
 
509
  # LoRA specific arguments
510
+ if self.training_type == "lora":
511
  args.extend(["--rank", str(self.lora_rank)])
512
  args.extend(["--lora_alpha", str(self.lora_alpha)])
513
  args.extend(["--target_modules"] + self.target_modules)
vms/services/__init__.py CHANGED
@@ -1,14 +1,16 @@
1
- from .captioner import CaptioningProgress, CaptioningService
2
- from .importer import ImportService
3
  from .monitoring import MonitoringService
4
- from .splitter import SplittingService
5
- from .trainer import TrainingService
 
6
 
7
  __all__ = [
8
  'CaptioningProgress',
9
  'CaptioningService',
10
- 'ImportService',
11
  'MonitoringService',
12
  'SplittingService',
 
13
  'TrainingService',
14
  ]
 
1
+ from .captioning import CaptioningProgress, CaptioningService
2
+ from .importing import ImportingService
3
  from .monitoring import MonitoringService
4
+ from .splitting import SplittingService
5
+ from .previewing import PreviewingService
6
+ from .training import TrainingService
7
 
8
  __all__ = [
9
  'CaptioningProgress',
10
  'CaptioningService',
11
+ 'ImportingService',
12
  'MonitoringService',
13
  'SplittingService',
14
+ 'PreviewingService',
15
  'TrainingService',
16
  ]
vms/services/{captioner.py β†’ captioning.py} RENAMED
File without changes
vms/services/{importer β†’ importing}/__init__.py RENAMED
@@ -3,9 +3,9 @@ Import module for Video Model Studio.
3
  Handles file uploads, YouTube downloads, and Hugging Face Hub dataset integration.
4
  """
5
 
6
- from .import_service import ImportService
7
  from .file_upload import FileUploadHandler
8
  from .youtube import YouTubeDownloader
9
  from .hub_dataset import HubDatasetBrowser
10
 
11
- __all__ = ['ImportService', 'FileUploadHandler', 'YouTubeDownloader', 'HubDatasetBrowser']
 
3
  Handles file uploads, YouTube downloads, and Hugging Face Hub dataset integration.
4
  """
5
 
6
+ from .import_service import ImportingService
7
  from .file_upload import FileUploadHandler
8
  from .youtube import YouTubeDownloader
9
  from .hub_dataset import HubDatasetBrowser
10
 
11
+ __all__ = ['ImportingService', 'FileUploadHandler', 'YouTubeDownloader', 'HubDatasetBrowser']
vms/services/{importer β†’ importing}/file_upload.py RENAMED
File without changes
vms/services/{importer β†’ importing}/hub_dataset.py RENAMED
File without changes
vms/services/{importer β†’ importing}/import_service.py RENAMED
@@ -17,7 +17,7 @@ from vms.config import HF_API_TOKEN
17
 
18
  logger = logging.getLogger(__name__)
19
 
20
- class ImportService:
21
  """Main service class for handling imports from various sources"""
22
 
23
  def __init__(self):
 
17
 
18
  logger = logging.getLogger(__name__)
19
 
20
+ class ImportingService:
21
  """Main service class for handling imports from various sources"""
22
 
23
  def __init__(self):
vms/services/{importer β†’ importing}/youtube.py RENAMED
File without changes
vms/services/previewing.py ADDED
@@ -0,0 +1,406 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Preview service for Video Model Studio
3
+
4
+ Handles the video generation logic and model integration
5
+ """
6
+
7
+ import logging
8
+ import tempfile
9
+ import torch
10
+ from pathlib import Path
11
+ from typing import Dict, Any, List, Optional, Tuple, Callable
12
+
13
+ from vms.config import (
14
+ OUTPUT_PATH, STORAGE_PATH, MODEL_TYPES, TRAINING_PATH,
15
+ DEFAULT_PROMPT_PREFIX
16
+ )
17
+ from vms.utils import format_time
18
+
19
+ logger = logging.getLogger(__name__)
20
+
21
+ class PreviewingService:
22
+ """Handles the video generation logic and model integration"""
23
+
24
+ def __init__(self):
25
+ """Initialize the preview service"""
26
+ pass
27
+
28
+ def find_latest_lora_weights(self) -> Optional[str]:
29
+ """Find the latest LoRA weights file"""
30
+ try:
31
+ lora_path = OUTPUT_PATH / "pytorch_lora_weights.safetensors"
32
+ if lora_path.exists():
33
+ return str(lora_path)
34
+
35
+ # If not found in the expected location, try to find in checkpoints
36
+ checkpoints = list(OUTPUT_PATH.glob("checkpoint-*"))
37
+ if not checkpoints:
38
+ return None
39
+
40
+ latest_checkpoint = max(checkpoints, key=lambda x: int(x.name.split("-")[1]))
41
+ lora_path = latest_checkpoint / "pytorch_lora_weights.safetensors"
42
+
43
+ if lora_path.exists():
44
+ return str(lora_path)
45
+
46
+ return None
47
+ except Exception as e:
48
+ logger.error(f"Error finding LoRA weights: {e}")
49
+ return None
50
+
51
+ def generate_video(
52
+ self,
53
+ model_type: str,
54
+ prompt: str,
55
+ negative_prompt: str,
56
+ prompt_prefix: str,
57
+ width: int,
58
+ height: int,
59
+ num_frames: int,
60
+ guidance_scale: float,
61
+ flow_shift: float,
62
+ lora_weight: float,
63
+ inference_steps: int,
64
+ enable_cpu_offload: bool,
65
+ fps: int
66
+ ) -> Tuple[Optional[str], str, str]:
67
+ """Generate a video using the trained model"""
68
+ try:
69
+ log_messages = []
70
+
71
+ def log(msg: str):
72
+ log_messages.append(msg)
73
+ logger.info(msg)
74
+ return "\n".join(log_messages)
75
+
76
+ # Find latest LoRA weights
77
+ lora_path = self.find_latest_lora_weights()
78
+ if not lora_path:
79
+ return None, "Error: No LoRA weights found", log("Error: No LoRA weights found in output directory")
80
+
81
+ # Add prefix to prompt
82
+ if prompt_prefix and not prompt.startswith(prompt_prefix):
83
+ full_prompt = f"{prompt_prefix}{prompt}"
84
+ else:
85
+ full_prompt = prompt
86
+
87
+ # Create correct num_frames (should be 8*k + 1)
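+ # Note: 8 * k + 1 frame counts also satisfy the 4 * k + 1 constraint recommended for Wan (8k + 1 = 4(2k) + 1)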
88
+ adjusted_num_frames = ((num_frames - 1) // 8) * 8 + 1
89
+ if adjusted_num_frames != num_frames:
90
+ log(f"Adjusted number of frames from {num_frames} to {adjusted_num_frames} to match model requirements")
91
+ num_frames = adjusted_num_frames
92
+
93
+ # Get model type (internal name)
94
+ internal_model_type = MODEL_TYPES.get(model_type)
95
+ if not internal_model_type:
96
+ return None, f"Error: Invalid model type {model_type}", log(f"Error: Invalid model type {model_type}")
97
+
98
+ log(f"Generating video with model type: {internal_model_type}")
99
+ log(f"Using LoRA weights from: {lora_path}")
100
+ log(f"Resolution: {width}x{height}, Frames: {num_frames}, FPS: {fps}")
101
+ log(f"Guidance Scale: {guidance_scale}, Flow Shift: {flow_shift}, LoRA Weight: {lora_weight}")
102
+ log(f"Prompt: {full_prompt}")
103
+ log(f"Negative Prompt: {negative_prompt}")
104
+
105
+ # Import required components based on model type
106
+ if internal_model_type == "wan":
107
+ return self.generate_wan_video(
108
+ full_prompt, negative_prompt, width, height, num_frames,
109
+ guidance_scale, flow_shift, lora_path, lora_weight,
110
+ inference_steps, enable_cpu_offload, fps, log
111
+ )
112
+ elif internal_model_type == "ltx_video":
113
+ return self.generate_ltx_video(
114
+ full_prompt, negative_prompt, width, height, num_frames,
115
+ guidance_scale, flow_shift, lora_path, lora_weight,
116
+ inference_steps, enable_cpu_offload, fps, log
117
+ )
118
+ elif internal_model_type == "hunyuan_video":
119
+ return self.generate_hunyuan_video(
120
+ full_prompt, negative_prompt, width, height, num_frames,
121
+ guidance_scale, flow_shift, lora_path, lora_weight,
122
+ inference_steps, enable_cpu_offload, fps, log
123
+ )
124
+ else:
125
+ return None, f"Error: Unsupported model type {internal_model_type}", log(f"Error: Unsupported model type {internal_model_type}")
126
+
127
+ except Exception as e:
128
+ logger.exception("Error generating video")
129
+ return None, f"Error: {str(e)}", f"Exception occurred: {str(e)}"
130
+
131
+ def generate_wan_video(
132
+ self,
133
+ prompt: str,
134
+ negative_prompt: str,
135
+ width: int,
136
+ height: int,
137
+ num_frames: int,
138
+ guidance_scale: float,
139
+ flow_shift: float,
140
+ lora_path: str,
141
+ lora_weight: float,
142
+ inference_steps: int,
143
+ enable_cpu_offload: bool,
144
+ fps: int,
145
+ log_fn: Callable
146
+ ) -> Tuple[Optional[str], str, str]:
147
+ """Generate video using Wan model"""
148
+ start_time = torch.cuda.Event(enable_timing=True)
149
+ end_time = torch.cuda.Event(enable_timing=True)
150
+
151
+ try:
152
+ import torch
153
+ from diffusers import AutoencoderKLWan, WanPipeline
154
+ from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
155
+ from diffusers.utils import export_to_video
156
+
157
+ log_fn("Importing Wan model components...")
158
+
159
+ # Use the smaller model for faster inference
160
+ model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
161
+
162
+ log_fn(f"Loading VAE from {model_id}...")
163
+ vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
164
+
165
+ log_fn(f"Loading transformer from {model_id}...")
166
+ pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
167
+
168
+ log_fn(f"Configuring scheduler with flow_shift={flow_shift}...")
169
+ pipe.scheduler = UniPCMultistepScheduler.from_config(
170
+ pipe.scheduler.config,
171
+ flow_shift=flow_shift
172
+ )
173
+
174
+ log_fn("Moving pipeline to CUDA device...")
175
+ pipe.to("cuda")
176
+
177
+ if enable_cpu_offload:
178
+ log_fn("Enabling model CPU offload...")
179
+ pipe.enable_model_cpu_offload()
180
+
181
+ log_fn(f"Loading LoRA weights from {lora_path} with weight {lora_weight}...")
182
+ pipe.load_lora_weights(lora_path)
183
+ pipe.fuse_lora(lora_weight)
184
+
185
+ # Create temporary file for the output
186
+ with tempfile.NamedTemporaryFile(suffix='.mp4', delete=False) as temp_file:
187
+ output_path = temp_file.name
188
+
189
+ log_fn("Starting video generation...")
190
+ start_time.record()
191
+
192
+ output = pipe(
193
+ prompt=prompt,
194
+ negative_prompt=negative_prompt,
195
+ height=height,
196
+ width=width,
197
+ num_frames=num_frames,
198
+ guidance_scale=guidance_scale,
199
+ num_inference_steps=inference_steps,
200
+ ).frames[0]
201
+
202
+ end_time.record()
203
+ torch.cuda.synchronize()
204
+ generation_time = start_time.elapsed_time(end_time) / 1000 # Convert to seconds
205
+
206
+ log_fn(f"Video generation completed in {format_time(generation_time)}")
207
+ log_fn(f"Exporting video to {output_path}...")
208
+
209
+ export_to_video(output, output_path, fps=fps)
210
+
211
+ log_fn("Video generation and export completed successfully!")
212
+
213
+ # Clean up CUDA memory
214
+ pipe = None
215
+ torch.cuda.empty_cache()
216
+
217
+ return output_path, "Video generated successfully!", log_fn(f"Generation completed in {format_time(generation_time)}")
218
+
219
+ except Exception as e:
220
+ log_fn(f"Error generating video with Wan: {str(e)}")
221
+ # Clean up CUDA memory
222
+ torch.cuda.empty_cache()
223
+ return None, f"Error: {str(e)}", log_fn(f"Exception occurred: {str(e)}")
224
+
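The Wan branch above boils down to a scheduler swap: the pipeline is loaded once and its scheduler is rebuilt from its own config with a resolution-dependent `flow_shift`. A minimal standalone sketch of just that step, reusing the same model id and the 480p value from the code above (everything else here is illustrative, not the service code itself):

import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)

# Rebuild the scheduler from its own config, overriding only flow_shift
# (the Preview tab uses 3.0 for the 480p preset and 5.0 for 720p).
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=3.0)
pipe.to("cuda")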
225
+ def generate_ltx_video(
226
+ self,
227
+ prompt: str,
228
+ negative_prompt: str,
229
+ width: int,
230
+ height: int,
231
+ num_frames: int,
232
+ guidance_scale: float,
233
+ flow_shift: float,
234
+ lora_path: str,
235
+ lora_weight: float,
236
+ inference_steps: int,
237
+ enable_cpu_offload: bool,
238
+ fps: int,
239
+ log_fn: Callable
240
+ ) -> Tuple[Optional[str], str, str]:
241
+ """Generate video using LTX model"""
242
+ start_time = torch.cuda.Event(enable_timing=True)
243
+ end_time = torch.cuda.Event(enable_timing=True)
244
+
245
+ try:
246
+ import torch
247
+ from diffusers import LTXPipeline
248
+ from diffusers.utils import export_to_video
249
+
250
+ log_fn("Importing LTX model components...")
251
+
252
+ model_id = "Lightricks/LTX-Video"
253
+
254
+ log_fn(f"Loading pipeline from {model_id}...")
255
+ pipe = LTXPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
256
+
257
+ log_fn("Moving pipeline to CUDA device...")
258
+ pipe.to("cuda")
259
+
260
+ if enable_cpu_offload:
261
+ log_fn("Enabling model CPU offload...")
262
+ pipe.enable_model_cpu_offload()
263
+
264
+ log_fn(f"Loading LoRA weights from {lora_path} with weight {lora_weight}...")
265
+ pipe.load_lora_weights(lora_path)
266
+ pipe.fuse_lora(lora_weight)
267
+
268
+ # Create temporary file for the output
269
+ with tempfile.NamedTemporaryFile(suffix='.mp4', delete=False) as temp_file:
270
+ output_path = temp_file.name
271
+
272
+ log_fn("Starting video generation...")
273
+ start_time.record()
274
+
275
+ video = pipe(
276
+ prompt=prompt,
277
+ negative_prompt=negative_prompt,
278
+ height=height,
279
+ width=width,
280
+ num_frames=num_frames,
281
+ guidance_scale=guidance_scale,
282
+ decode_timestep=0.03,
283
+ decode_noise_scale=0.025,
284
+ num_inference_steps=inference_steps,
285
+ ).frames[0]
286
+
287
+ end_time.record()
288
+ torch.cuda.synchronize()
289
+ generation_time = start_time.elapsed_time(end_time) / 1000 # Convert to seconds
290
+
291
+ log_fn(f"Video generation completed in {format_time(generation_time)}")
292
+ log_fn(f"Exporting video to {output_path}...")
293
+
294
+ export_to_video(video, output_path, fps=fps)
295
+
296
+ log_fn("Video generation and export completed successfully!")
297
+
298
+ # Clean up CUDA memory
299
+ pipe = None
300
+ torch.cuda.empty_cache()
301
+
302
+ return output_path, "Video generated successfully!", log_fn(f"Generation completed in {format_time(generation_time)}")
303
+
304
+ except Exception as e:
305
+ log_fn(f"Error generating video with LTX: {str(e)}")
306
+ # Clean up CUDA memory
307
+ torch.cuda.empty_cache()
308
+ return None, f"Error: {str(e)}", log_fn(f"Exception occurred: {str(e)}")
309
+
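All three branches share the same LoRA step: load the trained weights, then fuse them into the base model at a chosen strength before inference. A hedged sketch of that step on the LTX pipeline from the section above; the path and strength are placeholders for the values the service receives, and the fuse strength is passed via the `lora_scale` keyword that recent diffusers releases expose:

import torch
from diffusers import LTXPipeline

pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16).to("cuda")

# Placeholder path and strength standing in for lora_path / lora_weight.
lora_path = "/path/to/pytorch_lora_weights.safetensors"
lora_weight = 0.7

pipe.load_lora_weights(lora_path)
pipe.fuse_lora(lora_scale=lora_weight)  # fuse the adapter at the chosen strength

# ... run pipe(...) here ...

pipe.unfuse_lora()  # undo the fuse if the pipeline object will be reused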
310
+ def generate_hunyuan_video(
311
+ self,
312
+ prompt: str,
313
+ negative_prompt: str,
314
+ width: int,
315
+ height: int,
316
+ num_frames: int,
317
+ guidance_scale: float,
318
+ flow_shift: float,
319
+ lora_path: str,
320
+ lora_weight: float,
321
+ inference_steps: int,
322
+ enable_cpu_offload: bool,
323
+ fps: int,
324
+ log_fn: Callable
325
+ ) -> Tuple[Optional[str], str, str]:
326
+ """Generate video using HunyuanVideo model"""
327
+ start_time = torch.cuda.Event(enable_timing=True)
328
+ end_time = torch.cuda.Event(enable_timing=True)
329
+
330
+ try:
331
+ import torch
332
+ from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel, AutoencoderKLHunyuanVideo
333
+ from diffusers.utils import export_to_video
334
+
335
+ log_fn("Importing HunyuanVideo model components...")
336
+
337
+ model_id = "hunyuanvideo-community/HunyuanVideo"
338
+
339
+ log_fn(f"Loading transformer from {model_id}...")
340
+ transformer = HunyuanVideoTransformer3DModel.from_pretrained(
341
+ model_id,
342
+ subfolder="transformer",
343
+ torch_dtype=torch.bfloat16
344
+ )
345
+
346
+ log_fn(f"Loading pipeline from {model_id}...")
347
+ pipe = HunyuanVideoPipeline.from_pretrained(
348
+ model_id,
349
+ transformer=transformer,
350
+ torch_dtype=torch.float16
351
+ )
352
+
353
+ log_fn("Enabling VAE tiling for better memory usage...")
354
+ pipe.vae.enable_tiling()
355
+
356
+ log_fn("Moving pipeline to CUDA device...")
357
+ pipe.to("cuda")
358
+
359
+ if enable_cpu_offload:
360
+ log_fn("Enabling model CPU offload...")
361
+ pipe.enable_model_cpu_offload()
362
+
363
+ log_fn(f"Loading LoRA weights from {lora_path} with weight {lora_weight}...")
364
+ pipe.load_lora_weights(lora_path)
365
+ pipe.fuse_lora(lora_weight)
366
+
367
+ # Create temporary file for the output
368
+ with tempfile.NamedTemporaryFile(suffix='.mp4', delete=False) as temp_file:
369
+ output_path = temp_file.name
370
+
371
+ log_fn("Starting video generation...")
372
+ start_time.record()
373
+
374
+ output = pipe(
375
+ prompt=prompt,
376
+ negative_prompt=negative_prompt if negative_prompt else None,
377
+ height=height,
378
+ width=width,
379
+ num_frames=num_frames,
380
+ guidance_scale=guidance_scale,
381
+ true_cfg_scale=1.0,
382
+ num_inference_steps=inference_steps,
383
+ ).frames[0]
384
+
385
+ end_time.record()
386
+ torch.cuda.synchronize()
387
+ generation_time = start_time.elapsed_time(end_time) / 1000 # Convert to seconds
388
+
389
+ log_fn(f"Video generation completed in {format_time(generation_time)}")
390
+ log_fn(f"Exporting video to {output_path}...")
391
+
392
+ export_to_video(output, output_path, fps=fps)
393
+
394
+ log_fn("Video generation and export completed successfully!")
395
+
396
+ # Clean up CUDA memory
397
+ pipe = None
398
+ torch.cuda.empty_cache()
399
+
400
+ return output_path, "Video generated successfully!", log_fn(f"Generation completed in {format_time(generation_time)}")
401
+
402
+ except Exception as e:
403
+ log_fn(f"Error generating video with HunyuanVideo: {str(e)}")
404
+ # Clean up CUDA memory
405
+ torch.cuda.empty_cache()
406
+ return None, f"Error: {str(e)}", log_fn(f"Exception occurred: {str(e)}")
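Each generator times the run with CUDA events rather than wall-clock time. Stripped of the service plumbing, the pattern is the following sketch (it assumes a CUDA device is available; the placeholder comment stands in for the pipeline call):

import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
# ... GPU work goes here, e.g. the pipeline call ...
end.record()

# Event timestamps are recorded asynchronously on the CUDA stream,
# so synchronize before reading the timer.
torch.cuda.synchronize()
elapsed_seconds = start.elapsed_time(end) / 1000  # elapsed_time() returns milliseconds
print(f"generation took {elapsed_seconds:.1f}s")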
vms/services/{splitter.py β†’ splitting.py} RENAMED
File without changes
vms/services/{trainer.py β†’ training.py} RENAMED
File without changes
vms/tabs/caption_tab.py CHANGED
@@ -224,8 +224,8 @@ class CaptionTab(BaseTab):
224
  self._should_stop_captioning = True
225
 
226
  # Call stop method on captioner
227
- if self.app.captioner:
228
- self.app.captioner.stop_captioning()
229
 
230
  # Get updated file list
231
  updated_list = self.list_training_files_to_caption()
@@ -286,7 +286,7 @@ class CaptionTab(BaseTab):
286
  file_statuses = {}
287
 
288
  # Start the actual captioning process
289
- async for rows in self.app.captioner.start_caption_generation(captioning_bot_instructions, prompt_prefix):
290
  # Update our tracking of file statuses
291
  for name, status in rows:
292
  file_statuses[name] = status
@@ -516,7 +516,7 @@ class CaptionTab(BaseTab):
516
  # Use the original file path stored during selection instead of the temporary preview paths
517
  if original_file_path:
518
  file_path = Path(original_file_path)
519
- self.app.captioner.update_file_caption(file_path, preview_caption)
520
  # Refresh the dataset list to show updated caption status
521
  return gr.update(value="Caption saved successfully!")
522
  else:
 
224
  self._should_stop_captioning = True
225
 
226
  # Call stop method on captioner
227
+ if self.app.captioning:
228
+ self.app.captioning.stop_captioning()
229
 
230
  # Get updated file list
231
  updated_list = self.list_training_files_to_caption()
 
286
  file_statuses = {}
287
 
288
  # Start the actual captioning process
289
+ async for rows in self.app.captioning.start_caption_generation(captioning_bot_instructions, prompt_prefix):
290
  # Update our tracking of file statuses
291
  for name, status in rows:
292
  file_statuses[name] = status
 
516
  # Use the original file path stored during selection instead of the temporary preview paths
517
  if original_file_path:
518
  file_path = Path(original_file_path)
519
+ self.app.captioning.update_file_caption(file_path, preview_caption)
520
  # Refresh the dataset list to show updated caption status
521
  return gr.update(value="Caption saved successfully!")
522
  else:
vms/tabs/import_tab/hub_tab.py CHANGED
@@ -168,7 +168,7 @@ class HubTab(BaseTab):
168
  """Search datasets on the Hub matching the query"""
169
  try:
170
  logger.info(f"Searching for datasets with query: '{query}'")
171
- results_full = self.app.importer.search_datasets(query)
172
 
173
  # Extract just the first column (dataset IDs) for display
174
  results = [[row[0]] for row in results_full]
@@ -199,7 +199,7 @@ class HubTab(BaseTab):
199
  logger.info(f"Getting dataset info for: {dataset_id}")
200
 
201
  # Use the importer service to get dataset info
202
- info_text, file_counts, _ = self.app.importer.get_dataset_info(dataset_id)
203
 
204
  # Get counts of each file type
205
  video_count = file_counts.get("video", 0)
@@ -247,7 +247,7 @@ class HubTab(BaseTab):
247
  progress_callback(fraction, desc=desc)
248
 
249
  # Call the actual download function with our adapter
250
- result = await self.app.importer.download_file_group(
251
  dataset_id,
252
  file_type,
253
  enable_splitting,
 
168
  """Search datasets on the Hub matching the query"""
169
  try:
170
  logger.info(f"Searching for datasets with query: '{query}'")
171
+ results_full = self.app.importing.search_datasets(query)
172
 
173
  # Extract just the first column (dataset IDs) for display
174
  results = [[row[0]] for row in results_full]
 
199
  logger.info(f"Getting dataset info for: {dataset_id}")
200
 
201
  # Use the importer service to get dataset info
202
+ info_text, file_counts, _ = self.app.importing.get_dataset_info(dataset_id)
203
 
204
  # Get counts of each file type
205
  video_count = file_counts.get("video", 0)
 
247
  progress_callback(fraction, desc=desc)
248
 
249
  # Call the actual download function with our adapter
250
+ result = await self.app.importing.download_file_group(
251
  dataset_id,
252
  file_type,
253
  enable_splitting,
vms/tabs/import_tab/import_tab.py CHANGED
@@ -89,7 +89,7 @@ class ImportTab(BaseTab):
89
 
90
  # If scene detection isn't already running and there are videos to process,
91
  # and auto-splitting is enabled, start the detection
92
- if videos and not self.app.splitter.is_processing() and enable_splitting:
93
  # Start the scene detection in a separate thread
94
  self._start_scene_detection_bg(enable_splitting)
95
  msg = "Starting automatic scene detection..."
@@ -133,7 +133,7 @@ class ImportTab(BaseTab):
133
  try:
134
  async def copy_files():
135
  for video_file in VIDEOS_TO_SPLIT_PATH.glob("*.mp4"):
136
- await self.app.splitter.process_video(video_file, enable_splitting=False)
137
 
138
  loop.run_until_complete(copy_files())
139
  except Exception as e:
 
89
 
90
  # If scene detection isn't already running and there are videos to process,
91
  # and auto-splitting is enabled, start the detection
92
+ if videos and not self.app.splitting.is_processing() and enable_splitting:
93
  # Start the scene detection in a separate thread
94
  self._start_scene_detection_bg(enable_splitting)
95
  msg = "Starting automatic scene detection..."
 
133
  try:
134
  async def copy_files():
135
  for video_file in VIDEOS_TO_SPLIT_PATH.glob("*.mp4"):
136
+ await self.app.splitting.process_video(video_file, enable_splitting=False)
137
 
138
  loop.run_until_complete(copy_files())
139
  except Exception as e:
vms/tabs/import_tab/upload_tab.py CHANGED
@@ -53,7 +53,7 @@ class UploadTab(BaseTab):
53
  """Connect event handlers to UI components"""
54
  # File upload event
55
  self.components["files"].upload(
56
- fn=lambda x: self.app.importer.process_uploaded_files(x),
57
  inputs=[self.components["files"]],
58
  outputs=[self.components["import_status"]] # This comes from parent tab
59
  ).success(
 
53
  """Connect event handlers to UI components"""
54
  # File upload event
55
  self.components["files"].upload(
56
+ fn=lambda x: self.app.importing.process_uploaded_files(x),
57
  inputs=[self.components["files"]],
58
  outputs=[self.components["import_status"]] # This comes from parent tab
59
  ).success(
vms/tabs/import_tab/youtube_tab.py CHANGED
@@ -49,7 +49,7 @@ class YouTubeTab(BaseTab):
49
  """Connect event handlers to UI components"""
50
  # YouTube download event
51
  self.components["youtube_download_btn"].click(
52
- fn=self.app.importer.download_youtube_video,
53
  inputs=[self.components["youtube_url"]],
54
  outputs=[self.components["import_status"]] # This comes from parent tab
55
  ).success(
 
49
  """Connect event handlers to UI components"""
50
  # YouTube download event
51
  self.components["youtube_download_btn"].click(
52
+ fn=self.app.importing.download_youtube_video,
53
  inputs=[self.components["youtube_url"]],
54
  outputs=[self.components["import_status"]] # This comes from parent tab
55
  ).success(
vms/tabs/manage_tab.py CHANGED
@@ -23,7 +23,7 @@ class ManageTab(BaseTab):
23
  def __init__(self, app_state):
24
  super().__init__(app_state)
25
  self.id = "manage_tab"
26
- self.title = "6️⃣ Manage"
27
 
28
  def create(self, parent=None) -> gr.TabItem:
29
  """Create the Manage tab UI components"""
@@ -90,12 +90,12 @@ class ManageTab(BaseTab):
90
 
91
  # Download buttons
92
  self.components["download_dataset_btn"].click(
93
- fn=self.app.trainer.create_training_dataset_zip,
94
  outputs=[self.components["download_dataset_btn"]]
95
  )
96
 
97
  self.components["download_model_btn"].click(
98
- fn=self.app.trainer.get_model_output_safetensors,
99
  outputs=[self.components["download_model_btn"]]
100
  )
101
 
@@ -139,11 +139,11 @@ class ManageTab(BaseTab):
139
  return f"Error: {validation['error']}"
140
 
141
  # Check if we have a model to upload
142
- if not self.app.trainer.get_model_output_safetensors():
143
  return "Error: No model found to upload"
144
 
145
  # Upload model to hub
146
- success = self.app.trainer.upload_to_hub(OUTPUT_PATH, repo_id)
147
 
148
  if success:
149
  return f"Successfully uploaded model to {repo_id}"
@@ -184,25 +184,25 @@ class ManageTab(BaseTab):
184
 
185
  try:
186
  # Stop training if running
187
- if self.app.trainer.is_training_running():
188
- training_result = self.app.trainer.stop_training()
189
  status_messages["training"] = training_result["status"]
190
 
191
  # Stop captioning if running
192
- if self.app.captioner:
193
- self.app.captioner.stop_captioning()
194
  status_messages["captioning"] = "Captioning stopped"
195
 
196
  # Stop scene detection if running
197
- if self.app.splitter.is_processing():
198
- self.app.splitter.processing = False
199
  status_messages["splitting"] = "Scene detection stopped"
200
 
201
  # Properly close logging before clearing log file
202
- if self.app.trainer.file_handler:
203
- self.app.trainer.file_handler.close()
204
- logger.removeHandler(self.app.trainer.file_handler)
205
- self.app.trainer.file_handler = None
206
 
207
  if LOG_FILE_PATH.exists():
208
  LOG_FILE_PATH.unlink()
@@ -221,10 +221,10 @@ class ManageTab(BaseTab):
221
 
222
  # Reset any persistent state
223
  self.app.tabs["caption_tab"]._should_stop_captioning = True
224
- self.app.splitter.processing = False
225
 
226
  # Recreate logging setup
227
- self.app.trainer.setup_logging()
228
 
229
  return {
230
  "status": "All processes stopped and data cleared",
 
23
  def __init__(self, app_state):
24
  super().__init__(app_state)
25
  self.id = "manage_tab"
26
+ self.title = "7️⃣ Manage"
27
 
28
  def create(self, parent=None) -> gr.TabItem:
29
  """Create the Manage tab UI components"""
 
90
 
91
  # Download buttons
92
  self.components["download_dataset_btn"].click(
93
+ fn=self.app.training.create_training_dataset_zip,
94
  outputs=[self.components["download_dataset_btn"]]
95
  )
96
 
97
  self.components["download_model_btn"].click(
98
+ fn=self.app.training.get_model_output_safetensors,
99
  outputs=[self.components["download_model_btn"]]
100
  )
101
 
 
139
  return f"Error: {validation['error']}"
140
 
141
  # Check if we have a model to upload
142
+ if not self.app.training.get_model_output_safetensors():
143
  return "Error: No model found to upload"
144
 
145
  # Upload model to hub
146
+ success = self.app.training.upload_to_hub(OUTPUT_PATH, repo_id)
147
 
148
  if success:
149
  return f"Successfully uploaded model to {repo_id}"
 
184
 
185
  try:
186
  # Stop training if running
187
+ if self.app.training.is_training_running():
188
+ training_result = self.app.training.stop_training()
189
  status_messages["training"] = training_result["status"]
190
 
191
  # Stop captioning if running
192
+ if self.app.captioning:
193
+ self.app.captioning.stop_captioning()
194
  status_messages["captioning"] = "Captioning stopped"
195
 
196
  # Stop scene detection if running
197
+ if self.app.splitting.is_processing():
198
+ self.app.splitting.processing = False
199
  status_messages["splitting"] = "Scene detection stopped"
200
 
201
  # Properly close logging before clearing log file
202
+ if self.app.training.file_handler:
203
+ self.app.training.file_handler.close()
204
+ logger.removeHandler(self.app.training.file_handler)
205
+ self.app.training.file_handler = None
206
 
207
  if LOG_FILE_PATH.exists():
208
  LOG_FILE_PATH.unlink()
 
221
 
222
  # Reset any persistent state
223
  self.app.tabs["caption_tab"]._should_stop_captioning = True
224
+ self.app.splitting.processing = False
225
 
226
  # Recreate logging setup
227
+ self.app.training.setup_logging()
228
 
229
  return {
230
  "status": "All processes stopped and data cleared",
vms/tabs/monitor_tab.py CHANGED
@@ -140,8 +140,8 @@ class MonitorTab(BaseTab):
140
  def on_enter(self):
141
  """Called when the tab is selected"""
142
  # Start monitoring service if not already running
143
- if not self.app.monitor.is_running:
144
- self.app.monitor.start_monitoring()
145
 
146
  # Trigger initial refresh
147
  return self.refresh_all()
@@ -178,7 +178,7 @@ class MonitorTab(BaseTab):
178
  """
179
  try:
180
  # Get system info
181
- system_info = self.app.monitor.get_system_info()
182
 
183
  # Split system info into separate components
184
  system_info_html = self.format_system_info(system_info)
@@ -187,13 +187,13 @@ class MonitorTab(BaseTab):
187
  storage_info_html = self.format_storage_info()
188
 
189
  # Get current metrics
190
- # current_metrics = self.app.monitor.get_current_metrics()
191
  metrics_html = "" # self.format_current_metrics(current_metrics)
192
 
193
  # Generate plots
194
- cpu_plot = self.app.monitor.generate_cpu_plot()
195
- memory_plot = self.app.monitor.generate_memory_plot()
196
- #per_core_plot = self.app.monitor.generate_per_core_plot()
197
 
198
  return (
199
  system_info_html,
 
140
  def on_enter(self):
141
  """Called when the tab is selected"""
142
  # Start monitoring service if not already running
143
+ if not self.app.monitoring.is_running:
144
+ self.app.monitoring.start_monitoring()
145
 
146
  # Trigger initial refresh
147
  return self.refresh_all()
 
178
  """
179
  try:
180
  # Get system info
181
+ system_info = self.app.monitoring.get_system_info()
182
 
183
  # Split system info into separate components
184
  system_info_html = self.format_system_info(system_info)
 
187
  storage_info_html = self.format_storage_info()
188
 
189
  # Get current metrics
190
+ # current_metrics = self.app.monitoring.get_current_metrics()
191
  metrics_html = "" # self.format_current_metrics(current_metrics)
192
 
193
  # Generate plots
194
+ cpu_plot = self.app.monitoring.generate_cpu_plot()
195
+ memory_plot = self.app.monitoring.generate_memory_plot()
196
+ #per_core_plot = self.app.monitoring.generate_per_core_plot()
197
 
198
  return (
199
  system_info_html,
vms/tabs/preview_tab.py ADDED
@@ -0,0 +1,240 @@
1
+ """
2
+ Preview tab for Video Model Studio UI
3
+ """
4
+
5
+ import gradio as gr
6
+ import logging
7
+ from pathlib import Path
8
+ from typing import Dict, Any, List, Optional, Tuple
9
+
10
+ from vms.services.base_tab import BaseTab
11
+ from vms.config import (
12
+ MODEL_TYPES, DEFAULT_PROMPT_PREFIX
13
+ )
14
+
15
+ logger = logging.getLogger(__name__)
16
+
17
+ class PreviewTab(BaseTab):
18
+ """Preview tab for testing trained models"""
19
+
20
+ def __init__(self, app_state):
21
+ super().__init__(app_state)
22
+ self.id = "preview_tab"
23
+ self.title = "6️⃣ Preview"
24
+
25
+ # Get reference to the preview service from app_state
26
+ self.previewing_service = app_state.previewing
27
+
28
+ def create(self, parent=None) -> gr.TabItem:
29
+ """Create the Preview tab UI components"""
30
+ with gr.TabItem(self.title, id=self.id) as tab:
31
+ with gr.Row():
32
+ gr.Markdown("## Test Your Trained Model")
33
+
34
+ with gr.Row():
35
+ with gr.Column(scale=2):
36
+ self.components["prompt"] = gr.Textbox(
37
+ label="Prompt",
38
+ placeholder="Enter your prompt here...",
39
+ lines=3
40
+ )
41
+
42
+ self.components["negative_prompt"] = gr.Textbox(
43
+ label="Negative Prompt",
44
+ placeholder="Enter negative prompt here...",
45
+ lines=3,
46
+ value="worst quality, low quality, blurry, jittery, distorted, ugly, deformed, disfigured, messy background"
47
+ )
48
+
49
+ self.components["prompt_prefix"] = gr.Textbox(
50
+ label="Global Prompt Prefix",
51
+ placeholder="Prefix to add to all prompts",
52
+ value=DEFAULT_PROMPT_PREFIX
53
+ )
54
+
55
+ with gr.Row():
56
+ self.components["model_type"] = gr.Dropdown(
57
+ choices=list(MODEL_TYPES.keys()),
58
+ label="Model Type",
59
+ value=list(MODEL_TYPES.keys())[0]
60
+ )
61
+
62
+ self.components["resolution_preset"] = gr.Dropdown(
63
+ choices=["480p", "720p"],
64
+ label="Resolution Preset",
65
+ value="480p"
66
+ )
67
+
68
+ with gr.Row():
69
+ self.components["width"] = gr.Number(
70
+ label="Width",
71
+ value=832,
72
+ precision=0
73
+ )
74
+
75
+ self.components["height"] = gr.Number(
76
+ label="Height",
77
+ value=480,
78
+ precision=0
79
+ )
80
+
81
+ with gr.Row():
82
+ self.components["num_frames"] = gr.Slider(
83
+ label="Number of Frames",
84
+ minimum=1,
85
+ maximum=257,
86
+ step=8,
87
+ value=49
88
+ )
89
+
90
+ self.components["fps"] = gr.Slider(
91
+ label="FPS",
92
+ minimum=1,
93
+ maximum=60,
94
+ step=1,
95
+ value=16
96
+ )
97
+
98
+ with gr.Row():
99
+ self.components["guidance_scale"] = gr.Slider(
100
+ label="Guidance Scale",
101
+ minimum=1.0,
102
+ maximum=10.0,
103
+ step=0.1,
104
+ value=5.0
105
+ )
106
+
107
+ self.components["flow_shift"] = gr.Slider(
108
+ label="Flow Shift",
109
+ minimum=0.0,
110
+ maximum=10.0,
111
+ step=0.1,
112
+ value=3.0
113
+ )
114
+
115
+ with gr.Row():
116
+ self.components["lora_weight"] = gr.Slider(
117
+ label="LoRA Weight",
118
+ minimum=0.0,
119
+ maximum=1.0,
120
+ step=0.01,
121
+ value=0.7
122
+ )
123
+
124
+ self.components["inference_steps"] = gr.Slider(
125
+ label="Inference Steps",
126
+ minimum=1,
127
+ maximum=100,
128
+ step=1,
129
+ value=30
130
+ )
131
+
132
+ self.components["enable_cpu_offload"] = gr.Checkbox(
133
+ label="Enable Model CPU Offload (for low-VRAM GPUs)",
134
+ value=True
135
+ )
136
+
137
+ self.components["generate_btn"] = gr.Button(
138
+ "Generate Video",
139
+ variant="primary"
140
+ )
141
+
142
+ with gr.Column(scale=3):
143
+ self.components["preview_video"] = gr.Video(
144
+ label="Generated Video",
145
+ interactive=False
146
+ )
147
+
148
+ self.components["status"] = gr.Textbox(
149
+ label="Status",
150
+ interactive=False
151
+ )
152
+
153
+ with gr.Accordion("Log", open=False):
154
+ self.components["log"] = gr.TextArea(
155
+ label="Generation Log",
156
+ interactive=False,
157
+ lines=10
158
+ )
159
+
160
+ return tab
161
+
162
+ def connect_events(self) -> None:
163
+ """Connect event handlers to UI components"""
164
+ # Update resolution when preset changes
165
+ self.components["resolution_preset"].change(
166
+ fn=self.update_resolution,
167
+ inputs=[self.components["resolution_preset"]],
168
+ outputs=[
169
+ self.components["width"],
170
+ self.components["height"],
171
+ self.components["flow_shift"]
172
+ ]
173
+ )
174
+
175
+ # Generate button click
176
+ self.components["generate_btn"].click(
177
+ fn=self.generate_video,
178
+ inputs=[
179
+ self.components["model_type"],
180
+ self.components["prompt"],
181
+ self.components["negative_prompt"],
182
+ self.components["prompt_prefix"],
183
+ self.components["width"],
184
+ self.components["height"],
185
+ self.components["num_frames"],
186
+ self.components["guidance_scale"],
187
+ self.components["flow_shift"],
188
+ self.components["lora_weight"],
189
+ self.components["inference_steps"],
190
+ self.components["enable_cpu_offload"],
191
+ self.components["fps"]
192
+ ],
193
+ outputs=[
194
+ self.components["preview_video"],
195
+ self.components["status"],
196
+ self.components["log"]
197
+ ]
198
+ )
199
+
200
+ def update_resolution(self, preset: str) -> Tuple[int, int, float]:
201
+ """Update resolution and flow shift based on preset"""
202
+ if preset == "480p":
203
+ return 832, 480, 3.0
204
+ elif preset == "720p":
205
+ return 1280, 720, 5.0
206
+ else:
207
+ return 832, 480, 3.0
208
+
209
+ def generate_video(
210
+ self,
211
+ model_type: str,
212
+ prompt: str,
213
+ negative_prompt: str,
214
+ prompt_prefix: str,
215
+ width: int,
216
+ height: int,
217
+ num_frames: int,
218
+ guidance_scale: float,
219
+ flow_shift: float,
220
+ lora_weight: float,
221
+ inference_steps: int,
222
+ enable_cpu_offload: bool,
223
+ fps: int
224
+ ) -> Tuple[Optional[str], str, str]:
225
+ """Handler for generate button click, delegates to preview service"""
226
+ return self.previewing_service.generate_video(
227
+ model_type=model_type,
228
+ prompt=prompt,
229
+ negative_prompt=negative_prompt,
230
+ prompt_prefix=prompt_prefix,
231
+ width=width,
232
+ height=height,
233
+ num_frames=num_frames,
234
+ guidance_scale=guidance_scale,
235
+ flow_shift=flow_shift,
236
+ lora_weight=lora_weight,
237
+ inference_steps=inference_steps,
238
+ enable_cpu_offload=enable_cpu_offload,
239
+ fps=fps
240
+ )
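The resolution preset dropdown drives three components from a single handler: the tuple returned by `update_resolution` is unpacked positionally into the listed outputs. A self-contained sketch of that wiring (variable names are illustrative, not the tab's actual component keys):

import gradio as gr

def update_resolution(preset: str):
    # same mapping as PreviewTab.update_resolution above
    return (1280, 720, 5.0) if preset == "720p" else (832, 480, 3.0)

with gr.Blocks() as demo:
    preset = gr.Dropdown(choices=["480p", "720p"], value="480p", label="Resolution Preset")
    width = gr.Number(value=832, precision=0, label="Width")
    height = gr.Number(value=480, precision=0, label="Height")
    flow_shift = gr.Slider(0.0, 10.0, value=3.0, step=0.1, label="Flow Shift")

    # one change event updates all three outputs; the returned tuple is
    # filled into [width, height, flow_shift] in order
    preset.change(fn=update_resolution, inputs=[preset], outputs=[width, height, flow_shift])

demo.launch()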
vms/tabs/split_tab.py CHANGED
@@ -57,7 +57,7 @@ class SplitTab(BaseTab):
57
 
58
  def list_unprocessed_videos(self) -> gr.Dataframe:
59
  """Update list of unprocessed videos"""
60
- videos = self.app.splitter.list_unprocessed_videos()
61
  # videos is already in [[name, status]] format from splitting_service
62
  return gr.Dataframe(
63
  headers=["name", "status"],
@@ -71,11 +71,11 @@ class SplitTab(BaseTab):
71
  Args:
72
  enable_splitting: Whether to split videos into scenes
73
  """
74
- if self.app.splitter.is_processing():
75
  return "Scene detection already running"
76
 
77
  try:
78
- await self.app.splitter.start_processing(enable_splitting)
79
  return "Scene detection completed"
80
  except Exception as e:
81
  return f"Error during scene detection: {str(e)}"
 
57
 
58
  def list_unprocessed_videos(self) -> gr.Dataframe:
59
  """Update list of unprocessed videos"""
60
+ videos = self.app.splitting.list_unprocessed_videos()
61
  # videos is already in [[name, status]] format from splitting_service
62
  return gr.Dataframe(
63
  headers=["name", "status"],
 
71
  Args:
72
  enable_splitting: Whether to split videos into scenes
73
  """
74
+ if self.app.splitting.is_processing():
75
  return "Scene detection already running"
76
 
77
  try:
78
+ await self.app.splitting.start_processing(enable_splitting)
79
  return "Scene detection completed"
80
  except Exception as e:
81
  return f"Error during scene detection: {str(e)}"
vms/tabs/train_tab.py CHANGED
@@ -380,7 +380,7 @@ class TrainTab(BaseTab):
380
 
381
  # Add an event handler for delete_checkpoints_btn
382
  self.components["delete_checkpoints_btn"].click(
383
- fn=lambda: self.app.trainer.delete_all_checkpoints(),
384
  outputs=[self.components["status_box"]]
385
  )
386
 
@@ -437,7 +437,7 @@ class TrainTab(BaseTab):
437
 
438
  # Start training (it will automatically use the checkpoint if provided)
439
  try:
440
- return self.app.trainer.start_training(
441
  model_internal_type,
442
  lora_rank,
443
  lora_alpha,
@@ -620,13 +620,13 @@ class TrainTab(BaseTab):
620
 
621
  def get_latest_status_message_and_logs(self) -> Tuple[str, str, str]:
622
  """Get latest status message, log content, and status code in a safer way"""
623
- state = self.app.trainer.get_status()
624
- logs = self.app.trainer.get_logs()
625
 
626
  # Check if training process died unexpectedly
627
  training_died = False
628
 
629
- if state["status"] == "training" and not self.app.trainer.is_training_running():
630
  state["status"] = "error"
631
  state["message"] = "Training process terminated unexpectedly."
632
  training_died = True
@@ -769,16 +769,16 @@ class TrainTab(BaseTab):
769
  status, _, _ = self.get_latest_status_message_and_logs()
770
 
771
  if status == "paused":
772
- self.app.trainer.resume_training()
773
  else:
774
- self.app.trainer.pause_training()
775
 
776
  # Return the updates separately for text and buttons
777
  return (*self.get_status_updates(), *self.get_button_updates())
778
 
779
  def handle_stop(self):
780
  """Handle stop button click"""
781
- self.app.trainer.stop_training()
782
 
783
  # Return the updates separately for text and buttons
784
  return (*self.get_status_updates(), *self.get_button_updates())
 
380
 
381
  # Add an event handler for delete_checkpoints_btn
382
  self.components["delete_checkpoints_btn"].click(
383
+ fn=lambda: self.app.training.delete_all_checkpoints(),
384
  outputs=[self.components["status_box"]]
385
  )
386
 
 
437
 
438
  # Start training (it will automatically use the checkpoint if provided)
439
  try:
440
+ return self.app.training.start_training(
441
  model_internal_type,
442
  lora_rank,
443
  lora_alpha,
 
620
 
621
  def get_latest_status_message_and_logs(self) -> Tuple[str, str, str]:
622
  """Get latest status message, log content, and status code in a safer way"""
623
+ state = self.app.training.get_status()
624
+ logs = self.app.training.get_logs()
625
 
626
  # Check if training process died unexpectedly
627
  training_died = False
628
 
629
+ if state["status"] == "training" and not self.app.training.is_training_running():
630
  state["status"] = "error"
631
  state["message"] = "Training process terminated unexpectedly."
632
  training_died = True
 
769
  status, _, _ = self.get_latest_status_message_and_logs()
770
 
771
  if status == "paused":
772
+ self.app.training.resume_training()
773
  else:
774
+ self.app.training.pause_training()
775
 
776
  # Return the updates separately for text and buttons
777
  return (*self.get_status_updates(), *self.get_button_updates())
778
 
779
  def handle_stop(self):
780
  """Handle stop button click"""
781
+ self.app.training.stop_training()
782
 
783
  # Return the updates separately for text and buttons
784
  return (*self.get_status_updates(), *self.get_button_updates())
vms/ui/video_trainer_ui.py CHANGED
@@ -5,7 +5,7 @@ import logging
5
  import asyncio
6
  from typing import Any, Optional, Dict, List, Union, Tuple
7
 
8
- from ..services import TrainingService, CaptioningService, SplittingService, ImportService, MonitoringService
9
  from ..config import (
10
  STORAGE_PATH, VIDEOS_TO_SPLIT_PATH, STAGING_PATH, OUTPUT_PATH,
11
  TRAINING_PATH, LOG_FILE_PATH, TRAINING_PRESETS, TRAINING_VIDEOS_PATH, MODEL_PATH, OUTPUT_PATH,
@@ -40,17 +40,18 @@ class VideoTrainerUI:
40
  def __init__(self):
41
  """Initialize services and tabs"""
42
  # Initialize core services
43
- self.trainer = TrainingService(self)
44
- self.splitter = SplittingService()
45
- self.importer = ImportService()
46
- self.captioner = CaptioningService()
47
- self.monitor = MonitoringService()
 
48
 
49
  # Start the monitoring service on app creation
50
- self.monitor.start_monitoring()
51
 
52
  # Recovery status from any interrupted training
53
- recovery_result = self.trainer.recover_interrupted_training()
54
  # Add null check for recovery_result
55
  if recovery_result is None:
56
  recovery_result = {"status": "unknown", "ui_updates": {}}
@@ -267,7 +268,7 @@ class VideoTrainerUI:
267
  if ui_state:
268
  current_state = self.load_ui_values()
269
  current_state.update(ui_state)
270
- self.trainer.save_ui_state(current_state)
271
  logger.info(f"Updated UI state from recovery: {ui_state}")
272
 
273
  # Load values (potentially with recovery updates applied)
@@ -384,15 +385,15 @@ class VideoTrainerUI:
384
 
385
  def update_ui_state(self, **kwargs):
386
  """Update UI state with new values"""
387
- current_state = self.trainer.load_ui_state()
388
  current_state.update(kwargs)
389
- self.trainer.save_ui_state(current_state)
390
  # Don't return anything to avoid Gradio warnings
391
  return None
392
 
393
  def load_ui_values(self):
394
  """Load UI state values for initializing form fields"""
395
- ui_state = self.trainer.load_ui_state()
396
 
397
  # Ensure proper type conversion for numeric values
398
  ui_state["lora_rank"] = ui_state.get("lora_rank", DEFAULT_LORA_RANK_STR)
@@ -407,7 +408,7 @@ class VideoTrainerUI:
407
  # Add this new method to get initial button states:
408
  def get_initial_button_states(self):
409
  """Get the initial states for training buttons based on recovery status"""
410
- recovery_result = self.state.get("recovery_result") or self.trainer.recover_interrupted_training()
411
  ui_updates = recovery_result.get("ui_updates", {})
412
 
413
  # Check for checkpoints to determine start button text
@@ -415,7 +416,7 @@ class VideoTrainerUI:
415
 
416
  # Default button states if recovery didn't provide any
417
  if not ui_updates or not ui_updates.get("start_btn"):
418
- is_training = self.trainer.is_training_running()
419
 
420
  if is_training:
421
  # Active training detected
 
5
  import asyncio
6
  from typing import Any, Optional, Dict, List, Union, Tuple
7
 
8
+ from ..services import TrainingService, CaptioningService, SplittingService, ImportingService, PreviewingService, MonitoringService
9
  from ..config import (
10
  STORAGE_PATH, VIDEOS_TO_SPLIT_PATH, STAGING_PATH, OUTPUT_PATH,
11
  TRAINING_PATH, LOG_FILE_PATH, TRAINING_PRESETS, TRAINING_VIDEOS_PATH, MODEL_PATH, OUTPUT_PATH,
 
40
  def __init__(self):
41
  """Initialize services and tabs"""
42
  # Initialize core services
43
+ self.training = TrainingService(self)
44
+ self.splitting = SplittingService()
45
+ self.importing = ImportingService()
46
+ self.captioning = CaptioningService()
47
+ self.monitoring = MonitoringService()
48
+ self.previewing = PreviewingService()
49
 
50
  # Start the monitoring service on app creation
51
+ self.monitoring.start_monitoring()
52
 
53
  # Recovery status from any interrupted training
54
+ recovery_result = self.training.recover_interrupted_training()
55
  # Add null check for recovery_result
56
  if recovery_result is None:
57
  recovery_result = {"status": "unknown", "ui_updates": {}}
 
268
  if ui_state:
269
  current_state = self.load_ui_values()
270
  current_state.update(ui_state)
271
+ self.training.save_ui_state(current_state)
272
  logger.info(f"Updated UI state from recovery: {ui_state}")
273
 
274
  # Load values (potentially with recovery updates applied)
 
385
 
386
  def update_ui_state(self, **kwargs):
387
  """Update UI state with new values"""
388
+ current_state = self.training.load_ui_state()
389
  current_state.update(kwargs)
390
+ self.training.save_ui_state(current_state)
391
  # Don't return anything to avoid Gradio warnings
392
  return None
393
 
394
  def load_ui_values(self):
395
  """Load UI state values for initializing form fields"""
396
+ ui_state = self.training.load_ui_state()
397
 
398
  # Ensure proper type conversion for numeric values
399
  ui_state["lora_rank"] = ui_state.get("lora_rank", DEFAULT_LORA_RANK_STR)
 
408
  # Add this new method to get initial button states:
409
  def get_initial_button_states(self):
410
  """Get the initial states for training buttons based on recovery status"""
411
+ recovery_result = self.state.get("recovery_result") or self.training.recover_interrupted_training()
412
  ui_updates = recovery_result.get("ui_updates", {})
413
 
414
  # Check for checkpoints to determine start button text
 
416
 
417
  # Default button states if recovery didn't provide any
418
  if not ui_updates or not ui_updates.get("start_btn"):
419
+ is_training = self.training.is_training_running()
420
 
421
  if is_training:
422
  # Active training detected