jbilcke-hf (HF staff) committed on
Commit
c8cb798
·
1 Parent(s): 98d3630

working on the new Preview tab

docs/diffusers/Load schedulers and models in Diffusers.md ADDED
@@ -0,0 +1,199 @@

Load schedulers and models
==========================

![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)

![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)

Diffusion pipelines are a collection of interchangeable schedulers and models that can be mixed and matched to tailor a pipeline to a specific use case. The scheduler encapsulates the entire denoising process, such as the number of denoising steps and the algorithm for finding the denoised sample. A scheduler is not parameterized or trained, so it doesn’t take very much memory. The model is usually only concerned with the forward pass of going from a noisy input to a less noisy sample.

This guide will show you how to load schedulers and models to customize a pipeline. You’ll use the [stable-diffusion-v1-5/stable-diffusion-v1-5](https://hf.co/stable-diffusion-v1-5/stable-diffusion-v1-5) checkpoint throughout this guide, so let’s load it first.

```py
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
```

You can see what scheduler this pipeline uses with the `pipeline.scheduler` attribute.

```py
pipeline.scheduler
PNDMScheduler {
  "_class_name": "PNDMScheduler",
  "_diffusers_version": "0.21.4",
  "beta_end": 0.012,
  "beta_schedule": "scaled_linear",
  "beta_start": 0.00085,
  "clip_sample": false,
  "num_train_timesteps": 1000,
  "set_alpha_to_one": false,
  "skip_prk_steps": true,
  "steps_offset": 1,
  "timestep_spacing": "leading",
  "trained_betas": null
}
```

Load a scheduler
----------------

Schedulers are defined by a configuration file that can be used by a variety of schedulers. Load a scheduler with the [SchedulerMixin.from_pretrained()](/docs/diffusers/main/en/api/schedulers/overview#diffusers.SchedulerMixin.from_pretrained) method, and specify the `subfolder` parameter to load the configuration file from the correct subfolder of the pipeline repository.

For example, to load the [DDIMScheduler](/docs/diffusers/main/en/api/schedulers/ddim#diffusers.DDIMScheduler):

```py
from diffusers import DDIMScheduler, DiffusionPipeline

ddim = DDIMScheduler.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", subfolder="scheduler")
```

Then you can pass the newly loaded scheduler to the pipeline.

```py
pipeline = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", scheduler=ddim, torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
```

Compare schedulers
------------------

Schedulers have their own unique strengths and weaknesses, making it difficult to quantitatively compare which scheduler works best for a pipeline. You typically have to make a trade-off between denoising speed and denoising quality. We recommend trying out different schedulers to find one that works best for your use case. Check the `pipeline.scheduler.compatibles` attribute to see which schedulers are compatible with a pipeline.
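
For instance, you can list the compatible scheduler classes for the pipeline loaded above (a minimal sketch; `pipeline` is the Stable Diffusion pipeline from the previous section):

```py
# Print the scheduler classes that can be swapped into this pipeline
for scheduler_class in pipeline.scheduler.compatibles:
    print(scheduler_class.__name__)
```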

Let’s compare the [LMSDiscreteScheduler](/docs/diffusers/main/en/api/schedulers/lms_discrete#diffusers.LMSDiscreteScheduler), [EulerDiscreteScheduler](/docs/diffusers/main/en/api/schedulers/euler#diffusers.EulerDiscreteScheduler), [EulerAncestralDiscreteScheduler](/docs/diffusers/main/en/api/schedulers/euler_ancestral#diffusers.EulerAncestralDiscreteScheduler), and the [DPMSolverMultistepScheduler](/docs/diffusers/main/en/api/schedulers/multistep_dpm_solver#diffusers.DPMSolverMultistepScheduler) on the following prompt and seed.

```py
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

prompt = "A photograph of an astronaut riding a horse on Mars, high resolution, high definition."
generator = torch.Generator(device="cuda").manual_seed(8)
```

To change the pipeline’s scheduler, use the [from_config()](/docs/diffusers/main/en/api/configuration#diffusers.ConfigMixin.from_config) method to load a different scheduler’s `pipeline.scheduler.config` into the pipeline. For example, the [LMSDiscreteScheduler](/docs/diffusers/main/en/api/schedulers/lms_discrete#diffusers.LMSDiscreteScheduler) typically generates higher quality images than the default scheduler.

```py
from diffusers import LMSDiscreteScheduler

pipeline.scheduler = LMSDiscreteScheduler.from_config(pipeline.scheduler.config)
image = pipeline(prompt, generator=generator).images[0]
image
```

![](https://huggingface.co/datasets/patrickvonplaten/images/resolve/main/diffusers_docs/astronaut_lms.png)

LMSDiscreteScheduler

![](https://huggingface.co/datasets/patrickvonplaten/images/resolve/main/diffusers_docs/astronaut_euler_discrete.png)

EulerDiscreteScheduler

![](https://huggingface.co/datasets/patrickvonplaten/images/resolve/main/diffusers_docs/astronaut_euler_ancestral.png)

EulerAncestralDiscreteScheduler

![](https://huggingface.co/datasets/patrickvonplaten/images/resolve/main/diffusers_docs/astronaut_dpm.png)

DPMSolverMultistepScheduler

Most images look very similar and are comparable in quality. Again, it often comes down to your specific use case, so a good approach is to run multiple different schedulers and compare the results.
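
If you prefer to script that comparison rather than swap schedulers by hand, a small loop like the sketch below (an illustrative example that reuses the `pipeline` and `prompt` defined above; the output filenames are arbitrary) loads each scheduler with `from_config()` and saves one image per scheduler with the same seed:

```py
import torch
from diffusers import (
    DPMSolverMultistepScheduler,
    EulerAncestralDiscreteScheduler,
    EulerDiscreteScheduler,
    LMSDiscreteScheduler,
)

schedulers = {
    "lms": LMSDiscreteScheduler,
    "euler": EulerDiscreteScheduler,
    "euler_ancestral": EulerAncestralDiscreteScheduler,
    "dpm_multistep": DPMSolverMultistepScheduler,
}

for name, scheduler_class in schedulers.items():
    # Only the scheduler changes; the config, prompt, and seed stay the same
    pipeline.scheduler = scheduler_class.from_config(pipeline.scheduler.config)
    generator = torch.Generator(device="cuda").manual_seed(8)
    image = pipeline(prompt, generator=generator).images[0]
    image.save(f"astronaut_{name}.png")
```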

### Flax schedulers

To compare Flax schedulers, you need to additionally load the scheduler state into the model parameters. For example, let’s change the default scheduler in [FlaxStableDiffusionPipeline](/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img#diffusers.FlaxStableDiffusionPipeline) to use the super fast `FlaxDPMSolverMultistepScheduler`.

The `FlaxLMSDiscreteScheduler` and `FlaxDDPMScheduler` are not compatible with the [FlaxStableDiffusionPipeline](/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img#diffusers.FlaxStableDiffusionPipeline) yet.

```py
import jax
import numpy as np
from flax.jax_utils import replicate
from flax.training.common_utils import shard
from diffusers import FlaxStableDiffusionPipeline, FlaxDPMSolverMultistepScheduler

scheduler, scheduler_state = FlaxDPMSolverMultistepScheduler.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    subfolder="scheduler"
)
pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    scheduler=scheduler,
    variant="bf16",
    dtype=jax.numpy.bfloat16,
)
params["scheduler"] = scheduler_state
```

Then you can take advantage of Flax’s compatibility with TPUs to generate a number of images in parallel. You’ll need to make a copy of the model parameters for each available device and then split the inputs across them to generate your desired number of images.

```py
# Generate 1 image per parallel device (8 on TPUv2-8 or TPUv3-8)
prompt = "A photograph of an astronaut riding a horse on Mars, high resolution, high definition."
num_samples = jax.device_count()
prompt_ids = pipeline.prepare_inputs([prompt] * num_samples)

prng_seed = jax.random.PRNGKey(0)
num_inference_steps = 25

# shard inputs and rng
params = replicate(params)
prng_seed = jax.random.split(prng_seed, jax.device_count())
prompt_ids = shard(prompt_ids)

images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images
images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:])))
```

Models
------

Models are loaded from the [ModelMixin.from_pretrained()](/docs/diffusers/main/en/api/models/overview#diffusers.ModelMixin.from_pretrained) method, which downloads and caches the latest version of the model weights and configurations. If the latest files are available in the local cache, [from_pretrained()](/docs/diffusers/main/en/api/models/overview#diffusers.ModelMixin.from_pretrained) reuses files in the cache instead of re-downloading them.

Models can be loaded from a subfolder with the `subfolder` argument. For example, the model weights for [stable-diffusion-v1-5/stable-diffusion-v1-5](https://hf.co/stable-diffusion-v1-5/stable-diffusion-v1-5) are stored in the [unet](https://hf.co/stable-diffusion-v1-5/stable-diffusion-v1-5/tree/main/unet) subfolder.

```py
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", subfolder="unet", use_safetensors=True)
```

They can also be directly loaded from a [repository](https://huggingface.co/google/ddpm-cifar10-32/tree/main).

```py
from diffusers import UNet2DModel

unet = UNet2DModel.from_pretrained("google/ddpm-cifar10-32", use_safetensors=True)
```

To load and save model variants, specify the `variant` argument in [ModelMixin.from_pretrained()](/docs/diffusers/main/en/api/models/overview#diffusers.ModelMixin.from_pretrained) and [ModelMixin.save_pretrained()](/docs/diffusers/main/en/api/models/overview#diffusers.ModelMixin.save_pretrained).

```py
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", subfolder="unet", variant="non_ema", use_safetensors=True
)
unet.save_pretrained("./local-unet", variant="non_ema")
```

docs/diffusers/Loading pipelines in Diffusers.md ADDED
@@ -0,0 +1,528 @@

Load pipelines
==============

![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)

![Open In Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)

Diffusion systems consist of multiple components like parameterized models and schedulers that interact in complex ways. That is why we designed the [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline) to wrap the complexity of the entire diffusion system into an easy-to-use API. At the same time, the [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline) is entirely customizable so you can modify each component to build a diffusion system for your use case.

This guide will show you how to load:

* pipelines from the Hub and locally
* different components into a pipeline
* multiple pipelines without increasing memory usage
* checkpoint variants such as different floating point types or non-exponential mean averaged (EMA) weights

Load a pipeline
---------------

Skip to the [DiffusionPipeline explained](#diffusionpipeline-explained) section if you’re interested in an explanation about how the [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline) class works.

There are two ways to load a pipeline for a task:

1. Load the generic [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline) class and allow it to automatically detect the correct pipeline class from the checkpoint.
2. Load a specific pipeline class for a specific task.

The [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline) class is a simple and generic way to load the latest trending diffusion model from the [Hub](https://huggingface.co/models?library=diffusers&sort=trending). It uses the [from_pretrained()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pretrained) method to automatically detect the correct pipeline class for a task from the checkpoint, downloads and caches all the required configuration and weight files, and returns a pipeline ready for inference.

```py
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True)
```

This same checkpoint can also be used for an image-to-image task. The [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline) class can handle any task as long as you provide the appropriate inputs. For example, for an image-to-image task, you need to pass an initial image to the pipeline.

```py
from diffusers import DiffusionPipeline
from diffusers.utils import load_image

pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True)

init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png")
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipeline(prompt, image=init_image).images[0]
```

Use the Space embedded in the online version of this page to gauge a pipeline’s memory requirements before you download and load it to see if it runs on your hardware.
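
As a rough local alternative to that Space (a sketch, not the Space itself), you can sum the parameter sizes of a pipeline once it is loaded to see how much memory its weights occupy:

```py
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", use_safetensors=True)

def pipeline_weight_size_gb(pipe):
    # Add up the parameter bytes of every torch.nn.Module component in the pipeline
    total_bytes = 0
    for name, component in pipe.components.items():
        if isinstance(component, torch.nn.Module):
            total_bytes += sum(p.numel() * p.element_size() for p in component.parameters())
    return total_bytes / 1024**3

print(f"Approximate weight size: {pipeline_weight_size_gb(pipeline):.2f} GB")
```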

### Local pipeline

To load a pipeline locally, use [git-lfs](https://git-lfs.github.com/) to manually download a checkpoint to your local disk.

```bash
git-lfs install
git clone https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5
```

This creates a local folder, ./stable-diffusion-v1-5, on your disk and you should pass its path to [from_pretrained()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pretrained).

```py
from diffusers import DiffusionPipeline

stable_diffusion = DiffusionPipeline.from_pretrained("./stable-diffusion-v1-5", use_safetensors=True)
```

The [from_pretrained()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pretrained) method won’t download files from the Hub when it detects a local path, but this also means it won’t download and cache the latest changes to a checkpoint.

Customize a pipeline
--------------------

You can customize a pipeline by loading different components into it. This is important because you can:

* change to a scheduler with faster generation speed or higher generation quality depending on your needs (check the `pipeline.scheduler.compatibles` attribute on your pipeline to see compatible schedulers)
* change a default pipeline component to a newer and better performing one

For example, let’s customize the default [stabilityai/stable-diffusion-xl-base-1.0](https://hf.co/stabilityai/stable-diffusion-xl-base-1.0) checkpoint with:

* The [HeunDiscreteScheduler](/docs/diffusers/main/en/api/schedulers/heun#diffusers.HeunDiscreteScheduler) to generate higher quality images at the expense of slower generation speed. You must pass the `subfolder="scheduler"` parameter in [from_pretrained()](/docs/diffusers/main/en/api/schedulers/overview#diffusers.SchedulerMixin.from_pretrained) to load the scheduler configuration from the correct [subfolder](https://hf.co/stabilityai/stable-diffusion-xl-base-1.0/tree/main/scheduler) of the pipeline repository.
* A more stable VAE that runs in fp16.

```py
from diffusers import StableDiffusionXLPipeline, HeunDiscreteScheduler, AutoencoderKL
import torch

scheduler = HeunDiscreteScheduler.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", subfolder="scheduler")
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16, use_safetensors=True)
```

Now pass the new scheduler and VAE to the [StableDiffusionXLPipeline](/docs/diffusers/main/en/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline).

```py
pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    scheduler=scheduler,
    vae=vae,
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True
).to("cuda")
```

Reuse a pipeline
----------------

When you load multiple pipelines that share the same model components, it makes sense to reuse the shared components instead of reloading everything into memory again, especially if your hardware is memory-constrained. For example:

1. You generated an image with the [StableDiffusionPipeline](/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline) but you want to improve its quality with the [StableDiffusionSAGPipeline](/docs/diffusers/main/en/api/pipelines/self_attention_guidance#diffusers.StableDiffusionSAGPipeline). Both of these pipelines share the same pretrained model, so it’d be a waste of memory to load the same model twice.
2. You want to add a model component, like a [`MotionAdapter`](../api/pipelines/animatediff#animatediffpipeline), to [AnimateDiffPipeline](/docs/diffusers/main/en/api/pipelines/animatediff#diffusers.AnimateDiffPipeline) which was instantiated from an existing [StableDiffusionPipeline](/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline). Again, both pipelines share the same pretrained model, so it’d be a waste of memory to load an entirely new pipeline again.

With the [DiffusionPipeline.from_pipe()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pipe) API, you can switch between multiple pipelines to take advantage of their different features without increasing memory usage. It is similar to turning on and off a feature in your pipeline.

To switch between tasks (rather than features), use the [from_pipe()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pipe) method with the [AutoPipeline](../api/pipelines/auto_pipeline) class, which automatically identifies the pipeline class based on the task (learn more in the [AutoPipeline](../tutorials/autopipeline) tutorial).
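
For example, a minimal sketch of switching from text-to-image to image-to-image with the AutoPipeline classes (assuming the same Stable Diffusion checkpoint used throughout this guide):

```py
import torch
from diffusers import AutoPipelineForText2Image, AutoPipelineForImage2Image

pipe_t2i = AutoPipelineForText2Image.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

# Reuse the already-loaded components for an image-to-image pipeline instead of reloading them
pipe_i2i = AutoPipelineForImage2Image.from_pipe(pipe_t2i)
```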
117
+
118
+ Let’s start with a [StableDiffusionPipeline](/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline) and then reuse the loaded model components to create a [StableDiffusionSAGPipeline](/docs/diffusers/main/en/api/pipelines/self_attention_guidance#diffusers.StableDiffusionSAGPipeline) to increase generation quality. You’ll use the [StableDiffusionPipeline](/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline) with an [IP-Adapter](./ip_adapter) to generate a bear eating pizza.
119
+
120
+ Copied
121
+
122
+ from diffusers import DiffusionPipeline, StableDiffusionSAGPipeline
123
+ import torch
124
+ import gc
125
+ from diffusers.utils import load\_image
126
+ from accelerate.utils import compute\_module\_sizes
127
+
128
+ image = load\_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/load\_neg\_embed.png")
129
+
130
+ pipe\_sd = DiffusionPipeline.from\_pretrained("SG161222/Realistic\_Vision\_V6.0\_B1\_noVAE", torch\_dtype=torch.float16)
131
+ pipe\_sd.load\_ip\_adapter("h94/IP-Adapter", subfolder="models", weight\_name="ip-adapter\_sd15.bin")
132
+ pipe\_sd.set\_ip\_adapter\_scale(0.6)
133
+ pipe\_sd.to("cuda")
134
+
135
+ generator = torch.Generator(device="cpu").manual\_seed(33)
136
+ out\_sd = pipe\_sd(
137
+ prompt="bear eats pizza",
138
+ negative\_prompt="wrong white balance, dark, sketches,worst quality,low quality",
139
+ ip\_adapter\_image=image,
140
+ num\_inference\_steps=50,
141
+ generator=generator,
142
+ ).images\[0\]
143
+ out\_sd
144
+
145
+ ![](https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/from_pipe_out_sd_0.png)
146
+
147
+ For reference, you can check how much memory this process consumed.
148
+
149
+ Copied
150
+
151
+ def bytes\_to\_giga\_bytes(bytes):
152
+ return bytes / 1024 / 1024 / 1024
153
+ print(f"Max memory allocated: {bytes\_to\_giga\_bytes(torch.cuda.max\_memory\_allocated())} GB")
154
+ "Max memory allocated: 4.406213283538818 GB"
155
+
156
+ Now, reuse the same pipeline components from [StableDiffusionPipeline](/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline) in [StableDiffusionSAGPipeline](/docs/diffusers/main/en/api/pipelines/self_attention_guidance#diffusers.StableDiffusionSAGPipeline) with the [from\_pipe()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pipe) method.
157
+
158
+ Some pipeline methods may not function properly on new pipelines created with [from\_pipe()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pipe). For instance, the [enable\_model\_cpu\_offload()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.enable_model_cpu_offload) method installs hooks on the model components based on a unique offloading sequence for each pipeline. If the models are executed in a different order in the new pipeline, the CPU offloading may not work correctly.
159
+
160
+ To ensure everything works as expected, we recommend re-applying a pipeline method on a new pipeline created with [from\_pipe()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pipe).
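
For instance, if the source pipeline relied on CPU offloading, a reasonable pattern (a sketch, not taken from the original guide) is to call the method again on the pipeline returned by [from_pipe()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pipe):

```py
# Re-apply the offloading hooks on the new pipeline so they match its own execution order
new_pipe = StableDiffusionSAGPipeline.from_pipe(pipe_sd)
new_pipe.enable_model_cpu_offload()
```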

```py
pipe_sag = StableDiffusionSAGPipeline.from_pipe(
    pipe_sd
)

generator = torch.Generator(device="cpu").manual_seed(33)
out_sag = pipe_sag(
    prompt="bear eats pizza",
    negative_prompt="wrong white balance, dark, sketches,worst quality,low quality",
    ip_adapter_image=image,
    num_inference_steps=50,
    generator=generator,
    guidance_scale=1.0,
    sag_scale=0.75
).images[0]
out_sag
```

![](https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/from_pipe_out_sag_1.png)

If you check the memory usage, you’ll see it remains the same as before because [StableDiffusionPipeline](/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline) and [StableDiffusionSAGPipeline](/docs/diffusers/main/en/api/pipelines/self_attention_guidance#diffusers.StableDiffusionSAGPipeline) are sharing the same pipeline components. This allows you to use them interchangeably without any additional memory overhead.

```py
print(f"Max memory allocated: {bytes_to_giga_bytes(torch.cuda.max_memory_allocated())} GB")
"Max memory allocated: 4.406213283538818 GB"
```

Let’s animate the image with the [AnimateDiffPipeline](/docs/diffusers/main/en/api/pipelines/animatediff#diffusers.AnimateDiffPipeline) and also add a `MotionAdapter` module to the pipeline. For the [AnimateDiffPipeline](/docs/diffusers/main/en/api/pipelines/animatediff#diffusers.AnimateDiffPipeline), you need to unload the IP-Adapter first and reload it _after_ you’ve created your new pipeline (this only applies to the [AnimateDiffPipeline](/docs/diffusers/main/en/api/pipelines/animatediff#diffusers.AnimateDiffPipeline)).

```py
from diffusers import AnimateDiffPipeline, MotionAdapter, DDIMScheduler
from diffusers.utils import export_to_gif

pipe_sag.unload_ip_adapter()
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)

pipe_animate = AnimateDiffPipeline.from_pipe(pipe_sd, motion_adapter=adapter)
pipe_animate.scheduler = DDIMScheduler.from_config(pipe_animate.scheduler.config, beta_schedule="linear")
# load IP-Adapter and LoRA weights again
pipe_animate.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
pipe_animate.load_lora_weights("guoyww/animatediff-motion-lora-zoom-out", adapter_name="zoom-out")
pipe_animate.to("cuda")

generator = torch.Generator(device="cpu").manual_seed(33)
pipe_animate.set_adapters("zoom-out", adapter_weights=0.75)
out = pipe_animate(
    prompt="bear eats pizza",
    num_frames=16,
    num_inference_steps=50,
    ip_adapter_image=image,
    generator=generator,
).frames[0]
export_to_gif(out, "out_animate.gif")
```

![](https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/from_pipe_out_animate_3.gif)

The [AnimateDiffPipeline](/docs/diffusers/main/en/api/pipelines/animatediff#diffusers.AnimateDiffPipeline) is more memory-intensive and consumes 15GB of memory (see the [Memory usage of from_pipe](#memory-usage-of-from_pipe) section to learn what this means for your memory usage).

```py
print(f"Max memory allocated: {bytes_to_giga_bytes(torch.cuda.max_memory_allocated())} GB")
"Max memory allocated: 15.178664207458496 GB"
```

### Modify from_pipe components

Pipelines loaded with [from_pipe()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pipe) can be customized with different model components or methods. However, whenever you modify the _state_ of the model components, it affects all the other pipelines that share the same components. For example, if you call [unload_ip_adapter()](/docs/diffusers/main/en/api/loaders/ip_adapter#diffusers.loaders.IPAdapterMixin.unload_ip_adapter) on the [StableDiffusionSAGPipeline](/docs/diffusers/main/en/api/pipelines/self_attention_guidance#diffusers.StableDiffusionSAGPipeline), you won’t be able to use IP-Adapter with the [StableDiffusionPipeline](/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline) because it’s been removed from their shared components.

```py
pipe_sag.unload_ip_adapter()

generator = torch.Generator(device="cpu").manual_seed(33)
out_sd = pipe_sd(
    prompt="bear eats pizza",
    negative_prompt="wrong white balance, dark, sketches,worst quality,low quality",
    ip_adapter_image=image,
    num_inference_steps=50,
    generator=generator,
).images[0]
"AttributeError: 'NoneType' object has no attribute 'image_projection_layers'"
```

### Memory usage of from_pipe

The memory requirement of loading multiple pipelines with [from_pipe()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pipe) is determined by the pipeline with the highest memory usage, regardless of the number of pipelines you create.

| Pipeline | Memory usage (GB) |
|---|---|
| StableDiffusionPipeline | 4.400 |
| StableDiffusionSAGPipeline | 4.400 |
| AnimateDiffPipeline | 15.178 |

The [AnimateDiffPipeline](/docs/diffusers/main/en/api/pipelines/animatediff#diffusers.AnimateDiffPipeline) has the highest memory requirement, so the _total memory usage_ is based only on the [AnimateDiffPipeline](/docs/diffusers/main/en/api/pipelines/animatediff#diffusers.AnimateDiffPipeline). Your memory usage will not increase if you create additional pipelines as long as their memory requirements don’t exceed that of the [AnimateDiffPipeline](/docs/diffusers/main/en/api/pipelines/animatediff#diffusers.AnimateDiffPipeline). Each pipeline can be used interchangeably without any additional memory overhead.

Safety checker
--------------

Diffusers implements a [safety checker](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py) for Stable Diffusion models which can generate harmful content. The safety checker screens the generated output against known hardcoded not-safe-for-work (NSFW) content. If for whatever reason you’d like to disable the safety checker, pass `safety_checker=None` to the [from_pretrained()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pretrained) method.

```py
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", safety_checker=None, use_safetensors=True)
"""
You have disabled the safety checker for <class 'diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline'> by passing `safety_checker=None`. Ensure that you abide by the conditions of the Stable Diffusion license and do not expose unfiltered results in services or applications open to the public. Both the diffusers team and Hugging Face strongly recommend keeping the safety filter enabled in all public-facing circumstances, disabling it only for use cases that involve analyzing network behavior or auditing its results. For more information, please have a look at https://github.com/huggingface/diffusers/pull/254 .
"""
```

Checkpoint variants
-------------------

A checkpoint variant is usually a checkpoint whose weights are:

* Stored in a different floating point type, such as [torch.float16](https://pytorch.org/docs/stable/tensors.html#data-types), because it only requires half the bandwidth and storage to download. You can’t use this variant if you’re continuing training or using a CPU.
* Non-exponential mean averaged (EMA) weights which shouldn’t be used for inference. You should use this variant to continue finetuning a model.

When the checkpoints have identical model structures, but they were trained on different datasets and with a different training setup, they should be stored in separate repositories. For example, [stabilityai/stable-diffusion-2](https://hf.co/stabilityai/stable-diffusion-2) and [stabilityai/stable-diffusion-2-1](https://hf.co/stabilityai/stable-diffusion-2-1) are stored in separate repositories.

Otherwise, a variant is **identical** to the original checkpoint. They have exactly the same serialization format (like [safetensors](./using_safetensors)), model structure, and their weights have identical tensor shapes.

| **checkpoint type** | **weight name** | **argument for loading weights** |
|---|---|---|
| original | diffusion_pytorch_model.safetensors | |
| floating point | diffusion_pytorch_model.fp16.safetensors | `variant`, `torch_dtype` |
| non-EMA | diffusion_pytorch_model.non_ema.safetensors | `variant` |

There are two important arguments for loading variants:

* `torch_dtype` specifies the floating point precision of the loaded checkpoint. For example, if you want to save bandwidth by loading a fp16 variant, you should set `variant="fp16"` and `torch_dtype=torch.float16` to _convert the weights_ to fp16. Otherwise, the fp16 weights are converted to the default fp32 precision.

  If you only set `torch_dtype=torch.float16`, the default fp32 weights are downloaded first and then converted to fp16.

* `variant` specifies which files should be loaded from the repository. For example, if you want to load a non-EMA variant of a UNet from [stable-diffusion-v1-5/stable-diffusion-v1-5](https://hf.co/stable-diffusion-v1-5/stable-diffusion-v1-5/tree/main/unet), set `variant="non_ema"` to download the `non_ema` file.

To load a fp16 variant, for example:

```py
from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", variant="fp16", torch_dtype=torch.float16, use_safetensors=True
)
```

Use the `variant` parameter in the [DiffusionPipeline.save_pretrained()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.save_pretrained) method to save a checkpoint as a different floating point type or as a non-EMA variant. You should try to save a variant to the same folder as the original checkpoint, so you have the option of loading both from the same folder. For example, to save the fp16 variant:

```py
from diffusers import DiffusionPipeline

pipeline.save_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", variant="fp16")
```

If you don’t save the variant to an existing folder, you must specify the `variant` argument otherwise it’ll throw an `Exception` because it can’t find the original checkpoint.

```py
# 👎 this won't work
pipeline = DiffusionPipeline.from_pretrained(
    "./stable-diffusion-v1-5", torch_dtype=torch.float16, use_safetensors=True
)
# 👍 this works
pipeline = DiffusionPipeline.from_pretrained(
    "./stable-diffusion-v1-5", variant="fp16", torch_dtype=torch.float16, use_safetensors=True
)
```

DiffusionPipeline explained
---------------------------

As a class method, [DiffusionPipeline.from_pretrained()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pretrained) is responsible for two things:

* Download the latest version of the folder structure required for inference and cache it. If the latest folder structure is available in the local cache, [DiffusionPipeline.from_pretrained()](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline.from_pretrained) reuses the cache and won’t redownload the files.
* Load the cached weights into the correct pipeline [class](../api/pipelines/overview#diffusers-summary) - retrieved from the `model_index.json` file - and return an instance of it.

The pipelines’ underlying folder structure corresponds directly with their class instances. For example, the [StableDiffusionPipeline](/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline) corresponds to the folder structure in [`stable-diffusion-v1-5/stable-diffusion-v1-5`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5).

```py
from diffusers import DiffusionPipeline

repo_id = "stable-diffusion-v1-5/stable-diffusion-v1-5"
pipeline = DiffusionPipeline.from_pretrained(repo_id, use_safetensors=True)
print(pipeline)
```

You’ll see pipeline is an instance of [StableDiffusionPipeline](/docs/diffusers/main/en/api/pipelines/stable_diffusion/text2img#diffusers.StableDiffusionPipeline), which consists of seven components:

* `"feature_extractor"`: a [CLIPImageProcessor](https://huggingface.co/docs/transformers/main/en/model_doc/clip#transformers.CLIPImageProcessor) from 🤗 Transformers.
* `"safety_checker"`: a [component](https://github.com/huggingface/diffusers/blob/e55687e1e15407f60f32242027b7bb8170e58266/src/diffusers/pipelines/stable_diffusion/safety_checker.py#L32) for screening against harmful content.
* `"scheduler"`: an instance of [PNDMScheduler](/docs/diffusers/main/en/api/schedulers/pndm#diffusers.PNDMScheduler).
* `"text_encoder"`: a [CLIPTextModel](https://huggingface.co/docs/transformers/main/en/model_doc/clip#transformers.CLIPTextModel) from 🤗 Transformers.
* `"tokenizer"`: a [CLIPTokenizer](https://huggingface.co/docs/transformers/main/en/model_doc/clip#transformers.CLIPTokenizer) from 🤗 Transformers.
* `"unet"`: an instance of [UNet2DConditionModel](/docs/diffusers/main/en/api/models/unet2d-cond#diffusers.UNet2DConditionModel).
* `"vae"`: an instance of [AutoencoderKL](/docs/diffusers/main/en/api/models/autoencoderkl#diffusers.AutoencoderKL).

```
StableDiffusionPipeline {
  "feature_extractor": [
    "transformers",
    "CLIPImageProcessor"
  ],
  "safety_checker": [
    "stable_diffusion",
    "StableDiffusionSafetyChecker"
  ],
  "scheduler": [
    "diffusers",
    "PNDMScheduler"
  ],
  "text_encoder": [
    "transformers",
    "CLIPTextModel"
  ],
  "tokenizer": [
    "transformers",
    "CLIPTokenizer"
  ],
  "unet": [
    "diffusers",
    "UNet2DConditionModel"
  ],
  "vae": [
    "diffusers",
    "AutoencoderKL"
  ]
}
```

Compare the components of the pipeline instance to the [`stable-diffusion-v1-5/stable-diffusion-v1-5`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/tree/main) folder structure, and you’ll see there is a separate folder for each of the components in the repository:

```
.
├── feature_extractor
│   └── preprocessor_config.json
├── model_index.json
├── safety_checker
│   ├── config.json
│   ├── model.fp16.safetensors
│   ├── model.safetensors
│   ├── pytorch_model.bin
│   └── pytorch_model.fp16.bin
├── scheduler
│   └── scheduler_config.json
├── text_encoder
│   ├── config.json
│   ├── model.fp16.safetensors
│   ├── model.safetensors
│   ├── pytorch_model.bin
│   └── pytorch_model.fp16.bin
├── tokenizer
│   ├── merges.txt
│   ├── special_tokens_map.json
│   ├── tokenizer_config.json
│   └── vocab.json
├── unet
│   ├── config.json
│   ├── diffusion_pytorch_model.bin
│   ├── diffusion_pytorch_model.fp16.bin
│   ├── diffusion_pytorch_model.fp16.safetensors
│   ├── diffusion_pytorch_model.non_ema.bin
│   ├── diffusion_pytorch_model.non_ema.safetensors
│   └── diffusion_pytorch_model.safetensors
└── vae
    ├── config.json
    ├── diffusion_pytorch_model.bin
    ├── diffusion_pytorch_model.fp16.bin
    ├── diffusion_pytorch_model.fp16.safetensors
    └── diffusion_pytorch_model.safetensors
```

You can access each of the components of the pipeline as an attribute to view its configuration:

```py
pipeline.tokenizer
CLIPTokenizer(
    name_or_path="/root/.cache/huggingface/hub/models--runwayml--stable-diffusion-v1-5/snapshots/39593d5650112b4cc580433f6b0435385882d819/tokenizer",
    vocab_size=49408,
    model_max_length=77,
    is_fast=False,
    padding_side="right",
    truncation_side="right",
    special_tokens={
        "bos_token": AddedToken("<|startoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True),
        "eos_token": AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True),
        "unk_token": AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True),
        "pad_token": "<|endoftext|>",
    },
    clean_up_tokenization_spaces=True
)
```

Every pipeline expects a [`model_index.json`](https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5/blob/main/model_index.json) file that tells the [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline):

* which pipeline class to load from `_class_name`
* which version of 🧨 Diffusers was used to create the model in `_diffusers_version`
* what components from which library are stored in the subfolders (`name` corresponds to the component and subfolder name, `library` corresponds to the name of the library to load the class from, and `class` corresponds to the class name)

```json
{
  "_class_name": "StableDiffusionPipeline",
  "_diffusers_version": "0.6.0",
  "feature_extractor": [
    "transformers",
    "CLIPImageProcessor"
  ],
  "safety_checker": [
    "stable_diffusion",
    "StableDiffusionSafetyChecker"
  ],
  "scheduler": [
    "diffusers",
    "PNDMScheduler"
  ],
  "text_encoder": [
    "transformers",
    "CLIPTextModel"
  ],
  "tokenizer": [
    "transformers",
    "CLIPTokenizer"
  ],
  "unet": [
    "diffusers",
    "UNet2DConditionModel"
  ],
  "vae": [
    "diffusers",
    "AutoencoderKL"
  ]
}
```

docs/diffusers/Using Diffusers for CogVideoX.md ADDED
@@ -0,0 +1,683 @@

CogVideoX
=========

![LoRA](https://img.shields.io/badge/LoRA-d8b4fe?style=flat)

[CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer](https://arxiv.org/abs/2408.06072) from Tsinghua University & ZhipuAI, by Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, Jie Tang.

The abstract from the paper is:

_We introduce CogVideoX, a large-scale diffusion transformer model designed for generating videos based on text prompts. To efficiently model video data, we propose to leverage a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions. To improve the text-video alignment, we propose an expert transformer with the expert adaptive LayerNorm to facilitate the deep fusion between the two modalities. By employing a progressive training technique, CogVideoX is adept at producing coherent, long-duration videos characterized by significant motion. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method. It significantly helps enhance the performance of CogVideoX, improving both generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across both multiple machine metrics and human evaluations. The model weight of CogVideoX-2B is publicly available at [https://github.com/THUDM/CogVideo](https://github.com/THUDM/CogVideo)._

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The original codebase can be found [here](https://huggingface.co/THUDM). The original weights can be found under [hf.co/THUDM](https://huggingface.co/THUDM).

There are three official CogVideoX checkpoints for text-to-video and video-to-video.

| checkpoints | recommended inference dtype |
|---|---|
| [`THUDM/CogVideoX-2b`](https://huggingface.co/THUDM/CogVideoX-2b) | torch.float16 |
| [`THUDM/CogVideoX-5b`](https://huggingface.co/THUDM/CogVideoX-5b) | torch.bfloat16 |
| [`THUDM/CogVideoX1.5-5b`](https://huggingface.co/THUDM/CogVideoX1.5-5b) | torch.bfloat16 |

There are two official CogVideoX checkpoints available for image-to-video.

| checkpoints | recommended inference dtype |
|---|---|
| [`THUDM/CogVideoX-5b-I2V`](https://huggingface.co/THUDM/CogVideoX-5b-I2V) | torch.bfloat16 |
| [`THUDM/CogVideoX-1.5-5b-I2V`](https://huggingface.co/THUDM/CogVideoX-1.5-5b-I2V) | torch.bfloat16 |

For the CogVideoX 1.5 series:

* Text-to-video (T2V) works best at a resolution of 1360x768 because it was trained with that specific resolution.
* Image-to-video (I2V) works for multiple resolutions. The width can vary from 768 to 1360, but the height must be 768. The height/width must be divisible by 16.
* Both T2V and I2V models support generation with 81 and 161 frames and work best at these values. Exporting videos at 16 FPS is recommended (a short sketch applying these settings follows this list).
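
As a concrete illustration of those recommendations, the sketch below (an illustrative example, not an official snippet; the input image path is a placeholder) runs the CogVideoX 1.5 image-to-video checkpoint listed above at a height of 768, a width of 1360, and 81 frames, then exports the result at 16 FPS:

```py
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX-1.5-5b-I2V", torch_dtype=torch.bfloat16
).to("cuda")

image = load_image("path/to/your/image.png")  # placeholder: use your own conditioning image
prompt = "A scenic landscape slowly coming to life with drifting clouds"

video = pipe(
    prompt=prompt,
    image=image,
    height=768,     # CogVideoX 1.5 I2V expects a height of 768
    width=1360,     # width can range from 768 to 1360 and must be divisible by 16
    num_frames=81,  # one of the recommended frame counts
).frames[0]
export_to_video(video, "output.mp4", fps=16)
```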

There are two official CogVideoX checkpoints that support pose controllable generation (by the [Alibaba-PAI](https://huggingface.co/alibaba-pai) team).

| checkpoints | recommended inference dtype |
|---|---|
| [`alibaba-pai/CogVideoX-Fun-V1.1-2b-Pose`](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-2b-Pose) | torch.bfloat16 |
| [`alibaba-pai/CogVideoX-Fun-V1.1-5b-Pose`](https://huggingface.co/alibaba-pai/CogVideoX-Fun-V1.1-5b-Pose) | torch.bfloat16 |

Inference
---------

Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.

First, load the pipeline:

```py
import torch
from diffusers import CogVideoXPipeline, CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b").to("cuda")  # or "THUDM/CogVideoX-2b"
```

If you are using the image-to-video pipeline, load it as follows:

```py
pipe = CogVideoXImageToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b-I2V").to("cuda")
```

Then change the memory layout of the pipeline’s `transformer` component to `torch.channels_last`:

```py
pipe.transformer.to(memory_format=torch.channels_last)
```

Compile the components and run inference:

```py
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

# CogVideoX works well with long and well-described prompts
prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
```

The [T2V benchmark](https://gist.github.com/a-r-r-o-w/5183d75e452a368fd17448fcc810bd3f) results on an 80GB A100 machine are:

```
Without torch.compile(): Average inference time: 96.89 seconds.
With torch.compile(): Average inference time: 76.27 seconds.
```

### Memory optimization

CogVideoX-2b requires about 19 GB of GPU memory to decode 49 frames (6 seconds of video at 8 FPS) with output resolution 720x480 (W x H), which makes it not possible to run on consumer GPUs or free-tier T4 Colab. The following memory optimizations could be used to reduce the memory footprint (a combined sketch follows the list). For replication, you can refer to [this](https://gist.github.com/a-r-r-o-w/3959a03f15be5c9bd1fe545b09dfcc93) script.

* `pipe.enable_model_cpu_offload()`:
  * Without enabling cpu offloading, memory usage is `33 GB`
  * With enabling cpu offloading, memory usage is `19 GB`
* `pipe.enable_sequential_cpu_offload()`:
  * Similar to `enable_model_cpu_offload` but can significantly reduce memory usage at the cost of slow inference
  * When enabled, memory usage is under `4 GB`
* `pipe.vae.enable_tiling()`:
  * With enabling cpu offloading and tiling, memory usage is `11 GB`
* `pipe.vae.enable_slicing()`
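
Putting those options together, a minimal sketch (illustrative only; choose the offloading strategy that matches your hardware) looks like this:

```py
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)

# Offload submodules to the CPU when they are not in use; swap in
# enable_sequential_cpu_offload() for an even smaller footprint at the cost of much slower inference
pipe.enable_model_cpu_offload()

# Decode the latents in tiles and slices to reduce peak VAE memory
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()

prompt = "A panda playing an acoustic guitar in a serene bamboo forest"
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "panda.mp4", fps=8)
```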

Quantization
------------

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.

Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [CogVideoXPipeline](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.CogVideoXPipeline) for inference with bitsandbytes.

```py
import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, CogVideoXTransformer3DModel, CogVideoXPipeline
from diffusers.utils import export_to_video
from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel

quant_config = BitsAndBytesConfig(load_in_8bit=True)
text_encoder_8bit = T5EncoderModel.from_pretrained(
    "THUDM/CogVideoX-2b",
    subfolder="text_encoder",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX-2b",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    text_encoder=text_encoder_8bit,
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting."
video = pipeline(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "ship.mp4", fps=8)
```
166
+
167
+ [](#diffusers.CogVideoXPipeline)CogVideoXPipeline
168
+ -------------------------------------------------
169
+
170
+ ### class diffusers.CogVideoXPipeline
171
+
172
+ [](#diffusers.CogVideoXPipeline)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox.py#L147)
173
+
174
+ ( tokenizer: T5Tokenizertext\_encoder: T5EncoderModelvae: AutoencoderKLCogVideoXtransformer: CogVideoXTransformer3DModelscheduler: typing.Union\[diffusers.schedulers.scheduling\_ddim\_cogvideox.CogVideoXDDIMScheduler, diffusers.schedulers.scheduling\_dpm\_cogvideox.CogVideoXDPMScheduler\] )
175
+
176
+ Parameters
177
+
178
+ * [](#diffusers.CogVideoXPipeline.vae)**vae** ([AutoencoderKL](/docs/diffusers/main/en/api/models/autoencoderkl#diffusers.AutoencoderKL)) β€” Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
179
+ * [](#diffusers.CogVideoXPipeline.text_encoder)**text\_encoder** (`T5EncoderModel`) β€” Frozen text-encoder. CogVideoX uses [T5](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5EncoderModel); specifically the [t5-v1\_1-xxl](https://huggingface.co/PixArt-alpha/PixArt-alpha/tree/main/t5-v1_1-xxl) variant.
180
+ * [](#diffusers.CogVideoXPipeline.tokenizer)**tokenizer** (`T5Tokenizer`) β€” Tokenizer of class [T5Tokenizer](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Tokenizer).
181
+ * [](#diffusers.CogVideoXPipeline.transformer)**transformer** ([CogVideoXTransformer3DModel](/docs/diffusers/main/en/api/models/cogvideox_transformer3d#diffusers.CogVideoXTransformer3DModel)) β€” A text conditioned `CogVideoXTransformer3DModel` to denoise the encoded video latents.
182
+ * [](#diffusers.CogVideoXPipeline.scheduler)**scheduler** ([SchedulerMixin](/docs/diffusers/main/en/api/schedulers/overview#diffusers.SchedulerMixin)) β€” A scheduler to be used in combination with `transformer` to denoise the encoded video latents.
183
+
184
+ Pipeline for text-to-video generation using CogVideoX.
185
+
186
+ This model inherits from [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
187
+
188
+ #### \_\_call\_\_
189
+
190
+ [](#diffusers.CogVideoXPipeline.__call__)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox.py#L505)
191
+
192
+ ( prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, negative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, height: typing.Optional\[int\] = None, width: typing.Optional\[int\] = None, num\_frames: typing.Optional\[int\] = None, num\_inference\_steps: int = 50, timesteps: typing.Optional\[typing.List\[int\]\] = None, guidance\_scale: float = 6, use\_dynamic\_cfg: bool = False, num\_videos\_per\_prompt: int = 1, eta: float = 0.0, generator: typing.Union\[torch.\_C.Generator, typing.List\[torch.\_C.Generator\], NoneType\] = None, latents: typing.Optional\[torch.FloatTensor\] = None, prompt\_embeds: typing.Optional\[torch.FloatTensor\] = None, negative\_prompt\_embeds: typing.Optional\[torch.FloatTensor\] = None, output\_type: str = 'pil', return\_dict: bool = True, attention\_kwargs: typing.Optional\[typing.Dict\[str, typing.Any\]\] = None, callback\_on\_step\_end: typing.Union\[typing.Callable\[\[int, int, typing.Dict\], NoneType\], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType\] = None, callback\_on\_step\_end\_tensor\_inputs: typing.List\[str\] = \['latents'\], max\_sequence\_length: int = 226 ) → [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) or `tuple`
193
+
194
+
196
+ Parameters
197
+
198
+ * [](#diffusers.CogVideoXPipeline.__call__.prompt)**prompt** (`str` or `List[str]`, _optional_) — The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds` instead.
199
+ * [](#diffusers.CogVideoXPipeline.__call__.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
200
+ * [](#diffusers.CogVideoXPipeline.__call__.height)**height** (`int`, _optional_, defaults to self.transformer.config.sample\_height \* self.vae\_scale\_factor\_spatial) β€” The height in pixels of the generated image. This is set to 480 by default for the best results.
201
+ * [](#diffusers.CogVideoXPipeline.__call__.width)**width** (`int`, _optional_, defaults to self.transformer.config.sample\_width \* self.vae\_scale\_factor\_spatial) — The width in pixels of the generated image. This is set to 720 by default for the best results.
202
+ * [](#diffusers.CogVideoXPipeline.__call__.num_frames)**num\_frames** (`int`, defaults to `48`) β€” Number of frames to generate. Must be divisible by self.vae\_scale\_factor\_temporal. Generated video will contain 1 extra frame because CogVideoX is conditioned with (num\_seconds \* fps + 1) frames where num\_seconds is 6 and fps is 8. However, since videos can be saved at any fps, the only condition that needs to be satisfied is that of divisibility mentioned above.
203
+ * [](#diffusers.CogVideoXPipeline.__call__.num_inference_steps)**num\_inference\_steps** (`int`, _optional_, defaults to 50) β€” The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
204
+ * [](#diffusers.CogVideoXPipeline.__call__.timesteps)**timesteps** (`List[int]`, _optional_) β€” Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed will be used. Must be in descending order.
205
+ * [](#diffusers.CogVideoXPipeline.__call__.guidance_scale)**guidance\_scale** (`float`, _optional_, defaults to 6) — Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). `guidance_scale` is defined as `w` of equation 2 of the [Imagen Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages generating images that are closely linked to the text `prompt`, usually at the expense of lower image quality.
206
+ * [](#diffusers.CogVideoXPipeline.__call__.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” The number of videos to generate per prompt.
207
+ * [](#diffusers.CogVideoXPipeline.__call__.generator)**generator** (`torch.Generator` or `List[torch.Generator]`, _optional_) β€” One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation deterministic.
208
+ * [](#diffusers.CogVideoXPipeline.__call__.latents)**latents** (`torch.FloatTensor`, _optional_) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random `generator`.
209
+ * [](#diffusers.CogVideoXPipeline.__call__.prompt_embeds)**prompt\_embeds** (`torch.FloatTensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
210
+ * [](#diffusers.CogVideoXPipeline.__call__.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.FloatTensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
211
+ * [](#diffusers.CogVideoXPipeline.__call__.output_type)**output\_type** (`str`, _optional_, defaults to `"pil"`) — The output format of the generated image. Choose between [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
212
+ * [](#diffusers.CogVideoXPipeline.__call__.return_dict)**return\_dict** (`bool`, _optional_, defaults to `True`) — Whether or not to return a [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) instead of a plain tuple.
213
+ * [](#diffusers.CogVideoXPipeline.__call__.attention_kwargs)**attention\_kwargs** (`dict`, _optional_) β€” A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under `self.processor` in [diffusers.models.attention\_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
214
+ * [](#diffusers.CogVideoXPipeline.__call__.callback_on_step_end)**callback\_on\_step\_end** (`Callable`, _optional_) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
215
+ * [](#diffusers.CogVideoXPipeline.__call__.callback_on_step_end_tensor_inputs)**callback\_on\_step\_end\_tensor\_inputs** (`List`, _optional_) β€” The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
216
+ * [](#diffusers.CogVideoXPipeline.__call__.max_sequence_length)**max\_sequence\_length** (`int`, defaults to `226`) β€” Maximum sequence length in encoded prompt. Must be consistent with `self.transformer.config.max_text_seq_length` otherwise may lead to poor results.
217
+
218
+ Returns
219
+
220
+
222
+ [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) or `tuple`
223
+
224
+
226
+ [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) if `return_dict` is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated images.
227
+
228
+ Function invoked when calling the pipeline for generation.
229
+
230
+ [](#diffusers.CogVideoXPipeline.__call__.example)
231
+
232
+ Examples:
233
+
234
+ Copied
235
+
236
+ \>>> import torch
237
+ \>>> from diffusers import CogVideoXPipeline
238
+ \>>> from diffusers.utils import export\_to\_video
239
+
240
+ \>>> \# Models: "THUDM/CogVideoX-2b" or "THUDM/CogVideoX-5b"
241
+ \>>> pipe = CogVideoXPipeline.from\_pretrained("THUDM/CogVideoX-2b", torch\_dtype=torch.float16).to("cuda")
242
+ \>>> prompt = (
243
+ ... "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
244
+ ... "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
245
+ ... "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
246
+ ... "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
247
+ ... "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
248
+ ... "atmosphere of this unique musical performance."
249
+ ... )
250
+ \>>> video = pipe(prompt=prompt, guidance\_scale=6, num\_inference\_steps=50).frames\[0\]
251
+ \>>> export\_to\_video(video, "output.mp4", fps=8)
252
+
253
+ #### encode\_prompt
254
+
255
+ [](#diffusers.CogVideoXPipeline.encode_prompt)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox.py#L244)
256
+
257
+ ( prompt: typing.Union\[str, typing.List\[str\]\], negative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, do\_classifier\_free\_guidance: bool = True, num\_videos\_per\_prompt: int = 1, prompt\_embeds: typing.Optional\[torch.Tensor\] = None, negative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = None, max\_sequence\_length: int = 226, device: typing.Optional\[torch.device\] = None, dtype: typing.Optional\[torch.dtype\] = None )
258
+
259
+ Parameters
260
+
261
+ * [](#diffusers.CogVideoXPipeline.encode_prompt.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” prompt to be encoded
262
+ * [](#diffusers.CogVideoXPipeline.encode_prompt.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
263
+ * [](#diffusers.CogVideoXPipeline.encode_prompt.do_classifier_free_guidance)**do\_classifier\_free\_guidance** (`bool`, _optional_, defaults to `True`) β€” Whether to use classifier free guidance or not.
264
+ * [](#diffusers.CogVideoXPipeline.encode_prompt.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) — Number of videos that should be generated per prompt.
265
+ * [](#diffusers.CogVideoXPipeline.encode_prompt.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
266
+ * [](#diffusers.CogVideoXPipeline.encode_prompt.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
267
+ * [](#diffusers.CogVideoXPipeline.encode_prompt.device)**device** β€” (`torch.device`, _optional_): torch device
268
+ * [](#diffusers.CogVideoXPipeline.encode_prompt.dtype)**dtype** β€” (`torch.dtype`, _optional_): torch dtype
269
+
270
+ Encodes the prompt into text encoder hidden states.
271
+
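+ As a practical sketch, the embeddings can be pre-computed once and reused across several calls by passing them through `prompt_embeds` and `negative_prompt_embeds`. This assumes `encode_prompt()` returns the positive and negative embeddings as a tuple, as in other Diffusers pipelines; the prompt text is only illustrative.
+
+ Copied
+
+ import torch
+ from diffusers import CogVideoXPipeline
+
+ pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16).to("cuda")
+
+ # Encode the prompt once; classifier-free guidance needs both embeddings.
+ prompt_embeds, negative_prompt_embeds = pipe.encode_prompt(
+     prompt="A panda playing a tiny guitar in a bamboo forest",
+     negative_prompt="",
+     do_classifier_free_guidance=True,
+     num_videos_per_prompt=1,
+ )
+
+ # Reuse the cached embeddings for multiple generations with different seeds.
+ for seed in (0, 1):
+     video = pipe(
+         prompt_embeds=prompt_embeds,
+         negative_prompt_embeds=negative_prompt_embeds,
+         generator=torch.Generator(device="cuda").manual_seed(seed),
+     ).frames[0]
+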
272
+ #### fuse\_qkv\_projections
273
+
274
+ [](#diffusers.CogVideoXPipeline.fuse_qkv_projections)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox.py#L428)
275
+
276
+ ( )
277
+
278
+ Enables fused QKV projections.
279
+
280
+ #### unfuse\_qkv\_projections
281
+
282
+ [](#diffusers.CogVideoXPipeline.unfuse_qkv_projections)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox.py#L433)
283
+
284
+ ( )
285
+
286
+ Disable QKV projection fusion if enabled.
287
+
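+ A minimal sketch of toggling fusion around a generation is shown below. Whether fused projections actually speed things up depends on the attention backend and hardware, so treat it as an option to benchmark rather than a guaranteed optimization; the prompt is only illustrative.
+
+ Copied
+
+ import torch
+ from diffusers import CogVideoXPipeline
+
+ pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16).to("cuda")
+
+ # Fuse the query/key/value projections before running inference ...
+ pipe.fuse_qkv_projections()
+ video = pipe(prompt="A paper boat drifting down a rainy street", num_inference_steps=50).frames[0]
+
+ # ... and restore the original, unfused projections afterwards.
+ pipe.unfuse_qkv_projections()
+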
288
+ [](#diffusers.CogVideoXImageToVideoPipeline)CogVideoXImageToVideoPipeline
289
+ -------------------------------------------------------------------------
290
+
291
+ ### class diffusers.CogVideoXImageToVideoPipeline
292
+
293
+ [](#diffusers.CogVideoXImageToVideoPipeline)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py#L164)
294
+
295
+ ( tokenizer: T5Tokenizer, text\_encoder: T5EncoderModel, vae: AutoencoderKLCogVideoX, transformer: CogVideoXTransformer3DModel, scheduler: typing.Union\[diffusers.schedulers.scheduling\_ddim\_cogvideox.CogVideoXDDIMScheduler, diffusers.schedulers.scheduling\_dpm\_cogvideox.CogVideoXDPMScheduler\] )
296
+
297
+ Parameters
298
+
299
+ * [](#diffusers.CogVideoXImageToVideoPipeline.vae)**vae** ([AutoencoderKL](/docs/diffusers/main/en/api/models/autoencoderkl#diffusers.AutoencoderKL)) β€” Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
300
+ * [](#diffusers.CogVideoXImageToVideoPipeline.text_encoder)**text\_encoder** (`T5EncoderModel`) β€” Frozen text-encoder. CogVideoX uses [T5](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5EncoderModel); specifically the [t5-v1\_1-xxl](https://huggingface.co/PixArt-alpha/PixArt-alpha/tree/main/t5-v1_1-xxl) variant.
301
+ * [](#diffusers.CogVideoXImageToVideoPipeline.tokenizer)**tokenizer** (`T5Tokenizer`) β€” Tokenizer of class [T5Tokenizer](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Tokenizer).
302
+ * [](#diffusers.CogVideoXImageToVideoPipeline.transformer)**transformer** ([CogVideoXTransformer3DModel](/docs/diffusers/main/en/api/models/cogvideox_transformer3d#diffusers.CogVideoXTransformer3DModel)) β€” A text conditioned `CogVideoXTransformer3DModel` to denoise the encoded video latents.
303
+ * [](#diffusers.CogVideoXImageToVideoPipeline.scheduler)**scheduler** ([SchedulerMixin](/docs/diffusers/main/en/api/schedulers/overview#diffusers.SchedulerMixin)) β€” A scheduler to be used in combination with `transformer` to denoise the encoded video latents.
304
+
305
+ Pipeline for image-to-video generation using CogVideoX.
306
+
307
+ This model inherits from [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
308
+
309
+ #### \_\_call\_\_
310
+
311
+ [](#diffusers.CogVideoXImageToVideoPipeline.__call__)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py#L602)
312
+
313
+ ( image: typing.Union\[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List\[PIL.Image.Image\], typing.List\[numpy.ndarray\], typing.List\[torch.Tensor\]\], prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, negative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, height: typing.Optional\[int\] = None, width: typing.Optional\[int\] = None, num\_frames: int = 49, num\_inference\_steps: int = 50, timesteps: typing.Optional\[typing.List\[int\]\] = None, guidance\_scale: float = 6, use\_dynamic\_cfg: bool = False, num\_videos\_per\_prompt: int = 1, eta: float = 0.0, generator: typing.Union\[torch.\_C.Generator, typing.List\[torch.\_C.Generator\], NoneType\] = None, latents: typing.Optional\[torch.FloatTensor\] = None, prompt\_embeds: typing.Optional\[torch.FloatTensor\] = None, negative\_prompt\_embeds: typing.Optional\[torch.FloatTensor\] = None, output\_type: str = 'pil', return\_dict: bool = True, attention\_kwargs: typing.Optional\[typing.Dict\[str, typing.Any\]\] = None, callback\_on\_step\_end: typing.Union\[typing.Callable\[\[int, int, typing.Dict\], NoneType\], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType\] = None, callback\_on\_step\_end\_tensor\_inputs: typing.List\[str\] = \['latents'\], max\_sequence\_length: int = 226 ) → [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) or `tuple`
314
+
315
+
317
+ Parameters
318
+
319
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.image)**image** (`PipelineImageInput`) β€” The input image to condition the generation on. Must be an image, a list of images or a `torch.Tensor`.
320
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.prompt)**prompt** (`str` or `List[str]`, _optional_) — The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds` instead.
321
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
322
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.height)**height** (`int`, _optional_, defaults to self.transformer.config.sample\_height \* self.vae\_scale\_factor\_spatial) β€” The height in pixels of the generated image. This is set to 480 by default for the best results.
323
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.width)**width** (`int`, _optional_, defaults to self.transformer.config.sample\_width \* self.vae\_scale\_factor\_spatial) — The width in pixels of the generated image. This is set to 720 by default for the best results.
324
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.num_frames)**num\_frames** (`int`, defaults to `49`) — Number of frames to generate. Must be divisible by self.vae\_scale\_factor\_temporal. Generated video will contain 1 extra frame because CogVideoX is conditioned with (num\_seconds \* fps + 1) frames where num\_seconds is 6 and fps is 8. However, since videos can be saved at any fps, the only condition that needs to be satisfied is that of divisibility mentioned above.
325
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.num_inference_steps)**num\_inference\_steps** (`int`, _optional_, defaults to 50) β€” The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
326
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.timesteps)**timesteps** (`List[int]`, _optional_) β€” Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed will be used. Must be in descending order.
327
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.guidance_scale)**guidance\_scale** (`float`, _optional_, defaults to 6) — Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). `guidance_scale` is defined as `w` of equation 2 of the [Imagen Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages generating images that are closely linked to the text `prompt`, usually at the expense of lower image quality.
328
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” The number of videos to generate per prompt.
329
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.generator)**generator** (`torch.Generator` or `List[torch.Generator]`, _optional_) β€” One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation deterministic.
330
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.latents)**latents** (`torch.FloatTensor`, _optional_) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random `generator`.
331
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.prompt_embeds)**prompt\_embeds** (`torch.FloatTensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
332
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.FloatTensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
333
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.output_type)**output\_type** (`str`, _optional_, defaults to `"pil"`) — The output format of the generated image. Choose between [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
334
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.return_dict)**return\_dict** (`bool`, _optional_, defaults to `True`) — Whether or not to return a [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) instead of a plain tuple.
335
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.attention_kwargs)**attention\_kwargs** (`dict`, _optional_) β€” A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under `self.processor` in [diffusers.models.attention\_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
336
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.callback_on_step_end)**callback\_on\_step\_end** (`Callable`, _optional_) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
337
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.callback_on_step_end_tensor_inputs)**callback\_on\_step\_end\_tensor\_inputs** (`List`, _optional_) β€” The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
338
+ * [](#diffusers.CogVideoXImageToVideoPipeline.__call__.max_sequence_length)**max\_sequence\_length** (`int`, defaults to `226`) β€” Maximum sequence length in encoded prompt. Must be consistent with `self.transformer.config.max_text_seq_length` otherwise may lead to poor results.
339
+
340
+ Returns
341
+
342
+
344
+ [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) or `tuple`
345
+
346
+
348
+ [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) if `return_dict` is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated images.
349
+
350
+ Function invoked when calling the pipeline for generation.
351
+
352
+ [](#diffusers.CogVideoXImageToVideoPipeline.__call__.example)
353
+
354
+ Examples:
355
+
356
+ Copied
357
+
358
+ \>>> import torch
359
+ \>>> from diffusers import CogVideoXImageToVideoPipeline
360
+ \>>> from diffusers.utils import export\_to\_video, load\_image
361
+
362
+ \>>> pipe = CogVideoXImageToVideoPipeline.from\_pretrained("THUDM/CogVideoX-5b-I2V", torch\_dtype=torch.bfloat16)
363
+ \>>> pipe.to("cuda")
364
+
365
+ \>>> prompt = "An astronaut hatching from an egg, on the surface of the moon, the darkness and depth of space realised in the background. High quality, ultrarealistic detail and breath-taking movie-like camera shot."
366
+ \>>> image = load\_image(
367
+ ... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
368
+ ... )
369
+ \>>> video = pipe(image, prompt, use\_dynamic\_cfg=True)
370
+ \>>> export\_to\_video(video.frames\[0\], "output.mp4", fps=8)
371
+
372
+ #### encode\_prompt
373
+
374
+ [](#diffusers.CogVideoXImageToVideoPipeline.encode_prompt)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py#L267)
375
+
376
+ ( prompt: typing.Union\[str, typing.List\[str\]\], negative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, do\_classifier\_free\_guidance: bool = True, num\_videos\_per\_prompt: int = 1, prompt\_embeds: typing.Optional\[torch.Tensor\] = None, negative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = None, max\_sequence\_length: int = 226, device: typing.Optional\[torch.device\] = None, dtype: typing.Optional\[torch.dtype\] = None )
377
+
378
+ Parameters
379
+
380
+ * [](#diffusers.CogVideoXImageToVideoPipeline.encode_prompt.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” prompt to be encoded
381
+ * [](#diffusers.CogVideoXImageToVideoPipeline.encode_prompt.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
382
+ * [](#diffusers.CogVideoXImageToVideoPipeline.encode_prompt.do_classifier_free_guidance)**do\_classifier\_free\_guidance** (`bool`, _optional_, defaults to `True`) β€” Whether to use classifier free guidance or not.
383
+ * [](#diffusers.CogVideoXImageToVideoPipeline.encode_prompt.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) — Number of videos that should be generated per prompt.
384
+ * [](#diffusers.CogVideoXImageToVideoPipeline.encode_prompt.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
385
+ * [](#diffusers.CogVideoXImageToVideoPipeline.encode_prompt.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
386
+ * [](#diffusers.CogVideoXImageToVideoPipeline.encode_prompt.device)**device** β€” (`torch.device`, _optional_): torch device
387
+ * [](#diffusers.CogVideoXImageToVideoPipeline.encode_prompt.dtype)**dtype** β€” (`torch.dtype`, _optional_): torch dtype
388
+
389
+ Encodes the prompt into text encoder hidden states.
390
+
391
+ #### fuse\_qkv\_projections
392
+
393
+ [](#diffusers.CogVideoXImageToVideoPipeline.fuse_qkv_projections)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py#L523)
394
+
395
+ ( )
396
+
397
+ Enables fused QKV projections.
398
+
399
+ #### unfuse\_qkv\_projections
400
+
401
+ [](#diffusers.CogVideoXImageToVideoPipeline.unfuse_qkv_projections)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py#L529)
402
+
403
+ ( )
404
+
405
+ Disable QKV projection fusion if enabled.
406
+
407
+ [](#diffusers.CogVideoXVideoToVideoPipeline)CogVideoXVideoToVideoPipeline
408
+ -------------------------------------------------------------------------
409
+
410
+ ### class diffusers.CogVideoXVideoToVideoPipeline
411
+
412
+ [](#diffusers.CogVideoXVideoToVideoPipeline)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_video2video.py#L169)
413
+
414
+ ( tokenizer: T5Tokenizer, text\_encoder: T5EncoderModel, vae: AutoencoderKLCogVideoX, transformer: CogVideoXTransformer3DModel, scheduler: typing.Union\[diffusers.schedulers.scheduling\_ddim\_cogvideox.CogVideoXDDIMScheduler, diffusers.schedulers.scheduling\_dpm\_cogvideox.CogVideoXDPMScheduler\] )
415
+
416
+ Parameters
417
+
418
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.vae)**vae** ([AutoencoderKL](/docs/diffusers/main/en/api/models/autoencoderkl#diffusers.AutoencoderKL)) β€” Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
419
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.text_encoder)**text\_encoder** (`T5EncoderModel`) β€” Frozen text-encoder. CogVideoX uses [T5](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5EncoderModel); specifically the [t5-v1\_1-xxl](https://huggingface.co/PixArt-alpha/PixArt-alpha/tree/main/t5-v1_1-xxl) variant.
420
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.tokenizer)**tokenizer** (`T5Tokenizer`) β€” Tokenizer of class [T5Tokenizer](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Tokenizer).
421
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.transformer)**transformer** ([CogVideoXTransformer3DModel](/docs/diffusers/main/en/api/models/cogvideox_transformer3d#diffusers.CogVideoXTransformer3DModel)) β€” A text conditioned `CogVideoXTransformer3DModel` to denoise the encoded video latents.
422
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.scheduler)**scheduler** ([SchedulerMixin](/docs/diffusers/main/en/api/schedulers/overview#diffusers.SchedulerMixin)) β€” A scheduler to be used in combination with `transformer` to denoise the encoded video latents.
423
+
424
+ Pipeline for video-to-video generation using CogVideoX.
425
+
426
+ This model inherits from [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
427
+
428
+ #### \_\_call\_\_
429
+
430
+ [](#diffusers.CogVideoXVideoToVideoPipeline.__call__)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_video2video.py#L575)
431
+
432
+ ( video: typing.List\[PIL.Image.Image\] = None, prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, negative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, height: typing.Optional\[int\] = None, width: typing.Optional\[int\] = None, num\_inference\_steps: int = 50, timesteps: typing.Optional\[typing.List\[int\]\] = None, strength: float = 0.8, guidance\_scale: float = 6, use\_dynamic\_cfg: bool = False, num\_videos\_per\_prompt: int = 1, eta: float = 0.0, generator: typing.Union\[torch.\_C.Generator, typing.List\[torch.\_C.Generator\], NoneType\] = None, latents: typing.Optional\[torch.FloatTensor\] = None, prompt\_embeds: typing.Optional\[torch.FloatTensor\] = None, negative\_prompt\_embeds: typing.Optional\[torch.FloatTensor\] = None, output\_type: str = 'pil', return\_dict: bool = True, attention\_kwargs: typing.Optional\[typing.Dict\[str, typing.Any\]\] = None, callback\_on\_step\_end: typing.Union\[typing.Callable\[\[int, int, typing.Dict\], NoneType\], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType\] = None, callback\_on\_step\_end\_tensor\_inputs: typing.List\[str\] = \['latents'\], max\_sequence\_length: int = 226 ) → [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) or `tuple`
433
+
434
+
436
+ Parameters
437
+
438
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.video)**video** (`List[PIL.Image.Image]`) β€” The input video to condition the generation on. Must be a list of images/frames of the video.
439
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.prompt)**prompt** (`str` or `List[str]`, _optional_) — The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds` instead.
440
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
441
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.height)**height** (`int`, _optional_, defaults to self.transformer.config.sample\_height \* self.vae\_scale\_factor\_spatial) β€” The height in pixels of the generated image. This is set to 480 by default for the best results.
442
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.width)**width** (`int`, _optional_, defaults to self.transformer.config.sample\_width \* self.vae\_scale\_factor\_spatial) — The width in pixels of the generated image. This is set to 720 by default for the best results.
443
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.num_inference_steps)**num\_inference\_steps** (`int`, _optional_, defaults to 50) β€” The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
444
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.timesteps)**timesteps** (`List[int]`, _optional_) β€” Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed will be used. Must be in descending order.
445
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.strength)**strength** (`float`, _optional_, defaults to 0.8) β€” Higher strength leads to more differences between original video and generated video.
446
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.guidance_scale)**guidance\_scale** (`float`, _optional_, defaults to 6) — Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). `guidance_scale` is defined as `w` of equation 2 of the [Imagen Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages generating images that are closely linked to the text `prompt`, usually at the expense of lower image quality.
447
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” The number of videos to generate per prompt.
448
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.generator)**generator** (`torch.Generator` or `List[torch.Generator]`, _optional_) β€” One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation deterministic.
449
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.latents)**latents** (`torch.FloatTensor`, _optional_) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random `generator`.
450
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.prompt_embeds)**prompt\_embeds** (`torch.FloatTensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
451
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.FloatTensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
452
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.output_type)**output\_type** (`str`, _optional_, defaults to `"pil"`) — The output format of the generated image. Choose between [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
453
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.return_dict)**return\_dict** (`bool`, _optional_, defaults to `True`) — Whether or not to return a [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) instead of a plain tuple.
454
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.attention_kwargs)**attention\_kwargs** (`dict`, _optional_) β€” A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under `self.processor` in [diffusers.models.attention\_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
455
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.callback_on_step_end)**callback\_on\_step\_end** (`Callable`, _optional_) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
456
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.callback_on_step_end_tensor_inputs)**callback\_on\_step\_end\_tensor\_inputs** (`List`, _optional_) β€” The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
457
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.max_sequence_length)**max\_sequence\_length** (`int`, defaults to `226`) β€” Maximum sequence length in encoded prompt. Must be consistent with `self.transformer.config.max_text_seq_length` otherwise may lead to poor results.
458
+
459
+ Returns
460
+
461
+
463
+ [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) or `tuple`
464
+
465
+
467
+ [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) if `return_dict` is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated images.
468
+
469
+ Function invoked when calling the pipeline for generation.
470
+
471
+ [](#diffusers.CogVideoXVideoToVideoPipeline.__call__.example)
472
+
473
+ Examples:
474
+
475
+ Copied
476
+
477
+ \>>> import torch
478
+ \>>> from diffusers import CogVideoXDPMScheduler, CogVideoXVideoToVideoPipeline
479
+ \>>> from diffusers.utils import export\_to\_video, load\_video
480
+
481
+ \>>> \# Models: "THUDM/CogVideoX-2b" or "THUDM/CogVideoX-5b"
482
+ \>>> pipe = CogVideoXVideoToVideoPipeline.from\_pretrained("THUDM/CogVideoX-5b", torch\_dtype=torch.bfloat16)
483
+ \>>> pipe.to("cuda")
484
+ \>>> pipe.scheduler = CogVideoXDPMScheduler.from\_config(pipe.scheduler.config)
485
+
486
+ \>>> input\_video = load\_video(
487
+ ... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/hiker.mp4"
488
+ ... )
489
+ \>>> prompt = (
490
+ ... "An astronaut stands triumphantly at the peak of a towering mountain. Panorama of rugged peaks and "
491
+ ... "valleys. Very futuristic vibe and animated aesthetic. Highlights of purple and golden colors in "
492
+ ... "the scene. The sky is looks like an animated/cartoonish dream of galaxies, nebulae, stars, planets, "
493
+ ... "moons, but the remainder of the scene is mostly realistic."
494
+ ... )
495
+
496
+ \>>> video = pipe(
497
+ ... video=input\_video, prompt=prompt, strength=0.8, guidance\_scale=6, num\_inference\_steps=50
498
+ ... ).frames\[0\]
499
+ \>>> export\_to\_video(video, "output.mp4", fps=8)
500
+
501
+ #### encode\_prompt
502
+
503
+ [](#diffusers.CogVideoXVideoToVideoPipeline.encode_prompt)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_video2video.py#L269)
504
+
505
+ ( prompt: typing.Union\[str, typing.List\[str\]\], negative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, do\_classifier\_free\_guidance: bool = True, num\_videos\_per\_prompt: int = 1, prompt\_embeds: typing.Optional\[torch.Tensor\] = None, negative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = None, max\_sequence\_length: int = 226, device: typing.Optional\[torch.device\] = None, dtype: typing.Optional\[torch.dtype\] = None )
506
+
507
+ Parameters
508
+
509
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.encode_prompt.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” prompt to be encoded
510
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.encode_prompt.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
511
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.encode_prompt.do_classifier_free_guidance)**do\_classifier\_free\_guidance** (`bool`, _optional_, defaults to `True`) β€” Whether to use classifier free guidance or not.
512
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.encode_prompt.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) — Number of videos that should be generated per prompt.
513
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.encode_prompt.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
514
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.encode_prompt.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
515
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.encode_prompt.device)**device** β€” (`torch.device`, _optional_): torch device
516
+ * [](#diffusers.CogVideoXVideoToVideoPipeline.encode_prompt.dtype)**dtype** β€” (`torch.dtype`, _optional_): torch dtype
517
+
518
+ Encodes the prompt into text encoder hidden states.
519
+
520
+ #### fuse\_qkv\_projections
521
+
522
+ [](#diffusers.CogVideoXVideoToVideoPipeline.fuse_qkv_projections)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_video2video.py#L496)
523
+
524
+ ( )
525
+
526
+ Enables fused QKV projections.
527
+
528
+ #### unfuse\_qkv\_projections
529
+
530
+ [](#diffusers.CogVideoXVideoToVideoPipeline.unfuse_qkv_projections)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_video2video.py#L502)
531
+
532
+ ( )
533
+
534
+ Disable QKV projection fusion if enabled.
535
+
536
+ [](#diffusers.CogVideoXFunControlPipeline)CogVideoXFunControlPipeline
537
+ ---------------------------------------------------------------------
538
+
539
+ ### class diffusers.CogVideoXFunControlPipeline
540
+
541
+ [](#diffusers.CogVideoXFunControlPipeline)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_fun_control.py#L154)
542
+
543
+ ( tokenizer: T5Tokenizer, text\_encoder: T5EncoderModel, vae: AutoencoderKLCogVideoX, transformer: CogVideoXTransformer3DModel, scheduler: KarrasDiffusionSchedulers )
544
+
545
+ Parameters
546
+
547
+ * [](#diffusers.CogVideoXFunControlPipeline.vae)**vae** ([AutoencoderKL](/docs/diffusers/main/en/api/models/autoencoderkl#diffusers.AutoencoderKL)) β€” Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
548
+ * [](#diffusers.CogVideoXFunControlPipeline.text_encoder)**text\_encoder** (`T5EncoderModel`) β€” Frozen text-encoder. CogVideoX uses [T5](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5EncoderModel); specifically the [t5-v1\_1-xxl](https://huggingface.co/PixArt-alpha/PixArt-alpha/tree/main/t5-v1_1-xxl) variant.
549
+ * [](#diffusers.CogVideoXFunControlPipeline.tokenizer)**tokenizer** (`T5Tokenizer`) β€” Tokenizer of class [T5Tokenizer](https://huggingface.co/docs/transformers/model_doc/t5#transformers.T5Tokenizer).
550
+ * [](#diffusers.CogVideoXFunControlPipeline.transformer)**transformer** ([CogVideoXTransformer3DModel](/docs/diffusers/main/en/api/models/cogvideox_transformer3d#diffusers.CogVideoXTransformer3DModel)) β€” A text conditioned `CogVideoXTransformer3DModel` to denoise the encoded video latents.
551
+ * [](#diffusers.CogVideoXFunControlPipeline.scheduler)**scheduler** ([SchedulerMixin](/docs/diffusers/main/en/api/schedulers/overview#diffusers.SchedulerMixin)) β€” A scheduler to be used in combination with `transformer` to denoise the encoded video latents.
552
+
553
+ Pipeline for controlled text-to-video generation using CogVideoX Fun.
554
+
555
+ This model inherits from [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
556
+
557
+ #### \_\_call\_\_
558
+
559
+ [](#diffusers.CogVideoXFunControlPipeline.__call__)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_fun_control.py#L551)
560
+
561
+ ( prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, negative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, control\_video: typing.Optional\[typing.List\[PIL.Image.Image\]\] = None, height: typing.Optional\[int\] = None, width: typing.Optional\[int\] = None, num\_inference\_steps: int = 50, timesteps: typing.Optional\[typing.List\[int\]\] = None, guidance\_scale: float = 6, use\_dynamic\_cfg: bool = False, num\_videos\_per\_prompt: int = 1, eta: float = 0.0, generator: typing.Union\[torch.\_C.Generator, typing.List\[torch.\_C.Generator\], NoneType\] = None, latents: typing.Optional\[torch.Tensor\] = None, control\_video\_latents: typing.Optional\[torch.Tensor\] = None, prompt\_embeds: typing.Optional\[torch.Tensor\] = None, negative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = None, output\_type: str = 'pil', return\_dict: bool = True, attention\_kwargs: typing.Optional\[typing.Dict\[str, typing.Any\]\] = None, callback\_on\_step\_end: typing.Union\[typing.Callable\[\[int, int, typing.Dict\], NoneType\], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType\] = None, callback\_on\_step\_end\_tensor\_inputs: typing.List\[str\] = \['latents'\], max\_sequence\_length: int = 226 ) → [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) or `tuple`
562
+
563
+
565
+ Parameters
566
+
567
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.prompt)**prompt** (`str` or `List[str]`, _optional_) — The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds` instead.
568
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
569
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.control_video)**control\_video** (`List[PIL.Image.Image]`) β€” The control video to condition the generation on. Must be a list of images/frames of the video. If not provided, `control_video_latents` must be provided.
570
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.height)**height** (`int`, _optional_, defaults to self.transformer.config.sample\_height \* self.vae\_scale\_factor\_spatial) β€” The height in pixels of the generated image. This is set to 480 by default for the best results.
571
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.width)**width** (`int`, _optional_, defaults to self.transformer.config.sample\_width \* self.vae\_scale\_factor\_spatial) — The width in pixels of the generated image. This is set to 720 by default for the best results.
572
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.num_inference_steps)**num\_inference\_steps** (`int`, _optional_, defaults to 50) β€” The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
573
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.timesteps)**timesteps** (`List[int]`, _optional_) β€” Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed will be used. Must be in descending order.
574
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.guidance_scale)**guidance\_scale** (`float`, _optional_, defaults to 6.0) — Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). `guidance_scale` is defined as `w` of equation 2 of the [Imagen Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages generating images that are closely linked to the text `prompt`, usually at the expense of lower image quality.
575
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” The number of videos to generate per prompt.
576
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.generator)**generator** (`torch.Generator` or `List[torch.Generator]`, _optional_) β€” One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation deterministic.
577
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.latents)**latents** (`torch.Tensor`, _optional_) — Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random `generator`.
578
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.control_video_latents)**control\_video\_latents** (`torch.Tensor`, _optional_) β€” Pre-generated control latents, sampled from a Gaussian distribution, to be used as inputs for controlled video generation. If not provided, `control_video` must be provided.
579
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
580
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
581
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.output_type)**output\_type** (`str`, _optional_, defaults to `"pil"`) — The output format of the generated image. Choose between [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
582
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.return_dict)**return\_dict** (`bool`, _optional_, defaults to `True`) — Whether or not to return a [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) instead of a plain tuple.
583
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.attention_kwargs)**attention\_kwargs** (`dict`, _optional_) β€” A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under `self.processor` in [diffusers.models.attention\_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
584
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.callback_on_step_end)**callback\_on\_step\_end** (`Callable`, _optional_) — A function that is called at the end of each denoising step during inference. The function is called with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
585
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.callback_on_step_end_tensor_inputs)**callback\_on\_step\_end\_tensor\_inputs** (`List`, _optional_) β€” The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
586
+ * [](#diffusers.CogVideoXFunControlPipeline.__call__.max_sequence_length)**max\_sequence\_length** (`int`, defaults to `226`) β€” Maximum sequence length in encoded prompt. Must be consistent with `self.transformer.config.max_text_seq_length` otherwise may lead to poor results.
587
+
588
+ Returns
589
+
590
+
592
+ [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) or `tuple`
593
+
594
+
596
+ [CogVideoXPipelineOutput](/docs/diffusers/main/en/api/pipelines/cogvideox#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput) if `return_dict` is True, otherwise a `tuple`. When returning a tuple, the first element is a list with the generated images.
597
+
598
+ Function invoked when calling the pipeline for generation.
599
+
600
+ [](#diffusers.CogVideoXFunControlPipeline.__call__.example)
601
+
602
+ Examples:
603
+
604
+ Copied
605
+
606
+ \>>> import torch
607
+ \>>> from diffusers import CogVideoXFunControlPipeline, DDIMScheduler
608
+ \>>> from diffusers.utils import export\_to\_video, load\_video
609
+
610
+ \>>> pipe = CogVideoXFunControlPipeline.from\_pretrained(
611
+ ... "alibaba-pai/CogVideoX-Fun-V1.1-5b-Pose", torch\_dtype=torch.bfloat16
612
+ ... )
613
+ \>>> pipe.scheduler = DDIMScheduler.from\_config(pipe.scheduler.config)
614
+ \>>> pipe.to("cuda")
615
+
616
+ \>>> control\_video = load\_video(
617
+ ... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/hiker.mp4"
618
+ ... )
619
+ \>>> prompt = (
620
+ ... "An astronaut stands triumphantly at the peak of a towering mountain. Panorama of rugged peaks and "
621
+ ... "valleys. Very futuristic vibe and animated aesthetic. Highlights of purple and golden colors in "
622
+ ... "the scene. The sky is looks like an animated/cartoonish dream of galaxies, nebulae, stars, planets, "
623
+ ... "moons, but the remainder of the scene is mostly realistic."
624
+ ... )
625
+
626
+ \>>> video = pipe(prompt=prompt, control\_video=control\_video).frames\[0\]
627
+ \>>> export\_to\_video(video, "output.mp4", fps=8)
628
+
629
+ #### encode\_prompt
630
+
631
+ [](#diffusers.CogVideoXFunControlPipeline.encode_prompt)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_fun_control.py#L253)
632
+
633
+ ( prompt: typing.Union\[str, typing.List\[str\]\]negative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = Nonedo\_classifier\_free\_guidance: bool = Truenum\_videos\_per\_prompt: int = 1prompt\_embeds: typing.Optional\[torch.Tensor\] = Nonenegative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = Nonemax\_sequence\_length: int = 226device: typing.Optional\[torch.device\] = Nonedtype: typing.Optional\[torch.dtype\] = None )
634
+
635
+ Parameters
636
+
637
+ * [](#diffusers.CogVideoXFunControlPipeline.encode_prompt.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” prompt to be encoded
638
+ * [](#diffusers.CogVideoXFunControlPipeline.encode_prompt.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
639
+ * [](#diffusers.CogVideoXFunControlPipeline.encode_prompt.do_classifier_free_guidance)**do\_classifier\_free\_guidance** (`bool`, _optional_, defaults to `True`) β€” Whether to use classifier free guidance or not.
640
+ * [](#diffusers.CogVideoXFunControlPipeline.encode_prompt.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” Number of videos that should be generated per prompt.
641
+ * [](#diffusers.CogVideoXFunControlPipeline.encode_prompt.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
642
+ * [](#diffusers.CogVideoXFunControlPipeline.encode_prompt.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
643
+ * [](#diffusers.CogVideoXFunControlPipeline.encode_prompt.device)**device** β€” (`torch.device`, _optional_): torch device
644
+ * [](#diffusers.CogVideoXFunControlPipeline.encode_prompt.dtype)**dtype** β€” (`torch.dtype`, _optional_): torch dtype
645
+
646
+ Encodes the prompt into text encoder hidden states.
647
+
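+ For illustration, here is a minimal sketch of precomputing text embeddings with `encode_prompt()` and passing them back to the pipeline call. It assumes the method returns the positive and negative embeddings in that order, and the prompt strings are only placeholders.
+ 
+ Copied
+ 
+ import torch
+ from diffusers import CogVideoXFunControlPipeline
+ 
+ pipe = CogVideoXFunControlPipeline.from_pretrained(
+     "alibaba-pai/CogVideoX-Fun-V1.1-5b-Pose", torch_dtype=torch.bfloat16
+ ).to("cuda")
+ 
+ # Precompute the embeddings once (assumed return order: positive, negative).
+ prompt_embeds, negative_prompt_embeds = pipe.encode_prompt(
+     prompt="An astronaut hiking over rugged mountain peaks.",
+     negative_prompt="blurry, low quality",
+     do_classifier_free_guidance=True,
+     num_videos_per_prompt=1,
+ )
+ 
+ # Reuse the embeddings across calls instead of re-encoding the prompt,
+ # together with a control video as in the example above:
+ # video = pipe(
+ #     prompt_embeds=prompt_embeds,
+ #     negative_prompt_embeds=negative_prompt_embeds,
+ #     control_video=control_video,
+ # ).frames[0]
+ 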
648
+ #### fuse\_qkv\_projections
649
+
650
+ [](#diffusers.CogVideoXFunControlPipeline.fuse_qkv_projections)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_fun_control.py#L473)
651
+
652
+ ( )
653
+
654
+ Enables fused QKV projections.
655
+
656
+ #### unfuse\_qkv\_projections
657
+
658
+ [](#diffusers.CogVideoXFunControlPipeline.unfuse_qkv_projections)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_cogvideox_fun_control.py#L478)
659
+
660
+ ( )
661
+
662
+ Disable QKV projection fusion if enabled.
663
+
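+ As a minimal sketch of how these two methods can wrap inference (reusing the checkpoint and control video from the example above; the prompt string is only a placeholder), fusion is enabled before sampling and disabled afterwards:
+ 
+ Copied
+ 
+ import torch
+ from diffusers import CogVideoXFunControlPipeline
+ from diffusers.utils import export_to_video, load_video
+ 
+ pipe = CogVideoXFunControlPipeline.from_pretrained(
+     "alibaba-pai/CogVideoX-Fun-V1.1-5b-Pose", torch_dtype=torch.bfloat16
+ ).to("cuda")
+ control_video = load_video(
+     "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/hiker.mp4"
+ )
+ 
+ # Fuse the attention QKV projections before running inference ...
+ pipe.fuse_qkv_projections()
+ video = pipe(prompt="An astronaut hiking over rugged mountain peaks.", control_video=control_video).frames[0]
+ # ... and unfuse them again if the pipeline will be reused without fusion.
+ pipe.unfuse_qkv_projections()
+ 
+ export_to_video(video, "output.mp4", fps=8)
+ 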
664
+ [](#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput)CogVideoXPipelineOutput
665
+ ------------------------------------------------------------------------------------------------
666
+
667
+ ### class diffusers.pipelines.cogvideo.pipeline\_output.CogVideoXPipelineOutput
668
+
669
+ [](#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/cogvideo/pipeline_output.py#L8)
670
+
671
+ ( frames: Tensor )
672
+
673
+ Parameters
674
+
675
+ * [](#diffusers.pipelines.cogvideo.pipeline_output.CogVideoXPipelineOutput.frames)**frames** (`torch.Tensor`, `np.ndarray`, or List\[List\[PIL.Image.Image\]\]) β€” List of video outputs. It can be a nested list of length `batch_size`, with each sub-list containing denoised PIL image sequences of length `num_frames`. It can also be a NumPy array or Torch tensor of shape `(batch_size, num_frames, channels, height, width)`.
676
+
677
+ Output class for CogVideo pipelines.
678
+
679
docs/diffusers/Using Diffusers for HunyuanVideo.md ADDED
@@ -0,0 +1,232 @@
1
+ [](#hunyuanvideo)HunyuanVideo
2
+ =============================
3
+
4
+ ![LoRA](https://img.shields.io/badge/LoRA-d8b4fe?style=flat)
5
+
6
+ [HunyuanVideo](https://www.arxiv.org/abs/2412.03603) by Tencent.
7
+
8
+ _Recent advancements in video generation have significantly impacted daily life for both individuals and industries. However, the leading video generation models remain closed-source, resulting in a notable performance gap between industry capabilities and those available to the public. In this report, we introduce HunyuanVideo, an innovative open-source video foundation model that demonstrates performance in video generation comparable to, or even surpassing, that of leading closed-source models. HunyuanVideo encompasses a comprehensive framework that integrates several key elements, including data curation, advanced architectural design, progressive model scaling and training, and an efficient infrastructure tailored for large-scale model training and inference. As a result, we successfully trained a video generative model with over 13 billion parameters, making it the largest among all open-source models. We conducted extensive experiments and implemented a series of targeted designs to ensure high visual quality, motion dynamics, text-video alignment, and advanced filming techniques. According to evaluations by professionals, HunyuanVideo outperforms previous state-of-the-art models, including Runway Gen-3, Luma 1.6, and three top-performing Chinese video generative models. By releasing the code for the foundation model and its applications, we aim to bridge the gap between closed-source and open-source communities. This initiative will empower individuals within the community to experiment with their ideas, fostering a more dynamic and vibrant video generation ecosystem. The code is publicly available at [this https URL](https://github.com/tencent/HunyuanVideo)._
9
+
10
+ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
11
+
12
+ Recommendations for inference:
13
+
14
+ * Both text encoders should be in `torch.float16`.
15
+ * Transformer should be in `torch.bfloat16`.
16
+ * VAE should be in `torch.float16`.
17
+ * `num_frames` should be of the form `4 * k + 1`, for example `49` or `129`.
18
+ * For smaller resolution videos, try lower values of `shift` (between `2.0` and `5.0`) in the [Scheduler](https://huggingface.co/docs/diffusers/main/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler.shift). For larger resolution videos, try higher values (between `7.0` and `12.0`). The default value is `7.0` for HunyuanVideo; one way to apply these recommendations is shown in the sketch after this list.
19
+ * For more information about supported resolutions and other details, please refer to the original repository [here](https://github.com/Tencent/HunyuanVideo/).
20
+
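+ A minimal sketch of one way to apply these recommendations; the checkpoint is the community mirror used elsewhere on this page, and the `shift=5.0` value is only an illustrative choice for a smaller resolution:
+ 
+ Copied
+ 
+ import torch
+ from diffusers import (
+     FlowMatchEulerDiscreteScheduler,
+     HunyuanVideoPipeline,
+     HunyuanVideoTransformer3DModel,
+ )
+ from diffusers.utils import export_to_video
+ 
+ model_id = "hunyuanvideo-community/HunyuanVideo"
+ 
+ # Transformer in bfloat16, remaining components (text encoders, VAE) in float16.
+ transformer = HunyuanVideoTransformer3DModel.from_pretrained(
+     model_id, subfolder="transformer", torch_dtype=torch.bfloat16
+ )
+ pipe = HunyuanVideoPipeline.from_pretrained(
+     model_id, transformer=transformer, torch_dtype=torch.float16
+ )
+ 
+ # Lower shift for a smaller resolution (illustrative value; the default is 7.0).
+ pipe.scheduler = FlowMatchEulerDiscreteScheduler.from_config(pipe.scheduler.config, shift=5.0)
+ 
+ pipe.vae.enable_tiling()
+ pipe.to("cuda")
+ 
+ # num_frames follows the 4 * k + 1 rule (k = 12 gives 49 frames).
+ video = pipe(
+     prompt="A cat walks on the grass, realistic style.",
+     height=320,
+     width=512,
+     num_frames=49,
+     num_inference_steps=30,
+ ).frames[0]
+ export_to_video(video, "cat.mp4", fps=15)
+ 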
21
+ [](#available-models)Available models
22
+ -------------------------------------
23
+
24
+ The following models are available for the [`HunyuanVideoPipeline`](text-to-video) pipeline:
25
+
26
+ | Model name | Description |
+ | --- | --- |
+ | [`hunyuanvideo-community/HunyuanVideo`](https://huggingface.co/hunyuanvideo-community/HunyuanVideo) | Official HunyuanVideo (guidance-distilled). Performs best at multiple resolutions and frame counts, with `guidance_scale=6.0`, `true_cfg_scale=1.0` and without a negative prompt. |
+ | [`Skywork/SkyReels-V1-Hunyuan-T2V`](https://huggingface.co/Skywork/SkyReels-V1-Hunyuan-T2V) | Skywork's custom finetune of HunyuanVideo (de-distilled). Performs best at `97x544x960` resolution with `guidance_scale=1.0`, `true_cfg_scale=6.0` and a negative prompt. |
37
+
38
+ The following models are available for the image-to-video pipeline:
39
+
40
+ | Model name | Description |
+ | --- | --- |
+ | [`Skywork/SkyReels-V1-Hunyuan-I2V`](https://huggingface.co/Skywork/SkyReels-V1-Hunyuan-I2V) | Skywork's custom finetune of HunyuanVideo (de-distilled). Performs best at `97x544x960` resolution with `guidance_scale=1.0`, `true_cfg_scale=6.0` and a negative prompt. |
+ | [`hunyuanvideo-community/HunyuanVideo-I2V`](https://huggingface.co/hunyuanvideo-community/HunyuanVideo-I2V) | Tencent's official HunyuanVideo I2V model. Performs best at resolutions of 480, 720, 960, 1280. A higher `shift` value when initializing the scheduler is recommended (good values are between 7 and 20). |
51
+
52
+ [](#quantization)Quantization
53
+ -----------------------------
54
+
55
+ Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.
56
+
57
+ Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [HunyuanVideoPipeline](/docs/diffusers/main/en/api/pipelines/hunyuan_video#diffusers.HunyuanVideoPipeline) for inference with bitsandbytes.
58
+
59
+ Copied
60
+
61
+ import torch
62
+ from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, HunyuanVideoTransformer3DModel, HunyuanVideoPipeline
63
+ from diffusers.utils import export\_to\_video
64
+
65
+ quant\_config = DiffusersBitsAndBytesConfig(load\_in\_8bit=True)
66
+ transformer\_8bit = HunyuanVideoTransformer3DModel.from\_pretrained(
67
+ "hunyuanvideo-community/HunyuanVideo",
68
+ subfolder="transformer",
69
+ quantization\_config=quant\_config,
70
+ torch\_dtype=torch.bfloat16,
71
+ )
72
+
73
+ pipeline = HunyuanVideoPipeline.from\_pretrained(
74
+ "hunyuanvideo-community/HunyuanVideo",
75
+ transformer=transformer\_8bit,
76
+ torch\_dtype=torch.float16,
77
+ device\_map="balanced",
78
+ )
79
+
80
+ prompt = "A cat walks on the grass, realistic style."
81
+ video = pipeline(prompt=prompt, num\_frames=61, num\_inference\_steps=30).frames\[0\]
82
+ export\_to\_video(video, "cat.mp4", fps=15)
83
+
84
+ [](#diffusers.HunyuanVideoPipeline)HunyuanVideoPipeline
85
+ -------------------------------------------------------
86
+
87
+ ### class diffusers.HunyuanVideoPipeline
88
+
89
+ [](#diffusers.HunyuanVideoPipeline)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video.py#L144)
90
+
91
+ ( text\_encoder: LlamaModeltokenizer: LlamaTokenizerFasttransformer: HunyuanVideoTransformer3DModelvae: AutoencoderKLHunyuanVideoscheduler: FlowMatchEulerDiscreteSchedulertext\_encoder\_2: CLIPTextModeltokenizer\_2: CLIPTokenizer )
92
+
93
+ Parameters
94
+
95
+ * [](#diffusers.HunyuanVideoPipeline.text_encoder)**text\_encoder** (`LlamaModel`) β€” [Llava Llama3-8B](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers).
96
+ * [](#diffusers.HunyuanVideoPipeline.tokenizer)**tokenizer** (`LlamaTokenizer`) β€” Tokenizer from [Llava Llama3-8B](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers).
97
+ * [](#diffusers.HunyuanVideoPipeline.transformer)**transformer** ([HunyuanVideoTransformer3DModel](/docs/diffusers/main/en/api/models/hunyuan_video_transformer_3d#diffusers.HunyuanVideoTransformer3DModel)) β€” Conditional Transformer to denoise the encoded image latents.
98
+ * [](#diffusers.HunyuanVideoPipeline.scheduler)**scheduler** ([FlowMatchEulerDiscreteScheduler](/docs/diffusers/main/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler)) β€” A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
99
+ * [](#diffusers.HunyuanVideoPipeline.vae)**vae** ([AutoencoderKLHunyuanVideo](/docs/diffusers/main/en/api/models/autoencoder_kl_hunyuan_video#diffusers.AutoencoderKLHunyuanVideo)) β€” Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
100
+ * [](#diffusers.HunyuanVideoPipeline.text_encoder_2)**text\_encoder\_2** (`CLIPTextModel`) β€” [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant.
101
+ * [](#diffusers.HunyuanVideoPipeline.tokenizer_2)**tokenizer\_2** (`CLIPTokenizer`) β€” Tokenizer of class [CLIPTokenizer](https://huggingface.co/docs/transformers/en/model_doc/clip#transformers.CLIPTokenizer).
102
+
103
+ Pipeline for text-to-video generation using HunyuanVideo.
104
+
105
+ This model inherits from [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
106
+
107
+ #### \_\_call\_\_
108
+
109
+ [](#diffusers.HunyuanVideoPipeline.__call__)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video.py#L467)
110
+
111
+ ( prompt: typing.Union\[str, typing.List\[str\]\] = Noneprompt\_2: typing.Union\[str, typing.List\[str\]\] = Nonenegative\_prompt: typing.Union\[str, typing.List\[str\]\] = Nonenegative\_prompt\_2: typing.Union\[str, typing.List\[str\]\] = Noneheight: int = 720width: int = 1280num\_frames: int = 129num\_inference\_steps: int = 50sigmas: typing.List\[float\] = Nonetrue\_cfg\_scale: float = 1.0guidance\_scale: float = 6.0num\_videos\_per\_prompt: typing.Optional\[int\] = 1generator: typing.Union\[torch.\_C.Generator, typing.List\[torch.\_C.Generator\], NoneType\] = Nonelatents: typing.Optional\[torch.Tensor\] = Noneprompt\_embeds: typing.Optional\[torch.Tensor\] = Nonepooled\_prompt\_embeds: typing.Optional\[torch.Tensor\] = Noneprompt\_attention\_mask: typing.Optional\[torch.Tensor\] = Nonenegative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = Nonenegative\_pooled\_prompt\_embeds: typing.Optional\[torch.Tensor\] = Nonenegative\_prompt\_attention\_mask: typing.Optional\[torch.Tensor\] = Noneoutput\_type: typing.Optional\[str\] = 'pil'return\_dict: bool = Trueattention\_kwargs: typing.Optional\[typing.Dict\[str, typing.Any\]\] = Nonecallback\_on\_step\_end: typing.Union\[typing.Callable\[\[int, int, typing.Dict\], NoneType\], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType\] = Nonecallback\_on\_step\_end\_tensor\_inputs: typing.List\[str\] = \['latents'\]prompt\_template: typing.Dict\[str, typing.Any\] = {'template': '<|start\_header\_id|>system<|end\_header\_id|>\\n\\nDescribe the video by detailing the following aspects: 1. The main content and theme of the video.2. The color, shape, size, texture, quantity, text, and spatial relationships of the objects.3. Actions, events, behaviors temporal relationships, physical movement changes of the objects.4. background environment, light, style and atmosphere.5. camera angles, movements, and transitions used in the video:<|eot\_id|><|start\_header\_id|>user<|end\_header\_id|>\\n\\n{}<|eot\_id|>', 'crop\_start': 95}max\_sequence\_length: int = 256 ) β†’ export const metadata = 'undefined';`~HunyuanVideoPipelineOutput` or `tuple`
112
+
113
+
115
+ Parameters
116
+
117
+ * [](#diffusers.HunyuanVideoPipeline.__call__.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts to guide the video generation. If not defined, one has to pass `prompt_embeds` instead.
118
+ * [](#diffusers.HunyuanVideoPipeline.__call__.prompt_2)**prompt\_2** (`str` or `List[str]`, _optional_) β€” The prompt or prompts to be sent to `tokenizer_2` and `text_encoder_2`. If not defined, `prompt` will be used instead.
119
+ * [](#diffusers.HunyuanVideoPipeline.__call__.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `true_cfg_scale` is not greater than `1`).
120
+ * [](#diffusers.HunyuanVideoPipeline.__call__.negative_prompt_2)**negative\_prompt\_2** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation to be sent to `tokenizer_2` and `text_encoder_2`. If not defined, `negative_prompt` is used in all the text-encoders.
121
+ * [](#diffusers.HunyuanVideoPipeline.__call__.height)**height** (`int`, defaults to `720`) β€” The height in pixels of the generated image.
122
+ * [](#diffusers.HunyuanVideoPipeline.__call__.width)**width** (`int`, defaults to `1280`) β€” The width in pixels of the generated image.
123
+ * [](#diffusers.HunyuanVideoPipeline.__call__.num_frames)**num\_frames** (`int`, defaults to `129`) β€” The number of frames in the generated video.
124
+ * [](#diffusers.HunyuanVideoPipeline.__call__.num_inference_steps)**num\_inference\_steps** (`int`, defaults to `50`) β€” The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
125
+ * [](#diffusers.HunyuanVideoPipeline.__call__.sigmas)**sigmas** (`List[float]`, _optional_) β€” Custom sigmas to use for the denoising process with schedulers which support a `sigmas` argument in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed will be used.
126
+ * [](#diffusers.HunyuanVideoPipeline.__call__.true_cfg_scale)**true\_cfg\_scale** (`float`, _optional_, defaults to 1.0) β€” When greater than `1.0` and a `negative_prompt` is provided, true classifier-free guidance is enabled.
127
+ * [](#diffusers.HunyuanVideoPipeline.__call__.guidance_scale)**guidance\_scale** (`float`, defaults to `6.0`) β€” Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). `guidance_scale` is defined as `w` in equation 2 of the [Imagen Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are closely linked to the text `prompt`, usually at the expense of lower image quality. Note that the only available HunyuanVideo model is CFG-distilled, which means that traditional guidance between unconditional and conditional latents is not applied.
128
+ * [](#diffusers.HunyuanVideoPipeline.__call__.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” The number of images to generate per prompt.
129
+ * [](#diffusers.HunyuanVideoPipeline.__call__.generator)**generator** (`torch.Generator` or `List[torch.Generator]`, _optional_) β€” A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation deterministic.
130
+ * [](#diffusers.HunyuanVideoPipeline.__call__.latents)**latents** (`torch.Tensor`, _optional_) β€” Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random `generator`.
131
+ * [](#diffusers.HunyuanVideoPipeline.__call__.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the `prompt` input argument.
132
+ * [](#diffusers.HunyuanVideoPipeline.__call__.pooled_prompt_embeds)**pooled\_prompt\_embeds** (`torch.FloatTensor`, _optional_) β€” Pre-generated pooled text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, pooled text embeddings will be generated from `prompt` input argument.
133
+ * [](#diffusers.HunyuanVideoPipeline.__call__.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.FloatTensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
134
+ * [](#diffusers.HunyuanVideoPipeline.__call__.negative_pooled_prompt_embeds)**negative\_pooled\_prompt\_embeds** (`torch.FloatTensor`, _optional_) β€” Pre-generated negative pooled text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, pooled negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
135
+ * [](#diffusers.HunyuanVideoPipeline.__call__.output_type)**output\_type** (`str`, _optional_, defaults to `"pil"`) β€” The output format of the generated image. Choose between `PIL.Image` or `np.array`.
136
+ * [](#diffusers.HunyuanVideoPipeline.__call__.return_dict)**return\_dict** (`bool`, _optional_, defaults to `True`) β€” Whether or not to return a `HunyuanVideoPipelineOutput` instead of a plain tuple.
137
+ * [](#diffusers.HunyuanVideoPipeline.__call__.attention_kwargs)**attention\_kwargs** (`dict`, _optional_) β€” A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under `self.processor` in [diffusers.models.attention\_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
138
+ * [](#diffusers.HunyuanVideoPipeline.__call__.clip_skip)**clip\_skip** (`int`, _optional_) β€” Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that the output of the pre-final layer will be used for computing the prompt embeddings.
139
+ * [](#diffusers.HunyuanVideoPipeline.__call__.callback_on_step_end)**callback\_on\_step\_end** (`Callable`, `PipelineCallback`, `MultiPipelineCallbacks`, _optional_) β€” A function or a subclass of `PipelineCallback` or `MultiPipelineCallbacks` that is called at the end of each denoising step during inference with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
140
+ * [](#diffusers.HunyuanVideoPipeline.__call__.callback_on_step_end_tensor_inputs)**callback\_on\_step\_end\_tensor\_inputs** (`List`, _optional_) β€” The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
141
+
142
+ Returns
143
+
144
+
146
+ `~HunyuanVideoPipelineOutput` or `tuple`
147
+
148
+
150
+ If `return_dict` is `True`, `HunyuanVideoPipelineOutput` is returned, otherwise a `tuple` is returned where the first element is a list with the generated images and the second element is a list of `bool`s indicating whether the corresponding generated image contains β€œnot-safe-for-work” (nsfw) content.
151
+
152
+ The call function to the pipeline for generation.
153
+
154
+ [](#diffusers.HunyuanVideoPipeline.__call__.example)
155
+
156
+ Examples:
157
+
158
+ Copied
159
+
160
+ \>>> import torch
161
+ \>>> from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
162
+ \>>> from diffusers.utils import export\_to\_video
163
+
164
+ \>>> model\_id = "hunyuanvideo-community/HunyuanVideo"
165
+ \>>> transformer = HunyuanVideoTransformer3DModel.from\_pretrained(
166
+ ... model\_id, subfolder="transformer", torch\_dtype=torch.bfloat16
167
+ ... )
168
+ \>>> pipe = HunyuanVideoPipeline.from\_pretrained(model\_id, transformer=transformer, torch\_dtype=torch.float16)
169
+ \>>> pipe.vae.enable\_tiling()
170
+ \>>> pipe.to("cuda")
171
+
172
+ \>>> output = pipe(
173
+ ... prompt="A cat walks on the grass, realistic",
174
+ ... height=320,
175
+ ... width=512,
176
+ ... num\_frames=61,
177
+ ... num\_inference\_steps=30,
178
+ ... ).frames\[0\]
179
+ \>>> export\_to\_video(output, "output.mp4", fps=15)
180
+
181
+ #### disable\_vae\_slicing
182
+
183
+ [](#diffusers.HunyuanVideoPipeline.disable_vae_slicing)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video.py#L425)
184
+
185
+ ( )
186
+
187
+ Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to computing decoding in one step.
188
+
189
+ #### disable\_vae\_tiling
190
+
191
+ [](#diffusers.HunyuanVideoPipeline.disable_vae_tiling)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video.py#L440)
192
+
193
+ ( )
194
+
195
+ Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to computing decoding in one step.
196
+
197
+ #### enable\_vae\_slicing
198
+
199
+ [](#diffusers.HunyuanVideoPipeline.enable_vae_slicing)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video.py#L418)
200
+
201
+ ( )
202
+
203
+ Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
204
+
205
+ #### enable\_vae\_tiling
206
+
207
+ [](#diffusers.HunyuanVideoPipeline.enable_vae_tiling)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/hunyuan_video/pipeline_hunyuan_video.py#L432)
208
+
209
+ ( )
210
+
211
+ Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow processing larger images.
212
+
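+ For example, a rough sketch of toggling these decoder memory optimizations around a generation call (assuming `pipe` is an already loaded `HunyuanVideoPipeline`, as in the example above):
+ 
+ Copied
+ 
+ # Decode latents tile by tile and slice by slice to reduce peak memory.
+ pipe.enable_vae_tiling()
+ pipe.enable_vae_slicing()
+ 
+ video = pipe(
+     prompt="A cat walks on the grass, realistic",
+     height=320,
+     width=512,
+     num_frames=61,
+     num_inference_steps=30,
+ ).frames[0]
+ 
+ # Revert to single-pass decoding when memory is no longer a concern.
+ pipe.disable_vae_tiling()
+ pipe.disable_vae_slicing()
+ 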
213
+ [](#diffusers.pipelines.hunyuan_video.pipeline_output.HunyuanVideoPipelineOutput)HunyuanVideoPipelineOutput
214
+ -----------------------------------------------------------------------------------------------------------
215
+
216
+ ### class diffusers.pipelines.hunyuan\_video.pipeline\_output.HunyuanVideoPipelineOutput
217
+
218
+ [](#diffusers.pipelines.hunyuan_video.pipeline_output.HunyuanVideoPipelineOutput)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/hunyuan_video/pipeline_output.py#L8)
219
+
220
+ ( frames: Tensor )
221
+
222
+ Parameters
223
+
224
+ * [](#diffusers.pipelines.hunyuan_video.pipeline_output.HunyuanVideoPipelineOutput.frames)**frames** (`torch.Tensor`, `np.ndarray`, or List\[List\[PIL.Image.Image\]\]) β€” List of video outputs. It can be a nested list of length `batch_size`, with each sub-list containing denoised PIL image sequences of length `num_frames`. It can also be a NumPy array or Torch tensor of shape `(batch_size, num_frames, channels, height, width)`.
225
+
226
+ Output class for HunyuanVideo pipelines.
227
+
228
docs/diffusers/Using Diffusers for LTX Video.md ADDED
@@ -0,0 +1,421 @@
1
+ [](#ltx-video)LTX Video
2
+ =======================
3
+
4
+ ![LoRA](https://img.shields.io/badge/LoRA-d8b4fe?style=flat)
5
+
6
+ [LTX Video](https://huggingface.co/Lightricks/LTX-Video) is the first DiT-based video generation model capable of generating high-quality videos in real-time. It produces 24 FPS videos at a 768x512 resolution faster than they can be watched. Trained on a large-scale dataset of diverse videos, the model generates high-resolution videos with realistic and varied content. Models are provided for both text-to-video and image + text-to-video use cases.
7
+
8
+ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
9
+
10
+ Available models:
11
+
12
+ | Model name | Recommended dtype |
+ | --- | --- |
+ | [`LTX Video 0.9.0`](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltx-video-2b-v0.9.safetensors) | `torch.bfloat16` |
+ | [`LTX Video 0.9.1`](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltx-video-2b-v0.9.1.safetensors) | `torch.bfloat16` |
23
+
24
+ Note: The recommended dtype is for the transformer component. The VAE and text encoders can be either `torch.float32`, `torch.bfloat16` or `torch.float16` but the recommended dtype is `torch.bfloat16` as used in the original repository.
25
+
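+ For instance, a minimal sketch that loads the full pipeline in the recommended `torch.bfloat16`:
+ 
+ Copied
+ 
+ import torch
+ from diffusers import LTXPipeline
+ 
+ # All components are loaded in bfloat16; the transformer is where the dtype matters most.
+ pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)
+ pipe.to("cuda")
+ 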
26
+ [](#loading-single-files)Loading Single Files
27
+ ---------------------------------------------
28
+
29
+ Loading the original LTX Video checkpoints is also possible with `~ModelMixin.from_single_file`. We recommend using `from_single_file` for the Lightricks series of models, as they plan to release multiple models in the future in the single file format.
30
+
31
+ Copied
32
+
33
+ import torch
34
+ from diffusers import AutoencoderKLLTXVideo, LTXImageToVideoPipeline, LTXVideoTransformer3DModel
35
+
36
+ \# \`single\_file\_url\` could also be https://huggingface.co/Lightricks/LTX-Video/ltx-video-2b-v0.9.1.safetensors
37
+ single\_file\_url = "https://huggingface.co/Lightricks/LTX-Video/ltx-video-2b-v0.9.safetensors"
38
+ transformer = LTXVideoTransformer3DModel.from\_single\_file(
39
+ single\_file\_url, torch\_dtype=torch.bfloat16
40
+ )
41
+ vae = AutoencoderKLLTXVideo.from\_single\_file(single\_file\_url, torch\_dtype=torch.bfloat16)
42
+ pipe = LTXImageToVideoPipeline.from\_pretrained(
43
+ "Lightricks/LTX-Video", transformer=transformer, vae=vae, torch\_dtype=torch.bfloat16
44
+ )
45
+
46
+ \# ... inference code ...
47
+
48
+ Alternatively, the pipeline can be used to load the weights with `~FromSingleFileMixin.from_single_file`.
49
+
50
+ Copied
51
+
52
+ import torch
53
+ from diffusers import LTXImageToVideoPipeline
54
+ from transformers import T5EncoderModel, T5Tokenizer
55
+
56
+ single\_file\_url = "https://huggingface.co/Lightricks/LTX-Video/ltx-video-2b-v0.9.safetensors"
57
+ text\_encoder = T5EncoderModel.from\_pretrained(
58
+ "Lightricks/LTX-Video", subfolder="text\_encoder", torch\_dtype=torch.bfloat16
59
+ )
60
+ tokenizer = T5Tokenizer.from\_pretrained(
61
+ "Lightricks/LTX-Video", subfolder="tokenizer", torch\_dtype=torch.bfloat16
62
+ )
63
+ pipe = LTXImageToVideoPipeline.from\_single\_file(
64
+ single\_file\_url, text\_encoder=text\_encoder, tokenizer=tokenizer, torch\_dtype=torch.bfloat16
65
+ )
66
+
67
+ Loading [LTX GGUF checkpoints](https://huggingface.co/city96/LTX-Video-gguf) is also supported:
68
+
69
+ Copied
70
+
71
+ import torch
72
+ from diffusers.utils import export\_to\_video
73
+ from diffusers import LTXPipeline, LTXVideoTransformer3DModel, GGUFQuantizationConfig
74
+
75
+ ckpt\_path = (
76
+ "https://huggingface.co/city96/LTX-Video-gguf/blob/main/ltx-video-2b-v0.9-Q3\_K\_S.gguf"
77
+ )
78
+ transformer = LTXVideoTransformer3DModel.from\_single\_file(
79
+ ckpt\_path,
80
+ quantization\_config=GGUFQuantizationConfig(compute\_dtype=torch.bfloat16),
81
+ torch\_dtype=torch.bfloat16,
82
+ )
83
+ pipe = LTXPipeline.from\_pretrained(
84
+ "Lightricks/LTX-Video",
85
+ transformer=transformer,
86
+ torch\_dtype=torch.bfloat16,
87
+ )
88
+ pipe.enable\_model\_cpu\_offload()
89
+
90
+ prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
91
+ negative\_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
92
+
93
+ video = pipe(
94
+ prompt=prompt,
95
+ negative\_prompt=negative\_prompt,
96
+ width=704,
97
+ height=480,
98
+ num\_frames=161,
99
+ num\_inference\_steps=50,
100
+ ).frames\[0\]
101
+ export\_to\_video(video, "output\_gguf\_ltx.mp4", fps=24)
102
+
103
+ Make sure to read the [documentation on GGUF](../../quantization/gguf) to learn more about our GGUF support.
104
+
105
+ Loading and running inference with [LTX Video 0.9.1](https://huggingface.co/Lightricks/LTX-Video/blob/main/ltx-video-2b-v0.9.1.safetensors) weights.
106
+
107
+ Copied
108
+
109
+ import torch
110
+ from diffusers import LTXPipeline
111
+ from diffusers.utils import export\_to\_video
112
+
113
+ pipe = LTXPipeline.from\_pretrained("a-r-r-o-w/LTX-Video-0.9.1-diffusers", torch\_dtype=torch.bfloat16)
114
+ pipe.to("cuda")
115
+
116
+ prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
117
+ negative\_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
118
+
119
+ video = pipe(
120
+ prompt=prompt,
121
+ negative\_prompt=negative\_prompt,
122
+ width=768,
123
+ height=512,
124
+ num\_frames=161,
125
+ decode\_timestep=0.03,
126
+ decode\_noise\_scale=0.025,
127
+ num\_inference\_steps=50,
128
+ ).frames\[0\]
129
+ export\_to\_video(video, "output.mp4", fps=24)
130
+
131
+ Refer to [this section](https://huggingface.co/docs/diffusers/main/en/api/pipelines/cogvideox#memory-optimization) to learn more about optimizing memory consumption.
132
+
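+ As a rough sketch, model offloading is the most broadly applicable of those options and can be combined with the LTX Video 0.9.1 settings shown above; the short prompt below is only a placeholder:
+ 
+ Copied
+ 
+ import torch
+ from diffusers import LTXPipeline
+ from diffusers.utils import export_to_video
+ 
+ pipe = LTXPipeline.from_pretrained("a-r-r-o-w/LTX-Video-0.9.1-diffusers", torch_dtype=torch.bfloat16)
+ 
+ # Keep only the currently active component on the GPU.
+ pipe.enable_model_cpu_offload()
+ 
+ video = pipe(
+     prompt="A woman with long brown hair smiles at another woman with long blonde hair.",
+     negative_prompt="worst quality, inconsistent motion, blurry, jittery, distorted",
+     width=768,
+     height=512,
+     num_frames=161,
+     decode_timestep=0.03,
+     decode_noise_scale=0.025,
+     num_inference_steps=50,
+ ).frames[0]
+ export_to_video(video, "output.mp4", fps=24)
+ 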
133
+ [](#quantization)Quantization
134
+ -----------------------------
135
+
136
+ Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.
137
+
138
+ Refer to the [Quantization](../../quantization/overview) overview to learn more about supported quantization backends and selecting a quantization backend that supports your use case. The example below demonstrates how to load a quantized [LTXPipeline](/docs/diffusers/main/en/api/pipelines/ltx_video#diffusers.LTXPipeline) for inference with bitsandbytes.
139
+
140
+ Copied
141
+
142
+ import torch
143
+ from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, LTXVideoTransformer3DModel, LTXPipeline
144
+ from diffusers.utils import export\_to\_video
145
+ from transformers import BitsAndBytesConfig as BitsAndBytesConfig, T5EncoderModel
146
+
147
+ quant\_config = BitsAndBytesConfig(load\_in\_8bit=True)
148
+ text\_encoder\_8bit = T5EncoderModel.from\_pretrained(
149
+ "Lightricks/LTX-Video",
150
+ subfolder="text\_encoder",
151
+ quantization\_config=quant\_config,
152
+ torch\_dtype=torch.float16,
153
+ )
154
+
155
+ quant\_config = DiffusersBitsAndBytesConfig(load\_in\_8bit=True)
156
+ transformer\_8bit = LTXVideoTransformer3DModel.from\_pretrained(
157
+ "Lightricks/LTX-Video",
158
+ subfolder="transformer",
159
+ quantization\_config=quant\_config,
160
+ torch\_dtype=torch.float16,
161
+ )
162
+
163
+ pipeline = LTXPipeline.from\_pretrained(
164
+ "Lightricks/LTX-Video",
165
+ text\_encoder=text\_encoder\_8bit,
166
+ transformer=transformer\_8bit,
167
+ torch\_dtype=torch.float16,
168
+ device\_map="balanced",
169
+ )
170
+
171
+ prompt = "A detailed wooden toy ship with intricately carved masts and sails is seen gliding smoothly over a plush, blue carpet that mimics the waves of the sea. The ship's hull is painted a rich brown, with tiny windows. The carpet, soft and textured, provides a perfect backdrop, resembling an oceanic expanse. Surrounding the ship are various other toys and children's items, hinting at a playful environment. The scene captures the innocence and imagination of childhood, with the toy ship's journey symbolizing endless adventures in a whimsical, indoor setting."
172
+ video = pipeline(prompt=prompt, num\_frames=161, num\_inference\_steps=50).frames\[0\]
173
+ export\_to\_video(video, "ship.mp4", fps=24)
174
+
175
+ [](#diffusers.LTXPipeline)LTXPipeline
176
+ -------------------------------------
177
+
178
+ ### class diffusers.LTXPipeline
179
+
180
+ [](#diffusers.LTXPipeline)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ltx/pipeline_ltx.py#L143)
181
+
182
+ ( scheduler: FlowMatchEulerDiscreteSchedulervae: AutoencoderKLLTXVideotext\_encoder: T5EncoderModeltokenizer: T5TokenizerFasttransformer: LTXVideoTransformer3DModel )
183
+
184
+ Parameters
185
+
186
+ * [](#diffusers.LTXPipeline.transformer)**transformer** ([LTXVideoTransformer3DModel](/docs/diffusers/main/en/api/models/ltx_video_transformer3d#diffusers.LTXVideoTransformer3DModel)) β€” Conditional Transformer architecture to denoise the encoded video latents.
187
+ * [](#diffusers.LTXPipeline.scheduler)**scheduler** ([FlowMatchEulerDiscreteScheduler](/docs/diffusers/main/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler)) β€” A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
188
+ * [](#diffusers.LTXPipeline.vae)**vae** ([AutoencoderKLLTXVideo](/docs/diffusers/main/en/api/models/autoencoderkl_ltx_video#diffusers.AutoencoderKLLTXVideo)) β€” Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
189
+ * [](#diffusers.LTXPipeline.text_encoder)**text\_encoder** (`T5EncoderModel`) β€” [T5](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5EncoderModel), specifically the [google/t5-v1\_1-xxl](https://huggingface.co/google/t5-v1_1-xxl) variant.
190
+ * [](#diffusers.LTXPipeline.tokenizer)**tokenizer** (`CLIPTokenizer`) β€” Tokenizer of class [CLIPTokenizer](https://huggingface.co/docs/transformers/en/model_doc/clip#transformers.CLIPTokenizer).
191
+ * [](#diffusers.LTXPipeline.tokenizer)**tokenizer** (`T5TokenizerFast`) β€” Second Tokenizer of class [T5TokenizerFast](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5TokenizerFast).
192
+
193
+ Pipeline for text-to-video generation.
194
+
195
+ Reference: [https://github.com/Lightricks/LTX-Video](https://github.com/Lightricks/LTX-Video)
196
+
197
+ #### \_\_call\_\_
198
+
199
+ [](#diffusers.LTXPipeline.__call__)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ltx/pipeline_ltx.py#L500)
200
+
201
+ ( prompt: typing.Union\[str, typing.List\[str\]\] = Nonenegative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = Noneheight: int = 512width: int = 704num\_frames: int = 161frame\_rate: int = 25num\_inference\_steps: int = 50timesteps: typing.List\[int\] = Noneguidance\_scale: float = 3num\_videos\_per\_prompt: typing.Optional\[int\] = 1generator: typing.Union\[torch.\_C.Generator, typing.List\[torch.\_C.Generator\], NoneType\] = Nonelatents: typing.Optional\[torch.Tensor\] = Noneprompt\_embeds: typing.Optional\[torch.Tensor\] = Noneprompt\_attention\_mask: typing.Optional\[torch.Tensor\] = Nonenegative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = Nonenegative\_prompt\_attention\_mask: typing.Optional\[torch.Tensor\] = Nonedecode\_timestep: typing.Union\[float, typing.List\[float\]\] = 0.0decode\_noise\_scale: typing.Union\[float, typing.List\[float\], NoneType\] = Noneoutput\_type: typing.Optional\[str\] = 'pil'return\_dict: bool = Trueattention\_kwargs: typing.Optional\[typing.Dict\[str, typing.Any\]\] = Nonecallback\_on\_step\_end: typing.Optional\[typing.Callable\[\[int, int, typing.Dict\], NoneType\]\] = Nonecallback\_on\_step\_end\_tensor\_inputs: typing.List\[str\] = \['latents'\]max\_sequence\_length: int = 128 ) β†’ export const metadata = 'undefined';`~pipelines.ltx.LTXPipelineOutput` or `tuple`
202
+
203
+
205
+ Parameters
206
+
207
+ * [](#diffusers.LTXPipeline.__call__.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts to guide the video generation. If not defined, one has to pass `prompt_embeds` instead.
208
+ * [](#diffusers.LTXPipeline.__call__.height)**height** (`int`, defaults to `512`) β€” The height in pixels of the generated video. The default value is chosen for the best results.
209
+ * [](#diffusers.LTXPipeline.__call__.width)**width** (`int`, defaults to `704`) β€” The width in pixels of the generated video. The default value is chosen for the best results.
210
+ * [](#diffusers.LTXPipeline.__call__.num_frames)**num\_frames** (`int`, defaults to `161`) β€” The number of video frames to generate
211
+ * [](#diffusers.LTXPipeline.__call__.num_inference_steps)**num\_inference\_steps** (`int`, _optional_, defaults to 50) β€” The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
212
+ * [](#diffusers.LTXPipeline.__call__.timesteps)**timesteps** (`List[int]`, _optional_) β€” Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed will be used. Must be in descending order.
213
+ * [](#diffusers.LTXPipeline.__call__.guidance_scale)**guidance\_scale** (`float`, defaults to `3`) β€” Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). `guidance_scale` is defined as `w` in equation 2 of the [Imagen Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are closely linked to the text `prompt`, usually at the expense of lower image quality.
214
+ * [](#diffusers.LTXPipeline.__call__.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” The number of videos to generate per prompt.
215
+ * [](#diffusers.LTXPipeline.__call__.generator)**generator** (`torch.Generator` or `List[torch.Generator]`, _optional_) β€” One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation deterministic.
216
+ * [](#diffusers.LTXPipeline.__call__.latents)**latents** (`torch.Tensor`, _optional_) β€” Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for video generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random `generator`.
217
+ * [](#diffusers.LTXPipeline.__call__.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
218
+ * [](#diffusers.LTXPipeline.__call__.prompt_attention_mask)**prompt\_attention\_mask** (`torch.Tensor`, _optional_) β€” Pre-generated attention mask for text embeddings.
219
+ * [](#diffusers.LTXPipeline.__call__.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.FloatTensor`, _optional_) β€” Pre-generated negative text embeddings. For PixArt-Sigma this negative prompt should be "". If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
220
+ * [](#diffusers.LTXPipeline.__call__.negative_prompt_attention_mask)**negative\_prompt\_attention\_mask** (`torch.FloatTensor`, _optional_) β€” Pre-generated attention mask for negative text embeddings.
221
+ * [](#diffusers.LTXPipeline.__call__.decode_timestep)**decode\_timestep** (`float`, defaults to `0.0`) β€” The timestep at which generated video is decoded.
222
+ * [](#diffusers.LTXPipeline.__call__.decode_noise_scale)**decode\_noise\_scale** (`float`, defaults to `None`) β€” The interpolation factor between random noise and denoised latents at the decode timestep.
223
+ * [](#diffusers.LTXPipeline.__call__.output_type)**output\_type** (`str`, _optional_, defaults to `"pil"`) β€” The output format of the generated video. Choose between [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
224
+ * [](#diffusers.LTXPipeline.__call__.return_dict)**return\_dict** (`bool`, _optional_, defaults to `True`) β€” Whether or not to return a `~pipelines.ltx.LTXPipelineOutput` instead of a plain tuple.
225
+ * [](#diffusers.LTXPipeline.__call__.attention_kwargs)**attention\_kwargs** (`dict`, _optional_) β€” A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under `self.processor` in [diffusers.models.attention\_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
226
+ * [](#diffusers.LTXPipeline.__call__.callback_on_step_end)**callback\_on\_step\_end** (`Callable`, _optional_) β€” A function that is called at the end of each denoising step during inference. The function is called with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
227
+ * [](#diffusers.LTXPipeline.__call__.callback_on_step_end_tensor_inputs)**callback\_on\_step\_end\_tensor\_inputs** (`List`, _optional_) β€” The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
228
+ * [](#diffusers.LTXPipeline.__call__.max_sequence_length)**max\_sequence\_length** (`int` defaults to `128` ) β€” Maximum sequence length to use with the `prompt`.
229
+
230
+ Returns
231
+
232
+
234
+ `~pipelines.ltx.LTXPipelineOutput` or `tuple`
235
+
236
+
238
+ If `return_dict` is `True`, `~pipelines.ltx.LTXPipelineOutput` is returned, otherwise a `tuple` is returned where the first element is a list with the generated images.
239
+
240
+ Function invoked when calling the pipeline for generation.
241
+
242
+ [](#diffusers.LTXPipeline.__call__.example)
243
+
244
+ Examples:
245
+
246
+ Copied
247
+
248
+ \>>> import torch
249
+ \>>> from diffusers import LTXPipeline
250
+ \>>> from diffusers.utils import export\_to\_video
251
+
252
+ \>>> pipe = LTXPipeline.from\_pretrained("Lightricks/LTX-Video", torch\_dtype=torch.bfloat16)
253
+ \>>> pipe.to("cuda")
254
+
255
+ \>>> prompt = "A woman with long brown hair and light skin smiles at another woman with long blonde hair. The woman with brown hair wears a black jacket and has a small, barely noticeable mole on her right cheek. The camera angle is a close-up, focused on the woman with brown hair's face. The lighting is warm and natural, likely from the setting sun, casting a soft glow on the scene. The scene appears to be real-life footage"
256
+ \>>> negative\_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
257
+
258
+ \>>> video = pipe(
259
+ ... prompt=prompt,
260
+ ... negative\_prompt=negative\_prompt,
261
+ ... width=704,
262
+ ... height=480,
263
+ ... num\_frames=161,
264
+ ... num\_inference\_steps=50,
265
+ ... ).frames\[0\]
266
+ \>>> export\_to\_video(video, "output.mp4", fps=24)
267
+
268
+ #### encode\_prompt
269
+
270
+ [](#diffusers.LTXPipeline.encode_prompt)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ltx/pipeline_ltx.py#L256)
271
+
272
+ ( prompt: typing.Union\[str, typing.List\[str\]\]negative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = Nonedo\_classifier\_free\_guidance: bool = Truenum\_videos\_per\_prompt: int = 1prompt\_embeds: typing.Optional\[torch.Tensor\] = Nonenegative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = Noneprompt\_attention\_mask: typing.Optional\[torch.Tensor\] = Nonenegative\_prompt\_attention\_mask: typing.Optional\[torch.Tensor\] = Nonemax\_sequence\_length: int = 128device: typing.Optional\[torch.device\] = Nonedtype: typing.Optional\[torch.dtype\] = None )
273
+
274
+ Parameters
275
+
276
+ * [](#diffusers.LTXPipeline.encode_prompt.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” prompt to be encoded
277
+ * [](#diffusers.LTXPipeline.encode_prompt.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
278
+ * [](#diffusers.LTXPipeline.encode_prompt.do_classifier_free_guidance)**do\_classifier\_free\_guidance** (`bool`, _optional_, defaults to `True`) β€” Whether to use classifier free guidance or not.
279
+ * [](#diffusers.LTXPipeline.encode_prompt.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” Number of videos that should be generated per prompt.
280
+ * [](#diffusers.LTXPipeline.encode_prompt.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
281
+ * [](#diffusers.LTXPipeline.encode_prompt.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
282
+ * [](#diffusers.LTXPipeline.encode_prompt.device)**device** β€” (`torch.device`, _optional_): torch device
283
+ * [](#diffusers.LTXPipeline.encode_prompt.dtype)**dtype** β€” (`torch.dtype`, _optional_): torch dtype
284
+
285
+ Encodes the prompt into text encoder hidden states.
286
+
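+ A minimal sketch of precomputing embeddings with `encode_prompt()` and feeding them back into the pipeline; it assumes the method returns the prompt embeddings and attention masks for the positive and negative prompts, in that order, and the prompt text is only a placeholder:
+ 
+ Copied
+ 
+ import torch
+ from diffusers import LTXPipeline
+ 
+ pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16).to("cuda")
+ 
+ # Assumed return order: positive embeddings and mask, then negative embeddings and mask.
+ (
+     prompt_embeds,
+     prompt_attention_mask,
+     negative_prompt_embeds,
+     negative_prompt_attention_mask,
+ ) = pipe.encode_prompt(
+     prompt="A detailed wooden toy ship gliding over a plush blue carpet.",
+     negative_prompt="worst quality, inconsistent motion, blurry, jittery, distorted",
+ )
+ 
+ video = pipe(
+     prompt_embeds=prompt_embeds,
+     prompt_attention_mask=prompt_attention_mask,
+     negative_prompt_embeds=negative_prompt_embeds,
+     negative_prompt_attention_mask=negative_prompt_attention_mask,
+     width=704,
+     height=480,
+     num_frames=161,
+     num_inference_steps=50,
+ ).frames[0]
+ 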
287
+ [](#diffusers.LTXImageToVideoPipeline)LTXImageToVideoPipeline
288
+ -------------------------------------------------------------
289
+
290
+ ### class diffusers.LTXImageToVideoPipeline
291
+
292
+ [](#diffusers.LTXImageToVideoPipeline)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ltx/pipeline_ltx_image2video.py#L162)
293
+
294
+ ( scheduler: FlowMatchEulerDiscreteSchedulervae: AutoencoderKLLTXVideotext\_encoder: T5EncoderModeltokenizer: T5TokenizerFasttransformer: LTXVideoTransformer3DModel )
295
+
296
+ Parameters
297
+
298
+ * [](#diffusers.LTXImageToVideoPipeline.transformer)**transformer** ([LTXVideoTransformer3DModel](/docs/diffusers/main/en/api/models/ltx_video_transformer3d#diffusers.LTXVideoTransformer3DModel)) β€” Conditional Transformer architecture to denoise the encoded video latents.
299
+ * [](#diffusers.LTXImageToVideoPipeline.scheduler)**scheduler** ([FlowMatchEulerDiscreteScheduler](/docs/diffusers/main/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler)) β€” A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
300
+ * [](#diffusers.LTXImageToVideoPipeline.vae)**vae** ([AutoencoderKLLTXVideo](/docs/diffusers/main/en/api/models/autoencoderkl_ltx_video#diffusers.AutoencoderKLLTXVideo)) β€” Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
301
+ * [](#diffusers.LTXImageToVideoPipeline.text_encoder)**text\_encoder** (`T5EncoderModel`) β€” [T5](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5EncoderModel), specifically the [google/t5-v1\_1-xxl](https://huggingface.co/google/t5-v1_1-xxl) variant.
302
+ * [](#diffusers.LTXImageToVideoPipeline.tokenizer)**tokenizer** (`CLIPTokenizer`) β€” Tokenizer of class [CLIPTokenizer](https://huggingface.co/docs/transformers/en/model_doc/clip#transformers.CLIPTokenizer).
303
+ * [](#diffusers.LTXImageToVideoPipeline.tokenizer)**tokenizer** (`T5TokenizerFast`) β€” Second Tokenizer of class [T5TokenizerFast](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5TokenizerFast).
304
+
305
+ Pipeline for image-to-video generation.
306
+
307
+ Reference: [https://github.com/Lightricks/LTX-Video](https://github.com/Lightricks/LTX-Video)
308
+
309
+ #### \_\_call\_\_
310
+
311
+ [](#diffusers.LTXImageToVideoPipeline.__call__)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ltx/pipeline_ltx_image2video.py#L559)
312
+
313
+ ( image: typing.Union\[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List\[PIL.Image.Image\], typing.List\[numpy.ndarray\], typing.List\[torch.Tensor\]\] = Noneprompt: typing.Union\[str, typing.List\[str\]\] = Nonenegative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = Noneheight: int = 512width: int = 704num\_frames: int = 161frame\_rate: int = 25num\_inference\_steps: int = 50timesteps: typing.List\[int\] = Noneguidance\_scale: float = 3num\_videos\_per\_prompt: typing.Optional\[int\] = 1generator: typing.Union\[torch.\_C.Generator, typing.List\[torch.\_C.Generator\], NoneType\] = Nonelatents: typing.Optional\[torch.Tensor\] = Noneprompt\_embeds: typing.Optional\[torch.Tensor\] = Noneprompt\_attention\_mask: typing.Optional\[torch.Tensor\] = Nonenegative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = Nonenegative\_prompt\_attention\_mask: typing.Optional\[torch.Tensor\] = Nonedecode\_timestep: typing.Union\[float, typing.List\[float\]\] = 0.0decode\_noise\_scale: typing.Union\[float, typing.List\[float\], NoneType\] = Noneoutput\_type: typing.Optional\[str\] = 'pil'return\_dict: bool = Trueattention\_kwargs: typing.Optional\[typing.Dict\[str, typing.Any\]\] = Nonecallback\_on\_step\_end: typing.Optional\[typing.Callable\[\[int, int, typing.Dict\], NoneType\]\] = Nonecallback\_on\_step\_end\_tensor\_inputs: typing.List\[str\] = \['latents'\]max\_sequence\_length: int = 128 ) β†’ export const metadata = 'undefined';`~pipelines.ltx.LTXPipelineOutput` or `tuple`
314
+
315
+ Expand 23 parameters
316
+
317
+ Parameters
318
+
319
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.image)**image** (`PipelineImageInput`) β€” The input image to condition the generation on. Must be an image, a list of images or a `torch.Tensor`.
320
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds` instead.
321
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.height)**height** (`int`, defaults to `512`) β€” The height in pixels of the generated video.
322
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.width)**width** (`int`, defaults to `704`) β€” The width in pixels of the generated video.
323
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.num_frames)**num\_frames** (`int`, defaults to `161`) β€” The number of video frames to generate.
324
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.num_inference_steps)**num\_inference\_steps** (`int`, _optional_, defaults to 50) β€” The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
325
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.timesteps)**timesteps** (`List[int]`, _optional_) β€” Custom timesteps to use for the denoising process with schedulers which support a `timesteps` argument in their `set_timesteps` method. If not defined, the default behavior when `num_inference_steps` is passed will be used. Must be in descending order.
326
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.guidance_scale)**guidance\_scale** (`float`, defaults to `3`) β€” Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). `guidance_scale` is defined as `w` of equation 2 of the [Imagen Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are closely linked to the text `prompt`, usually at the expense of lower image quality.
327
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” The number of videos to generate per prompt.
328
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.generator)**generator** (`torch.Generator` or `List[torch.Generator]`, _optional_) β€” One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation deterministic.
329
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.latents)**latents** (`torch.Tensor`, _optional_) β€” Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor will be generated by sampling using the supplied random `generator`.
330
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
331
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.prompt_attention_mask)**prompt\_attention\_mask** (`torch.Tensor`, _optional_) β€” Pre-generated attention mask for text embeddings.
332
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.FloatTensor`, _optional_) β€” Pre-generated negative text embeddings. If not provided, negative\_prompt\_embeds will be generated from the `negative_prompt` input argument.
333
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.negative_prompt_attention_mask)**negative\_prompt\_attention\_mask** (`torch.FloatTensor`, _optional_) β€” Pre-generated attention mask for negative text embeddings.
334
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.decode_timestep)**decode\_timestep** (`float`, defaults to `0.0`) β€” The timestep at which generated video is decoded.
335
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.decode_noise_scale)**decode\_noise\_scale** (`float`, defaults to `None`) β€” The interpolation factor between random noise and denoised latents at the decode timestep.
336
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.output_type)**output\_type** (`str`, _optional_, defaults to `"pil"`) β€” The output format of the generated image. Choose between [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
337
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.return_dict)**return\_dict** (`bool`, _optional_, defaults to `True`) β€” Whether or not to return a `~pipelines.ltx.LTXPipelineOutput` instead of a plain tuple.
338
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.attention_kwargs)**attention\_kwargs** (`dict`, _optional_) β€” A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under `self.processor` in [diffusers.models.attention\_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
339
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.callback_on_step_end)**callback\_on\_step\_end** (`Callable`, _optional_) β€” A function that is called at the end of each denoising step during inference. The function is called with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
340
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.callback_on_step_end_tensor_inputs)**callback\_on\_step\_end\_tensor\_inputs** (`List`, _optional_) β€” The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
341
+ * [](#diffusers.LTXImageToVideoPipeline.__call__.max_sequence_length)**max\_sequence\_length** (`int`, defaults to `128`) β€” Maximum sequence length to use with the `prompt`.
342
+
343
+ Returns
344
+
345
+
346
+
347
+ `~pipelines.ltx.LTXPipelineOutput` or `tuple`
348
+
349
+
350
+
351
+ If `return_dict` is `True`, `~pipelines.ltx.LTXPipelineOutput` is returned, otherwise a `tuple` is returned where the first element is a list with the generated images.
352
+
353
+ Function invoked when calling the pipeline for generation.
354
+
355
+ [](#diffusers.LTXImageToVideoPipeline.__call__.example)
356
+
357
+ Examples:
358
+
359
+ Copied
360
+
361
+ \>>> import torch
362
+ \>>> from diffusers import LTXImageToVideoPipeline
363
+ \>>> from diffusers.utils import export\_to\_video, load\_image
364
+
365
+ \>>> pipe = LTXImageToVideoPipeline.from\_pretrained("Lightricks/LTX-Video", torch\_dtype=torch.bfloat16)
366
+ \>>> pipe.to("cuda")
367
+
368
+ \>>> image = load\_image(
369
+ ... "https://huggingface.co/datasets/a-r-r-o-w/tiny-meme-dataset-captioned/resolve/main/images/8.png"
370
+ ... )
371
+ \>>> prompt = "A young girl stands calmly in the foreground, looking directly at the camera, as a house fire rages in the background. Flames engulf the structure, with smoke billowing into the air. Firefighters in protective gear rush to the scene, a fire truck labeled '38' visible behind them. The girl's neutral expression contrasts sharply with the chaos of the fire, creating a poignant and emotionally charged scene."
372
+ \>>> negative\_prompt = "worst quality, inconsistent motion, blurry, jittery, distorted"
373
+
374
+ \>>> video = pipe(
375
+ ... image=image,
376
+ ... prompt=prompt,
377
+ ... negative\_prompt=negative\_prompt,
378
+ ... width=704,
379
+ ... height=480,
380
+ ... num\_frames=161,
381
+ ... num\_inference\_steps=50,
382
+ ... ).frames\[0\]
383
+ \>>> export\_to\_video(video, "output.mp4", fps=24)
384
+
385
+ #### encode\_prompt
386
+
387
+ [](#diffusers.LTXImageToVideoPipeline.encode_prompt)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ltx/pipeline_ltx_image2video.py#L279)
388
+
389
+ ( prompt: typing.Union\[str, typing.List\[str\]\], negative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, do\_classifier\_free\_guidance: bool = True, num\_videos\_per\_prompt: int = 1, prompt\_embeds: typing.Optional\[torch.Tensor\] = None, negative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = None, prompt\_attention\_mask: typing.Optional\[torch.Tensor\] = None, negative\_prompt\_attention\_mask: typing.Optional\[torch.Tensor\] = None, max\_sequence\_length: int = 128, device: typing.Optional\[torch.device\] = None, dtype: typing.Optional\[torch.dtype\] = None )
390
+
391
+ Parameters
392
+
393
+ * [](#diffusers.LTXImageToVideoPipeline.encode_prompt.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” prompt to be encoded
394
+ * [](#diffusers.LTXImageToVideoPipeline.encode_prompt.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
395
+ * [](#diffusers.LTXImageToVideoPipeline.encode_prompt.do_classifier_free_guidance)**do\_classifier\_free\_guidance** (`bool`, _optional_, defaults to `True`) β€” Whether to use classifier free guidance or not.
396
+ * [](#diffusers.LTXImageToVideoPipeline.encode_prompt.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” Number of videos that should be generated per prompt.
397
+ * [](#diffusers.LTXImageToVideoPipeline.encode_prompt.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
398
+ * [](#diffusers.LTXImageToVideoPipeline.encode_prompt.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
399
+ * [](#diffusers.LTXImageToVideoPipeline.encode_prompt.device)**device** β€” (`torch.device`, _optional_): torch device
400
+ * [](#diffusers.LTXImageToVideoPipeline.encode_prompt.dtype)**dtype** β€” (`torch.dtype`, _optional_): torch dtype
401
+
402
+ Encodes the prompt into text encoder hidden states.
403
+
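+ The embeddings returned by `encode_prompt()` can be precomputed once and passed back into the pipeline call, for example to reuse the same prompt across several generations. The following is a minimal sketch (not part of the upstream reference), assuming the method returns `prompt_embeds, prompt_attention_mask, negative_prompt_embeds, negative_prompt_attention_mask` in that order and that `pipe` and `image` are defined as in the example above:
+
+ Copied
+
+ \>>> prompt\_embeds, prompt\_attention\_mask, negative\_prompt\_embeds, negative\_prompt\_attention\_mask = pipe.encode\_prompt(
+ ... prompt="A young girl stands calmly in the foreground, looking directly at the camera",
+ ... negative\_prompt="worst quality, inconsistent motion, blurry, jittery, distorted",
+ ... )
+ \>>> video = pipe(
+ ... image=image,
+ ... prompt\_embeds=prompt\_embeds,
+ ... prompt\_attention\_mask=prompt\_attention\_mask,
+ ... negative\_prompt\_embeds=negative\_prompt\_embeds,
+ ... negative\_prompt\_attention\_mask=negative\_prompt\_attention\_mask,
+ ... width=704,
+ ... height=480,
+ ... num\_frames=161,
+ ... ).frames\[0\]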
404
+ [](#diffusers.pipelines.ltx.pipeline_output.LTXPipelineOutput)LTXPipelineOutput
405
+ -------------------------------------------------------------------------------
406
+
407
+ ### class diffusers.pipelines.ltx.pipeline\_output.LTXPipelineOutput
408
+
409
+ [](#diffusers.pipelines.ltx.pipeline_output.LTXPipelineOutput)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/ltx/pipeline_output.py#L8)
410
+
411
+ ( frames: Tensor )
412
+
413
+ Parameters
414
+
415
+ * [](#diffusers.pipelines.ltx.pipeline_output.LTXPipelineOutput.frames)**frames** (`torch.Tensor`, `np.ndarray`, or List\[List\[PIL.Image.Image\]\]) β€” List of video outputs. It can be a nested list of length `batch_size`, with each sub-list containing denoised PIL image sequences of length `num_frames`. It can also be a NumPy array or Torch tensor of shape `(batch_size, num_frames, channels, height, width)`.
416
+
417
+ Output class for LTX pipelines.
418
+
419
+
420
+
421
+
docs/diffusers/Using Diffusers for Wan.md ADDED
@@ -0,0 +1,307 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ [](#wan)Wan
2
+ ===========
3
+
4
+ [Wan 2.1](https://github.com/Wan-Video/Wan2.1) by the Alibaba Wan Team.
5
+
6
+ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.
7
+
8
+ Recommendations for inference:
9
+
10
+ * VAE in `torch.float32` for better decoding quality.
11
+ * `num_frames` should be of the form `4 * k + 1`, for example `49` or `81` (see the helper snippet after this list).
12
+ * For smaller resolution videos, try lower values of `shift` (between `2.0` and `5.0`) in the [Scheduler](https://huggingface.co/docs/diffusers/main/en/api/schedulers/flow_match_euler_discrete#diffusers.FlowMatchEulerDiscreteScheduler.shift). For larger resolution videos, try higher values (between `7.0` and `12.0`). The default value is `3.0` for Wan.
13
+
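+ For example, a small helper (hypothetical, not part of the library) can round a requested frame count down to the nearest `4 * k + 1` value:
+
+ Copied
+
+ def round\_num\_frames(requested: int) -> int:
+     \# Round down to the nearest 4 \* k + 1 value (e.g. 49, 81), as recommended above
+     return max(1, (requested - 1) // 4 \* 4 + 1)
+
+ round\_num\_frames(100)  \# 97
+ round\_num\_frames(81)  \# 81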
14
+ ### [](#using-a-custom-scheduler)Using a custom scheduler
15
+
16
+ Wan can be used with many different schedulers, each with their own benefits regarding speed and generation quality. By default, Wan uses the `UniPCMultistepScheduler(prediction_type="flow_prediction", use_flow_sigmas=True, flow_shift=3.0)` scheduler. You can use a different scheduler as follows:
17
+
18
+ Copied
19
+
20
+ from diffusers import FlowMatchEulerDiscreteScheduler, UniPCMultistepScheduler, WanPipeline
21
+
22
+ scheduler\_a = FlowMatchEulerDiscreteScheduler(shift=5.0)
23
+ scheduler\_b = UniPCMultistepScheduler(prediction\_type="flow\_prediction", use\_flow\_sigmas=True, flow\_shift=4.0)
24
+
25
+ pipe = WanPipeline.from\_pretrained("Wan-AI/Wan2.1-T2V-1.3B-Diffusers", scheduler=<CUSTOM\_SCHEDULER\_HERE>)
26
+
27
+ \# or,
28
+ pipe.scheduler = <CUSTOM\_SCHEDULER\_HERE>
29
+
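+ For example, to switch the loaded pipeline to the flow-matching Euler scheduler defined above:
+
+ Copied
+
+ pipe.scheduler = scheduler\_a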
30
+ ### [](#using-single-file-loading-with-wan)Using single file loading with Wan
31
+
32
+ The `WanTransformer3DModel` and `AutoencoderKLWan` models support loading checkpoints in their original format via the `from_single_file` loading method.
33
+
34
+ Copied
35
+
36
+ import torch
37
+ from diffusers import WanPipeline, WanTransformer3DModel
38
+
39
+ ckpt\_path = "https://huggingface.co/Comfy-Org/Wan\_2.1\_ComfyUI\_repackaged/blob/main/split\_files/diffusion\_models/wan2.1\_t2v\_1.3B\_bf16.safetensors"
40
+ transformer = WanTransformer3DModel.from\_single\_file(ckpt\_path, torch\_dtype=torch.bfloat16)
41
+
42
+ pipe = WanPipeline.from\_pretrained("Wan-AI/Wan2.1-T2V-1.3B-Diffusers", transformer=transformer)
43
+
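+ The VAE can be loaded the same way. Here is a sketch with placeholder paths (substitute real single-file checkpoints; keeping the VAE in `torch.float32` follows the recommendation above):
+
+ Copied
+
+ import torch
+ from diffusers import AutoencoderKLWan, WanPipeline, WanTransformer3DModel
+
+ \# Placeholder paths: point these at actual single-file checkpoints
+ vae = AutoencoderKLWan.from\_single\_file("<PATH\_OR\_URL\_TO\_VAE\_CHECKPOINT>", torch\_dtype=torch.float32)
+ transformer = WanTransformer3DModel.from\_single\_file("<PATH\_OR\_URL\_TO\_TRANSFORMER\_CHECKPOINT>", torch\_dtype=torch.bfloat16)
+
+ pipe = WanPipeline.from\_pretrained(
+     "Wan-AI/Wan2.1-T2V-1.3B-Diffusers", vae=vae, transformer=transformer, torch\_dtype=torch.bfloat16
+ )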
44
+ [](#diffusers.WanPipeline)WanPipeline
45
+ -------------------------------------
46
+
47
+ ### class diffusers.WanPipeline
48
+
49
+ [](#diffusers.WanPipeline)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/wan/pipeline_wan.py#L93)
50
+
51
+ ( tokenizer: AutoTokenizer, text\_encoder: UMT5EncoderModel, transformer: WanTransformer3DModel, vae: AutoencoderKLWan, scheduler: FlowMatchEulerDiscreteScheduler )
52
+
53
+ Parameters
54
+
55
+ * [](#diffusers.WanPipeline.tokenizer)**tokenizer** (`T5Tokenizer`) β€” Tokenizer from [T5](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5Tokenizer), specifically the [google/umt5-xxl](https://huggingface.co/google/umt5-xxl) variant.
56
+ * [](#diffusers.WanPipeline.text_encoder)**text\_encoder** (`T5EncoderModel`) β€” [T5](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5EncoderModel), specifically the [google/umt5-xxl](https://huggingface.co/google/umt5-xxl) variant.
57
+ * [](#diffusers.WanPipeline.transformer)**transformer** ([WanTransformer3DModel](/docs/diffusers/main/en/api/models/wan_transformer_3d#diffusers.WanTransformer3DModel)) β€” Conditional Transformer to denoise the input latents.
58
+ * [](#diffusers.WanPipeline.scheduler)**scheduler** ([UniPCMultistepScheduler](/docs/diffusers/main/en/api/schedulers/unipc#diffusers.UniPCMultistepScheduler)) β€” A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
59
+ * [](#diffusers.WanPipeline.vae)**vae** ([AutoencoderKLWan](/docs/diffusers/main/en/api/models/autoencoder_kl_wan#diffusers.AutoencoderKLWan)) β€” Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
60
+
61
+ Pipeline for text-to-video generation using Wan.
62
+
63
+ This model inherits from [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
64
+
65
+ #### \_\_call\_\_
66
+
67
+ [](#diffusers.WanPipeline.__call__)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/wan/pipeline_wan.py#L359)
68
+
69
+ ( prompt: typing.Union\[str, typing.List\[str\]\] = None, negative\_prompt: typing.Union\[str, typing.List\[str\]\] = None, height: int = 480, width: int = 832, num\_frames: int = 81, num\_inference\_steps: int = 50, guidance\_scale: float = 5.0, num\_videos\_per\_prompt: typing.Optional\[int\] = 1, generator: typing.Union\[torch.\_C.Generator, typing.List\[torch.\_C.Generator\], NoneType\] = None, latents: typing.Optional\[torch.Tensor\] = None, prompt\_embeds: typing.Optional\[torch.Tensor\] = None, negative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = None, output\_type: typing.Optional\[str\] = 'np', return\_dict: bool = True, attention\_kwargs: typing.Optional\[typing.Dict\[str, typing.Any\]\] = None, callback\_on\_step\_end: typing.Union\[typing.Callable\[\[int, int, typing.Dict\], NoneType\], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType\] = None, callback\_on\_step\_end\_tensor\_inputs: typing.List\[str\] = \['latents'\], max\_sequence\_length: int = 512 ) → `~WanPipelineOutput` or `tuple`
70
+
71
+ Expand 16 parameters
72
+
73
+ Parameters
74
+
75
+ * [](#diffusers.WanPipeline.__call__.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds` instead.
76
+ * [](#diffusers.WanPipeline.__call__.height)**height** (`int`, defaults to `480`) β€” The height in pixels of the generated image.
77
+ * [](#diffusers.WanPipeline.__call__.width)**width** (`int`, defaults to `832`) β€” The width in pixels of the generated image.
78
+ * [](#diffusers.WanPipeline.__call__.num_frames)**num\_frames** (`int`, defaults to `81`) β€” The number of frames in the generated video.
79
+ * [](#diffusers.WanPipeline.__call__.num_inference_steps)**num\_inference\_steps** (`int`, defaults to `50`) β€” The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
80
+ * [](#diffusers.WanPipeline.__call__.guidance_scale)**guidance\_scale** (`float`, defaults to `5.0`) β€” Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). `guidance_scale` is defined as `w` of equation 2 of the [Imagen Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are closely linked to the text `prompt`, usually at the expense of lower image quality.
81
+ * [](#diffusers.WanPipeline.__call__.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” The number of videos to generate per prompt.
82
+ * [](#diffusers.WanPipeline.__call__.generator)**generator** (`torch.Generator` or `List[torch.Generator]`, _optional_) β€” A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation deterministic.
83
+ * [](#diffusers.WanPipeline.__call__.latents)**latents** (`torch.Tensor`, _optional_) β€” Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random `generator`.
84
+ * [](#diffusers.WanPipeline.__call__.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the `prompt` input argument.
85
+ * [](#diffusers.WanPipeline.__call__.output_type)**output\_type** (`str`, _optional_, defaults to `"np"`) β€” The output format of the generated image. Choose between `PIL.Image` or `np.array`.
86
+ * [](#diffusers.WanPipeline.__call__.return_dict)**return\_dict** (`bool`, _optional_, defaults to `True`) β€” Whether or not to return a `WanPipelineOutput` instead of a plain tuple.
87
+ * [](#diffusers.WanPipeline.__call__.attention_kwargs)**attention\_kwargs** (`dict`, _optional_) β€” A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under `self.processor` in [diffusers.models.attention\_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
88
+ * [](#diffusers.WanPipeline.__call__.callback_on_step_end)**callback\_on\_step\_end** (`Callable`, `PipelineCallback`, `MultiPipelineCallbacks`, _optional_) β€” A function or a subclass of `PipelineCallback` or `MultiPipelineCallbacks` that is called at the end of each denoising step during inference, with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
89
+ * [](#diffusers.WanPipeline.__call__.callback_on_step_end_tensor_inputs)**callback\_on\_step\_end\_tensor\_inputs** (`List`, _optional_) β€” The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
90
+ * [](#diffusers.WanPipeline.__call__.autocast_dtype)**autocast\_dtype** (`torch.dtype`, _optional_, defaults to `torch.bfloat16`) β€” The dtype to use for the torch.amp.autocast.
91
+
92
+ Returns
93
+
94
+
95
+
96
+ `~WanPipelineOutput` or `tuple`
97
+
98
+
99
+
100
+ If `return_dict` is `True`, `WanPipelineOutput` is returned, otherwise a `tuple` is returned where the first element is a list with the generated videos.
101
+
102
+ The call function to the pipeline for generation.
103
+
104
+ [](#diffusers.WanPipeline.__call__.example)
105
+
106
+ Examples:
107
+
108
+ Copied
109
+
110
+ \>>> import torch
111
+ \>>> from diffusers.utils import export\_to\_video
112
+ \>>> from diffusers import AutoencoderKLWan, WanPipeline
113
+ \>>> from diffusers.schedulers.scheduling\_unipc\_multistep import UniPCMultistepScheduler
114
+
115
+ \>>> \# Available models: Wan-AI/Wan2.1-T2V-14B-Diffusers, Wan-AI/Wan2.1-T2V-1.3B-Diffusers
116
+ \>>> model\_id = "Wan-AI/Wan2.1-T2V-14B-Diffusers"
117
+ \>>> vae = AutoencoderKLWan.from\_pretrained(model\_id, subfolder="vae", torch\_dtype=torch.float32)
118
+ \>>> pipe = WanPipeline.from\_pretrained(model\_id, vae=vae, torch\_dtype=torch.bfloat16)
119
+ \>>> flow\_shift = 5.0 \# 5.0 for 720P, 3.0 for 480P
120
+ \>>> pipe.scheduler = UniPCMultistepScheduler.from\_config(pipe.scheduler.config, flow\_shift=flow\_shift)
121
+ \>>> pipe.to("cuda")
122
+
123
+ \>>> prompt = "A cat and a dog baking a cake together in a kitchen. The cat is carefully measuring flour, while the dog is stirring the batter with a wooden spoon. The kitchen is cozy, with sunlight streaming through the window."
124
+ \>>> negative\_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
125
+
126
+ \>>> output = pipe(
127
+ ... prompt=prompt,
128
+ ... negative\_prompt=negative\_prompt,
129
+ ... height=720,
130
+ ... width=1280,
131
+ ... num\_frames=81,
132
+ ... guidance\_scale=5.0,
133
+ ... ).frames\[0\]
134
+ \>>> export\_to\_video(output, "output.mp4", fps=16)
135
+
136
+ #### encode\_prompt
137
+
138
+ [](#diffusers.WanPipeline.encode_prompt)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/wan/pipeline_wan.py#L181)
139
+
140
+ ( prompt: typing.Union\[str, typing.List\[str\]\], negative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, do\_classifier\_free\_guidance: bool = True, num\_videos\_per\_prompt: int = 1, prompt\_embeds: typing.Optional\[torch.Tensor\] = None, negative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = None, max\_sequence\_length: int = 226, device: typing.Optional\[torch.device\] = None, dtype: typing.Optional\[torch.dtype\] = None )
141
+
142
+ Parameters
143
+
144
+ * [](#diffusers.WanPipeline.encode_prompt.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” prompt to be encoded
145
+ * [](#diffusers.WanPipeline.encode_prompt.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
146
+ * [](#diffusers.WanPipeline.encode_prompt.do_classifier_free_guidance)**do\_classifier\_free\_guidance** (`bool`, _optional_, defaults to `True`) β€” Whether to use classifier free guidance or not.
147
+ * [](#diffusers.WanPipeline.encode_prompt.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” Number of videos that should be generated per prompt.
148
+ * [](#diffusers.WanPipeline.encode_prompt.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
149
+ * [](#diffusers.WanPipeline.encode_prompt.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
150
+ * [](#diffusers.WanPipeline.encode_prompt.device)**device** β€” (`torch.device`, _optional_): torch device
151
+ * [](#diffusers.WanPipeline.encode_prompt.dtype)**dtype** β€” (`torch.dtype`, _optional_): torch dtype
152
+
153
+ Encodes the prompt into text encoder hidden states.
154
+
155
+ [](#diffusers.WanImageToVideoPipeline)WanImageToVideoPipeline
156
+ -------------------------------------------------------------
157
+
158
+ ### class diffusers.WanImageToVideoPipeline
159
+
160
+ [](#diffusers.WanImageToVideoPipeline)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/wan/pipeline_wan_i2v.py#L124)
161
+
162
+ ( tokenizer: AutoTokenizer, text\_encoder: UMT5EncoderModel, image\_encoder: CLIPVisionModel, image\_processor: CLIPImageProcessor, transformer: WanTransformer3DModel, vae: AutoencoderKLWan, scheduler: FlowMatchEulerDiscreteScheduler )
163
+
164
+ Parameters
165
+
166
+ * [](#diffusers.WanImageToVideoPipeline.tokenizer)**tokenizer** (`T5Tokenizer`) β€” Tokenizer from [T5](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5Tokenizer), specifically the [google/umt5-xxl](https://huggingface.co/google/umt5-xxl) variant.
167
+ * [](#diffusers.WanImageToVideoPipeline.text_encoder)**text\_encoder** (`T5EncoderModel`) β€” [T5](https://huggingface.co/docs/transformers/en/model_doc/t5#transformers.T5EncoderModel), specifically the [google/umt5-xxl](https://huggingface.co/google/umt5-xxl) variant.
168
+ * [](#diffusers.WanImageToVideoPipeline.image_encoder)**image\_encoder** (`CLIPVisionModel`) β€” [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPVisionModel), specifically the [clip-vit-huge-patch14](https://github.com/mlfoundations/open_clip/blob/main/docs/PRETRAINED.md#vit-h14-xlm-roberta-large) variant.
169
+ * [](#diffusers.WanImageToVideoPipeline.transformer)**transformer** ([WanTransformer3DModel](/docs/diffusers/main/en/api/models/wan_transformer_3d#diffusers.WanTransformer3DModel)) β€” Conditional Transformer to denoise the input latents.
170
+ * [](#diffusers.WanImageToVideoPipeline.scheduler)**scheduler** ([UniPCMultistepScheduler](/docs/diffusers/main/en/api/schedulers/unipc#diffusers.UniPCMultistepScheduler)) β€” A scheduler to be used in combination with `transformer` to denoise the encoded image latents.
171
+ * [](#diffusers.WanImageToVideoPipeline.vae)**vae** ([AutoencoderKLWan](/docs/diffusers/main/en/api/models/autoencoder_kl_wan#diffusers.AutoencoderKLWan)) β€” Variational Auto-Encoder (VAE) Model to encode and decode videos to and from latent representations.
172
+
173
+ Pipeline for image-to-video generation using Wan.
174
+
175
+ This model inherits from [DiffusionPipeline](/docs/diffusers/main/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods implemented for all pipelines (downloading, saving, running on a particular device, etc.).
176
+
177
+ #### \_\_call\_\_
178
+
179
+ [](#diffusers.WanImageToVideoPipeline.__call__)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/wan/pipeline_wan_i2v.py#L441)
180
+
181
+ ( image: typing.Union\[PIL.Image.Image, numpy.ndarray, torch.Tensor, typing.List\[PIL.Image.Image\], typing.List\[numpy.ndarray\], typing.List\[torch.Tensor\]\], prompt: typing.Union\[str, typing.List\[str\]\] = None, negative\_prompt: typing.Union\[str, typing.List\[str\]\] = None, height: int = 480, width: int = 832, num\_frames: int = 81, num\_inference\_steps: int = 50, guidance\_scale: float = 5.0, num\_videos\_per\_prompt: typing.Optional\[int\] = 1, generator: typing.Union\[torch.\_C.Generator, typing.List\[torch.\_C.Generator\], NoneType\] = None, latents: typing.Optional\[torch.Tensor\] = None, prompt\_embeds: typing.Optional\[torch.Tensor\] = None, negative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = None, output\_type: typing.Optional\[str\] = 'np', return\_dict: bool = True, attention\_kwargs: typing.Optional\[typing.Dict\[str, typing.Any\]\] = None, callback\_on\_step\_end: typing.Union\[typing.Callable\[\[int, int, typing.Dict\], NoneType\], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType\] = None, callback\_on\_step\_end\_tensor\_inputs: typing.List\[str\] = \['latents'\], max\_sequence\_length: int = 512 ) → `~WanPipelineOutput` or `tuple`
182
+
183
+ Expand 20 parameters
184
+
185
+ Parameters
186
+
187
+ * [](#diffusers.WanImageToVideoPipeline.__call__.image)**image** (`PipelineImageInput`) β€” The input image to condition the generation on. Must be an image, a list of images or a `torch.Tensor`.
188
+ * [](#diffusers.WanImageToVideoPipeline.__call__.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds` instead.
189
+ * [](#diffusers.WanImageToVideoPipeline.__call__.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
190
+ * [](#diffusers.WanImageToVideoPipeline.__call__.height)**height** (`int`, defaults to `480`) β€” The height of the generated video.
191
+ * [](#diffusers.WanImageToVideoPipeline.__call__.width)**width** (`int`, defaults to `832`) β€” The width of the generated video.
192
+ * [](#diffusers.WanImageToVideoPipeline.__call__.num_frames)**num\_frames** (`int`, defaults to `81`) β€” The number of frames in the generated video.
193
+ * [](#diffusers.WanImageToVideoPipeline.__call__.num_inference_steps)**num\_inference\_steps** (`int`, defaults to `50`) β€” The number of denoising steps. More denoising steps usually lead to a higher quality image at the expense of slower inference.
194
+ * [](#diffusers.WanImageToVideoPipeline.__call__.guidance_scale)**guidance\_scale** (`float`, defaults to `5.0`) β€” Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598). `guidance_scale` is defined as `w` of equation 2 of the [Imagen Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale > 1`. A higher guidance scale encourages the model to generate images that are closely linked to the text `prompt`, usually at the expense of lower image quality.
195
+ * [](#diffusers.WanImageToVideoPipeline.__call__.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” The number of videos to generate per prompt.
196
+ * [](#diffusers.WanImageToVideoPipeline.__call__.generator)**generator** (`torch.Generator` or `List[torch.Generator]`, _optional_) β€” A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make generation deterministic.
197
+ * [](#diffusers.WanImageToVideoPipeline.__call__.latents)**latents** (`torch.Tensor`, _optional_) β€” Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image generation. Can be used to tweak the same generation with different prompts. If not provided, a latents tensor is generated by sampling using the supplied random `generator`.
198
+ * [](#diffusers.WanImageToVideoPipeline.__call__.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not provided, text embeddings are generated from the `prompt` input argument.
199
+ * [](#diffusers.WanImageToVideoPipeline.__call__.output_type)**output\_type** (`str`, _optional_, defaults to `"np"`) β€” The output format of the generated image. Choose between `PIL.Image` or `np.array`.
200
+ * [](#diffusers.WanImageToVideoPipeline.__call__.return_dict)**return\_dict** (`bool`, _optional_, defaults to `True`) β€” Whether or not to return a `WanPipelineOutput` instead of a plain tuple.
201
+ * [](#diffusers.WanImageToVideoPipeline.__call__.attention_kwargs)**attention\_kwargs** (`dict`, _optional_) β€” A kwargs dictionary that if specified is passed along to the `AttentionProcessor` as defined under `self.processor` in [diffusers.models.attention\_processor](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py).
202
+ * [](#diffusers.WanImageToVideoPipeline.__call__.callback_on_step_end)**callback\_on\_step\_end** (`Callable`, `PipelineCallback`, `MultiPipelineCallbacks`, _optional_) β€” A function or a subclass of `PipelineCallback` or `MultiPipelineCallbacks` that is called at the end of each denoising step during inference, with the following arguments: `callback_on_step_end(self: DiffusionPipeline, step: int, timestep: int, callback_kwargs: Dict)`. `callback_kwargs` will include a list of all tensors as specified by `callback_on_step_end_tensor_inputs`.
203
+ * [](#diffusers.WanImageToVideoPipeline.__call__.callback_on_step_end_tensor_inputs)**callback\_on\_step\_end\_tensor\_inputs** (`List`, _optional_) β€” The list of tensor inputs for the `callback_on_step_end` function. The tensors specified in the list will be passed as `callback_kwargs` argument. You will only be able to include variables listed in the `._callback_tensor_inputs` attribute of your pipeline class.
204
+ * [](#diffusers.WanImageToVideoPipeline.__call__.max_sequence_length)**max\_sequence\_length** (`int`, _optional_, defaults to `512`) β€” The maximum sequence length of the prompt.
205
+ * [](#diffusers.WanImageToVideoPipeline.__call__.shift)**shift** (`float`, _optional_, defaults to `5.0`) β€” The shift of the flow.
206
+ * [](#diffusers.WanImageToVideoPipeline.__call__.autocast_dtype)**autocast\_dtype** (`torch.dtype`, _optional_, defaults to `torch.bfloat16`) β€” The dtype to use for the torch.amp.autocast.
207
+
208
+ Returns
209
+
210
+
211
+
212
+ `~WanPipelineOutput` or `tuple`
213
+
214
+
215
+
216
+ If `return_dict` is `True`, `WanPipelineOutput` is returned, otherwise a `tuple` is returned where the first element is a list with the generated videos.
217
+
218
+ The call function to the pipeline for generation.
219
+
220
+ [](#diffusers.WanImageToVideoPipeline.__call__.example)
221
+
222
+ Examples:
223
+
224
+ Copied
225
+
226
+ \>>> import torch
227
+ \>>> import numpy as np
228
+ \>>> from diffusers import AutoencoderKLWan, WanImageToVideoPipeline
229
+ \>>> from diffusers.utils import export\_to\_video, load\_image
230
+ \>>> from transformers import CLIPVisionModel
231
+
232
+ \>>> \# Available models: Wan-AI/Wan2.1-I2V-14B-480P-Diffusers, Wan-AI/Wan2.1-I2V-14B-720P-Diffusers
233
+ \>>> model\_id = "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers"
234
+ \>>> image\_encoder = CLIPVisionModel.from\_pretrained(
235
+ ... model\_id, subfolder="image\_encoder", torch\_dtype=torch.float32
236
+ ... )
237
+ \>>> vae = AutoencoderKLWan.from\_pretrained(model\_id, subfolder="vae", torch\_dtype=torch.float32)
238
+ \>>> pipe = WanImageToVideoPipeline.from\_pretrained(
239
+ ... model\_id, vae=vae, image\_encoder=image\_encoder, torch\_dtype=torch.bfloat16
240
+ ... )
241
+ \>>> pipe.to("cuda")
242
+
243
+ \>>> image = load\_image(
244
+ ... "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
245
+ ... )
246
+ \>>> max\_area = 480 \* 832
247
+ \>>> aspect\_ratio = image.height / image.width
248
+ \>>> mod\_value = pipe.vae\_scale\_factor\_spatial \* pipe.transformer.config.patch\_size\[1\]
249
+ \>>> height = round(np.sqrt(max\_area \* aspect\_ratio)) // mod\_value \* mod\_value
250
+ \>>> width = round(np.sqrt(max\_area / aspect\_ratio)) // mod\_value \* mod\_value
251
+ \>>> image = image.resize((width, height))
252
+ \>>> prompt = (
253
+ ... "An astronaut hatching from an egg, on the surface of the moon, the darkness and depth of space realised in "
254
+ ... "the background. High quality, ultrarealistic detail and breath-taking movie-like camera shot."
255
+ ... )
256
+ \>>> negative\_prompt = "Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality, low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured, misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards"
257
+
258
+ \>>> output = pipe(
259
+ ... image=image,
260
+ ... prompt=prompt,
261
+ ... negative\_prompt=negative\_prompt,
262
+ ... height=height,
263
+ ... width=width,
264
+ ... num\_frames=81,
265
+ ... guidance\_scale=5.0,
266
+ ... ).frames\[0\]
267
+ \>>> export\_to\_video(output, "output.mp4", fps=16)
268
+
269
+ #### encode\_prompt
270
+
271
+ [](#diffusers.WanImageToVideoPipeline.encode_prompt)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/wan/pipeline_wan_i2v.py#L228)
272
+
273
+ ( prompt: typing.Union\[str, typing.List\[str\]\], negative\_prompt: typing.Union\[str, typing.List\[str\], NoneType\] = None, do\_classifier\_free\_guidance: bool = True, num\_videos\_per\_prompt: int = 1, prompt\_embeds: typing.Optional\[torch.Tensor\] = None, negative\_prompt\_embeds: typing.Optional\[torch.Tensor\] = None, max\_sequence\_length: int = 226, device: typing.Optional\[torch.device\] = None, dtype: typing.Optional\[torch.dtype\] = None )
274
+
275
+ Parameters
276
+
277
+ * [](#diffusers.WanImageToVideoPipeline.encode_prompt.prompt)**prompt** (`str` or `List[str]`, _optional_) β€” prompt to be encoded
278
+ * [](#diffusers.WanImageToVideoPipeline.encode_prompt.negative_prompt)**negative\_prompt** (`str` or `List[str]`, _optional_) β€” The prompt or prompts not to guide the image generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).
279
+ * [](#diffusers.WanImageToVideoPipeline.encode_prompt.do_classifier_free_guidance)**do\_classifier\_free\_guidance** (`bool`, _optional_, defaults to `True`) β€” Whether to use classifier free guidance or not.
280
+ * [](#diffusers.WanImageToVideoPipeline.encode_prompt.num_videos_per_prompt)**num\_videos\_per\_prompt** (`int`, _optional_, defaults to 1) β€” Number of videos that should be generated per prompt.
281
+ * [](#diffusers.WanImageToVideoPipeline.encode_prompt.prompt_embeds)**prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.
282
+ * [](#diffusers.WanImageToVideoPipeline.encode_prompt.negative_prompt_embeds)**negative\_prompt\_embeds** (`torch.Tensor`, _optional_) β€” Pre-generated negative text embeddings. Can be used to easily tweak text inputs, _e.g._ prompt weighting. If not provided, negative\_prompt\_embeds will be generated from `negative_prompt` input argument.
283
+ * [](#diffusers.WanImageToVideoPipeline.encode_prompt.device)**device** β€” (`torch.device`, _optional_): torch device
284
+ * [](#diffusers.WanImageToVideoPipeline.encode_prompt.dtype)**dtype** β€” (`torch.dtype`, _optional_): torch dtype
285
+
286
+ Encodes the prompt into text encoder hidden states.
287
+
288
+ [](#diffusers.pipelines.wan.pipeline_output.WanPipelineOutput)WanPipelineOutput
289
+ -------------------------------------------------------------------------------
290
+
291
+ ### class diffusers.pipelines.wan.pipeline\_output.WanPipelineOutput
292
+
293
+ [](#diffusers.pipelines.wan.pipeline_output.WanPipelineOutput)[< source \>](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/wan/pipeline_output.py#L8)
294
+
295
+ ( frames: Tensor )
296
+
297
+ Parameters
298
+
299
+ * [](#diffusers.pipelines.wan.pipeline_output.WanPipelineOutput.frames)**frames** (`torch.Tensor`, `np.ndarray`, or List\[List\[PIL.Image.Image\]\]) β€” List of video outputs. It can be a nested list of length `batch_size`, with each sub-list containing denoised PIL image sequences of length `num_frames`. It can also be a NumPy array or Torch tensor of shape `(batch_size, num_frames, channels, height, width)`.
300
+
301
+ Output class for Wan pipelines.
302
+
303
+
304
+
305
+
306
+
307
+
finetrainers/args.py CHANGED
@@ -447,7 +447,7 @@ class BaseArgs:
447
  }
448
 
449
  training_arguments = {
450
- "training_type": self.training_type,
451
  "seed": self.seed,
452
  "batch_size": self.batch_size,
453
  "train_steps": self.train_steps,
 
447
  }
448
 
449
  training_arguments = {
450
+ "training_type":self.training_type,
451
  "seed": self.seed,
452
  "batch_size": self.batch_size,
453
  "train_steps": self.train_steps,
vms/config.py CHANGED
@@ -497,7 +497,7 @@ class TrainingConfig:
497
  args.extend(["--flow_mode_scale", str(self.flow_mode_scale)])
498
 
499
  # Training arguments
500
- args.extend(["--training_type", self.training_type])
501
  args.extend(["--seed", str(self.seed)])
502
 
503
  # We don't use this, because mixed precision is handled by accelerate launch, not by the training script itself.
@@ -507,7 +507,7 @@ class TrainingConfig:
507
  args.extend(["--train_steps", str(self.train_steps)])
508
 
509
  # LoRA specific arguments
510
- if self.training_type == "lora":
511
  args.extend(["--rank", str(self.lora_rank)])
512
  args.extend(["--lora_alpha", str(self.lora_alpha)])
513
  args.extend(["--target_modules"] + self.target_modules)
 
497
  args.extend(["--flow_mode_scale", str(self.flow_mode_scale)])
498
 
499
  # Training arguments
500
+ args.extend(["--training_type",self.training_type])
501
  args.extend(["--seed", str(self.seed)])
502
 
503
  # We don't use this, because mixed precision is handled by accelerate launch, not by the training script itself.
 
507
  args.extend(["--train_steps", str(self.train_steps)])
508
 
509
  # LoRA specific arguments
510
+ if self.training_type == "lora":
511
  args.extend(["--rank", str(self.lora_rank)])
512
  args.extend(["--lora_alpha", str(self.lora_alpha)])
513
  args.extend(["--target_modules"] + self.target_modules)
vms/services/__init__.py CHANGED
@@ -1,14 +1,16 @@
1
- from .captioner import CaptioningProgress, CaptioningService
2
- from .importer import ImportService
3
  from .monitoring import MonitoringService
4
- from .splitter import SplittingService
5
- from .trainer import TrainingService
 
6
 
7
  __all__ = [
8
  'CaptioningProgress',
9
  'CaptioningService',
10
- 'ImportService',
11
  'MonitoringService',
12
  'SplittingService',
 
13
  'TrainingService',
14
  ]
 
1
+ from .captioning import CaptioningProgress, CaptioningService
2
+ from .importing import ImportingService
3
  from .monitoring import MonitoringService
4
+ from .splitting import SplittingService
5
+ from .previewing import PreviewingService
6
+ from .training import TrainingService
7
 
8
  __all__ = [
9
  'CaptioningProgress',
10
  'CaptioningService',
11
+ 'ImportingService',
12
  'MonitoringService',
13
  'SplittingService',
14
+ 'PreviewingService',
15
  'TrainingService',
16
  ]
vms/services/{captioner.py β†’ captioning.py} RENAMED
File without changes
vms/services/{importer β†’ importing}/__init__.py RENAMED
@@ -3,9 +3,9 @@ Import module for Video Model Studio.
3
  Handles file uploads, YouTube downloads, and Hugging Face Hub dataset integration.
4
  """
5
 
6
- from .import_service import ImportService
7
  from .file_upload import FileUploadHandler
8
  from .youtube import YouTubeDownloader
9
  from .hub_dataset import HubDatasetBrowser
10
 
11
- __all__ = ['ImportService', 'FileUploadHandler', 'YouTubeDownloader', 'HubDatasetBrowser']
 
3
  Handles file uploads, YouTube downloads, and Hugging Face Hub dataset integration.
4
  """
5
 
6
+ from .import_service import ImportingService
7
  from .file_upload import FileUploadHandler
8
  from .youtube import YouTubeDownloader
9
  from .hub_dataset import HubDatasetBrowser
10
 
11
+ __all__ = ['ImportingService', 'FileUploadHandler', 'YouTubeDownloader', 'HubDatasetBrowser']
vms/services/{importer β†’ importing}/file_upload.py RENAMED
File without changes
vms/services/{importer β†’ importing}/hub_dataset.py RENAMED
File without changes
vms/services/{importer β†’ importing}/import_service.py RENAMED
@@ -17,7 +17,7 @@ from vms.config import HF_API_TOKEN
17
 
18
  logger = logging.getLogger(__name__)
19
 
20
- class ImportService:
21
  """Main service class for handling imports from various sources"""
22
 
23
  def __init__(self):
 
17
 
18
  logger = logging.getLogger(__name__)
19
 
20
+ class ImportingService:
21
  """Main service class for handling imports from various sources"""
22
 
23
  def __init__(self):
vms/services/{importer β†’ importing}/youtube.py RENAMED
File without changes
vms/services/previewing.py ADDED
@@ -0,0 +1,406 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Preview service for Video Model Studio
3
+
4
+ Handles the video generation logic and model integration
5
+ """
6
+
7
+ import logging
8
+ import tempfile
9
+ import torch
10
+ from pathlib import Path
11
+ from typing import Dict, Any, List, Optional, Tuple, Callable
12
+
13
+ from vms.config import (
14
+ OUTPUT_PATH, STORAGE_PATH, MODEL_TYPES, TRAINING_PATH,
15
+ DEFAULT_PROMPT_PREFIX
16
+ )
17
+ from vms.utils import format_time
18
+
19
+ logger = logging.getLogger(__name__)
20
+
21
+ class PreviewingService:
22
+ """Handles the video generation logic and model integration"""
23
+
24
+ def __init__(self):
25
+ """Initialize the preview service"""
26
+ pass
27
+
28
+ def find_latest_lora_weights(self) -> Optional[str]:
29
+ """Find the latest LoRA weights file"""
30
+ try:
31
+ lora_path = OUTPUT_PATH / "pytorch_lora_weights.safetensors"
32
+ if lora_path.exists():
33
+ return str(lora_path)
34
+
35
+ # If not found in the expected location, try to find in checkpoints
36
+ checkpoints = list(OUTPUT_PATH.glob("checkpoint-*"))
37
+ if not checkpoints:
38
+ return None
39
+
40
+ latest_checkpoint = max(checkpoints, key=lambda x: int(x.name.split("-")[1]))
41
+ lora_path = latest_checkpoint / "pytorch_lora_weights.safetensors"
42
+
43
+ if lora_path.exists():
44
+ return str(lora_path)
45
+
46
+ return None
47
+ except Exception as e:
48
+ logger.error(f"Error finding LoRA weights: {e}")
49
+ return None
50
+
51
+ def generate_video(
52
+ self,
53
+ model_type: str,
54
+ prompt: str,
55
+ negative_prompt: str,
56
+ prompt_prefix: str,
57
+ width: int,
58
+ height: int,
59
+ num_frames: int,
60
+ guidance_scale: float,
61
+ flow_shift: float,
62
+ lora_weight: float,
63
+ inference_steps: int,
64
+ enable_cpu_offload: bool,
65
+ fps: int
66
+ ) -> Tuple[Optional[str], str, str]:
67
+ """Generate a video using the trained model"""
68
+ try:
69
+ log_messages = []
70
+
71
+ def log(msg: str):
72
+ log_messages.append(msg)
73
+ logger.info(msg)
74
+ return "\n".join(log_messages)
75
+
76
+ # Find latest LoRA weights
77
+ lora_path = self.find_latest_lora_weights()
78
+ if not lora_path:
79
+ return None, "Error: No LoRA weights found", log("Error: No LoRA weights found in output directory")
80
+
81
+ # Add prefix to prompt
82
+ if prompt_prefix and not prompt.startswith(prompt_prefix):
83
+ full_prompt = f"{prompt_prefix}{prompt}"
84
+ else:
85
+ full_prompt = prompt
86
+
87
+ # Create correct num_frames (should be 8*k + 1)
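+ # Note: 8 * k + 1 frame counts also satisfy the 4 * k + 1 constraint recommended for Wan (8k + 1 = 4(2k) + 1)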
88
+ adjusted_num_frames = ((num_frames - 1) // 8) * 8 + 1
89
+ if adjusted_num_frames != num_frames:
90
+ log(f"Adjusted number of frames from {num_frames} to {adjusted_num_frames} to match model requirements")
91
+ num_frames = adjusted_num_frames
92
+
93
+ # Get model type (internal name)
94
+ internal_model_type = MODEL_TYPES.get(model_type)
95
+ if not internal_model_type:
96
+ return None, f"Error: Invalid model type {model_type}", log(f"Error: Invalid model type {model_type}")
97
+
98
+ log(f"Generating video with model type: {internal_model_type}")
99
+ log(f"Using LoRA weights from: {lora_path}")
100
+ log(f"Resolution: {width}x{height}, Frames: {num_frames}, FPS: {fps}")
101
+ log(f"Guidance Scale: {guidance_scale}, Flow Shift: {flow_shift}, LoRA Weight: {lora_weight}")
102
+ log(f"Prompt: {full_prompt}")
103
+ log(f"Negative Prompt: {negative_prompt}")
104
+
105
+ # Import required components based on model type
106
+ if internal_model_type == "wan":
107
+ return self.generate_wan_video(
108
+ full_prompt, negative_prompt, width, height, num_frames,
109
+ guidance_scale, flow_shift, lora_path, lora_weight,
110
+ inference_steps, enable_cpu_offload, fps, log
111
+ )
112
+ elif internal_model_type == "ltx_video":
113
+ return self.generate_ltx_video(
114
+ full_prompt, negative_prompt, width, height, num_frames,
115
+ guidance_scale, flow_shift, lora_path, lora_weight,
116
+ inference_steps, enable_cpu_offload, fps, log
117
+ )
118
+ elif internal_model_type == "hunyuan_video":
119
+ return self.generate_hunyuan_video(
120
+ full_prompt, negative_prompt, width, height, num_frames,
121
+ guidance_scale, flow_shift, lora_path, lora_weight,
122
+ inference_steps, enable_cpu_offload, fps, log
123
+ )
124
+ else:
125
+ return None, f"Error: Unsupported model type {internal_model_type}", log(f"Error: Unsupported model type {internal_model_type}")
126
+
127
+ except Exception as e:
128
+ logger.exception("Error generating video")
129
+ return None, f"Error: {str(e)}", f"Exception occurred: {str(e)}"
130
+
131
+ def generate_wan_video(
132
+ self,
133
+ prompt: str,
134
+ negative_prompt: str,
135
+ width: int,
136
+ height: int,
137
+ num_frames: int,
138
+ guidance_scale: float,
139
+ flow_shift: float,
140
+ lora_path: str,
141
+ lora_weight: float,
142
+ inference_steps: int,
143
+ enable_cpu_offload: bool,
144
+ fps: int,
145
+ log_fn: Callable
146
+ ) -> Tuple[Optional[str], str, str]:
147
+ """Generate video using Wan model"""
148
+ start_time = torch.cuda.Event(enable_timing=True)
149
+ end_time = torch.cuda.Event(enable_timing=True)
150
+
151
+ try:
152
+ import torch
153
+ from diffusers import AutoencoderKLWan, WanPipeline
154
+ from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler
155
+ from diffusers.utils import export_to_video
156
+
157
+ log_fn("Importing Wan model components...")
158
+
159
+ # Use the smaller model for faster inference
160
+ model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
161
+
162
+ log_fn(f"Loading VAE from {model_id}...")
163
+ vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
164
+
165
+ log_fn(f"Loading transformer from {model_id}...")
166
+ pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
167
+
168
+ log_fn(f"Configuring scheduler with flow_shift={flow_shift}...")
169
+ pipe.scheduler = UniPCMultistepScheduler.from_config(
170
+ pipe.scheduler.config,
171
+ flow_shift=flow_shift
172
+ )
173
+
174
+ log_fn("Moving pipeline to CUDA device...")
175
+ pipe.to("cuda")
176
+
177
+ if enable_cpu_offload:
178
+ log_fn("Enabling model CPU offload...")
179
+ pipe.enable_model_cpu_offload()
180
+
181
+ log_fn(f"Loading LoRA weights from {lora_path} with weight {lora_weight}...")
182
+ pipe.load_lora_weights(lora_path)
183
+ pipe.fuse_lora(lora_weight)
184
+
185
+ # Create temporary file for the output
186
+ with tempfile.NamedTemporaryFile(suffix='.mp4', delete=False) as temp_file:
187
+ output_path = temp_file.name
188
+
189
+ log_fn("Starting video generation...")
190
+ start_time.record()
191
+
192
+ output = pipe(
193
+ prompt=prompt,
194
+ negative_prompt=negative_prompt,
195
+ height=height,
196
+ width=width,
197
+ num_frames=num_frames,
198
+ guidance_scale=guidance_scale,
199
+ num_inference_steps=inference_steps,
200
+ ).frames[0]
201
+
202
+ end_time.record()
203
+ torch.cuda.synchronize()
204
+ generation_time = start_time.elapsed_time(end_time) / 1000 # Convert to seconds
205
+
206
+ log_fn(f"Video generation completed in {format_time(generation_time)}")
207
+ log_fn(f"Exporting video to {output_path}...")
208
+
209
+ export_to_video(output, output_path, fps=fps)
210
+
211
+ log_fn("Video generation and export completed successfully!")
212
+
213
+ # Clean up CUDA memory
214
+ pipe = None
215
+ torch.cuda.empty_cache()
216
+
217
+ return output_path, "Video generated successfully!", log_fn(f"Generation completed in {format_time(generation_time)}")
218
+
219
+ except Exception as e:
220
+ log_fn(f"Error generating video with Wan: {str(e)}")
221
+ # Clean up CUDA memory
222
+ torch.cuda.empty_cache()
223
+ return None, f"Error: {str(e)}", log_fn(f"Exception occurred: {str(e)}")
224
+
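The Wan branch above boils down to a scheduler swap: the pipeline is loaded once and its scheduler is rebuilt from its own config with a resolution-dependent `flow_shift`. A minimal standalone sketch of just that step, reusing the same model id and the 480p value from the code above (everything else here is illustrative, not the service code itself):

import torch
from diffusers import AutoencoderKLWan, WanPipeline
from diffusers.schedulers.scheduling_unipc_multistep import UniPCMultistepScheduler

model_id = "Wan-AI/Wan2.1-T2V-1.3B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanPipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)

# Rebuild the scheduler from its own config, overriding only flow_shift
# (the Preview tab uses 3.0 for the 480p preset and 5.0 for 720p).
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=3.0)
pipe.to("cuda")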
225
+ def generate_ltx_video(
226
+ self,
227
+ prompt: str,
228
+ negative_prompt: str,
229
+ width: int,
230
+ height: int,
231
+ num_frames: int,
232
+ guidance_scale: float,
233
+ flow_shift: float,
234
+ lora_path: str,
235
+ lora_weight: float,
236
+ inference_steps: int,
237
+ enable_cpu_offload: bool,
238
+ fps: int,
239
+ log_fn: Callable
240
+ ) -> Tuple[Optional[str], str, str]:
241
+ """Generate video using LTX model"""
242
+ start_time = torch.cuda.Event(enable_timing=True)
243
+ end_time = torch.cuda.Event(enable_timing=True)
244
+
245
+ try:
246
+ import torch
247
+ from diffusers import LTXPipeline
248
+ from diffusers.utils import export_to_video
249
+
250
+ log_fn("Importing LTX model components...")
251
+
252
+ model_id = "Lightricks/LTX-Video"
253
+
254
+ log_fn(f"Loading pipeline from {model_id}...")
255
+ pipe = LTXPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
256
+
257
+ log_fn("Moving pipeline to CUDA device...")
258
+ pipe.to("cuda")
259
+
260
+ if enable_cpu_offload:
261
+ log_fn("Enabling model CPU offload...")
262
+ pipe.enable_model_cpu_offload()
263
+
264
+ log_fn(f"Loading LoRA weights from {lora_path} with weight {lora_weight}...")
265
+ pipe.load_lora_weights(lora_path)
266
+ pipe.fuse_lora(lora_weight)
267
+
268
+ # Create temporary file for the output
269
+ with tempfile.NamedTemporaryFile(suffix='.mp4', delete=False) as temp_file:
270
+ output_path = temp_file.name
271
+
272
+ log_fn("Starting video generation...")
273
+ start_time.record()
274
+
275
+ video = pipe(
276
+ prompt=prompt,
277
+ negative_prompt=negative_prompt,
278
+ height=height,
279
+ width=width,
280
+ num_frames=num_frames,
281
+ guidance_scale=guidance_scale,
282
+ decode_timestep=0.03,
283
+ decode_noise_scale=0.025,
284
+ num_inference_steps=inference_steps,
285
+ ).frames[0]
286
+
287
+ end_time.record()
288
+ torch.cuda.synchronize()
289
+ generation_time = start_time.elapsed_time(end_time) / 1000 # Convert to seconds
290
+
291
+ log_fn(f"Video generation completed in {format_time(generation_time)}")
292
+ log_fn(f"Exporting video to {output_path}...")
293
+
294
+ export_to_video(video, output_path, fps=fps)
295
+
296
+ log_fn("Video generation and export completed successfully!")
297
+
298
+ # Clean up CUDA memory
299
+ pipe = None
300
+ torch.cuda.empty_cache()
301
+
302
+ return output_path, "Video generated successfully!", log_fn(f"Generation completed in {format_time(generation_time)}")
303
+
304
+ except Exception as e:
305
+ log_fn(f"Error generating video with LTX: {str(e)}")
306
+ # Clean up CUDA memory
307
+ torch.cuda.empty_cache()
308
+ return None, f"Error: {str(e)}", log_fn(f"Exception occurred: {str(e)}")
309
+
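All three branches share the same LoRA step: load the trained weights, then fuse them into the base model at a chosen strength before inference. A hedged sketch of that step on the LTX pipeline from the section above; the path and strength are placeholders for the values the service receives, and the fuse strength is passed via the `lora_scale` keyword that recent diffusers releases expose:

import torch
from diffusers import LTXPipeline

pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16).to("cuda")

# Placeholder path and strength standing in for lora_path / lora_weight.
lora_path = "/path/to/pytorch_lora_weights.safetensors"
lora_weight = 0.7

pipe.load_lora_weights(lora_path)
pipe.fuse_lora(lora_scale=lora_weight)  # fuse the adapter at the chosen strength

# ... run pipe(...) here ...

pipe.unfuse_lora()  # undo the fuse if the pipeline object will be reused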
310
+ def generate_hunyuan_video(
311
+ self,
312
+ prompt: str,
313
+ negative_prompt: str,
314
+ width: int,
315
+ height: int,
316
+ num_frames: int,
317
+ guidance_scale: float,
318
+ flow_shift: float,
319
+ lora_path: str,
320
+ lora_weight: float,
321
+ inference_steps: int,
322
+ enable_cpu_offload: bool,
323
+ fps: int,
324
+ log_fn: Callable
325
+ ) -> Tuple[Optional[str], str, str]:
326
+ """Generate video using HunyuanVideo model"""
327
+ start_time = torch.cuda.Event(enable_timing=True)
328
+ end_time = torch.cuda.Event(enable_timing=True)
329
+
330
+ try:
331
+ import torch
332
+ from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel, AutoencoderKLHunyuanVideo
333
+ from diffusers.utils import export_to_video
334
+
335
+ log_fn("Importing HunyuanVideo model components...")
336
+
337
+ model_id = "hunyuanvideo-community/HunyuanVideo"
338
+
339
+ log_fn(f"Loading transformer from {model_id}...")
340
+ transformer = HunyuanVideoTransformer3DModel.from_pretrained(
341
+ model_id,
342
+ subfolder="transformer",
343
+ torch_dtype=torch.bfloat16
344
+ )
345
+
346
+ log_fn(f"Loading pipeline from {model_id}...")
347
+ pipe = HunyuanVideoPipeline.from_pretrained(
348
+ model_id,
349
+ transformer=transformer,
350
+ torch_dtype=torch.float16
351
+ )
352
+
353
+ log_fn("Enabling VAE tiling for better memory usage...")
354
+ pipe.vae.enable_tiling()
355
+
356
+ log_fn("Moving pipeline to CUDA device...")
357
+ pipe.to("cuda")
358
+
359
+ if enable_cpu_offload:
360
+ log_fn("Enabling model CPU offload...")
361
+ pipe.enable_model_cpu_offload()
362
+
363
+ log_fn(f"Loading LoRA weights from {lora_path} with weight {lora_weight}...")
364
+ pipe.load_lora_weights(lora_path)
365
+ pipe.fuse_lora(lora_weight)
366
+
367
+ # Create temporary file for the output
368
+ with tempfile.NamedTemporaryFile(suffix='.mp4', delete=False) as temp_file:
369
+ output_path = temp_file.name
370
+
371
+ log_fn("Starting video generation...")
372
+ start_time.record()
373
+
374
+ output = pipe(
375
+ prompt=prompt,
376
+ negative_prompt=negative_prompt if negative_prompt else None,
377
+ height=height,
378
+ width=width,
379
+ num_frames=num_frames,
380
+ guidance_scale=guidance_scale,
381
+ true_cfg_scale=1.0,
382
+ num_inference_steps=inference_steps,
383
+ ).frames[0]
384
+
385
+ end_time.record()
386
+ torch.cuda.synchronize()
387
+ generation_time = start_time.elapsed_time(end_time) / 1000 # Convert to seconds
388
+
389
+ log_fn(f"Video generation completed in {format_time(generation_time)}")
390
+ log_fn(f"Exporting video to {output_path}...")
391
+
392
+ export_to_video(output, output_path, fps=fps)
393
+
394
+ log_fn("Video generation and export completed successfully!")
395
+
396
+ # Clean up CUDA memory
397
+ pipe = None
398
+ torch.cuda.empty_cache()
399
+
400
+ return output_path, "Video generated successfully!", log_fn(f"Generation completed in {format_time(generation_time)}")
401
+
402
+ except Exception as e:
403
+ log_fn(f"Error generating video with HunyuanVideo: {str(e)}")
404
+ # Clean up CUDA memory
405
+ torch.cuda.empty_cache()
406
+ return None, f"Error: {str(e)}", log_fn(f"Exception occurred: {str(e)}")
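Each generator times the run with CUDA events rather than wall-clock time. Stripped of the service plumbing, the pattern is the following sketch (it assumes a CUDA device is available; the placeholder comment stands in for the pipeline call):

import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
# ... GPU work goes here, e.g. the pipeline call ...
end.record()

# Event timestamps are recorded asynchronously on the CUDA stream,
# so synchronize before reading the timer.
torch.cuda.synchronize()
elapsed_seconds = start.elapsed_time(end) / 1000  # elapsed_time() returns milliseconds
print(f"generation took {elapsed_seconds:.1f}s")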
vms/services/{splitter.py β†’ splitting.py} RENAMED
File without changes
vms/services/{trainer.py β†’ training.py} RENAMED
File without changes
vms/tabs/caption_tab.py CHANGED
@@ -224,8 +224,8 @@ class CaptionTab(BaseTab):
224
  self._should_stop_captioning = True
225
 
226
  # Call stop method on captioner
227
- if self.app.captioner:
228
- self.app.captioner.stop_captioning()
229
 
230
  # Get updated file list
231
  updated_list = self.list_training_files_to_caption()
@@ -286,7 +286,7 @@ class CaptionTab(BaseTab):
286
  file_statuses = {}
287
 
288
  # Start the actual captioning process
289
- async for rows in self.app.captioner.start_caption_generation(captioning_bot_instructions, prompt_prefix):
290
  # Update our tracking of file statuses
291
  for name, status in rows:
292
  file_statuses[name] = status
@@ -516,7 +516,7 @@ class CaptionTab(BaseTab):
516
  # Use the original file path stored during selection instead of the temporary preview paths
517
  if original_file_path:
518
  file_path = Path(original_file_path)
519
- self.app.captioner.update_file_caption(file_path, preview_caption)
520
  # Refresh the dataset list to show updated caption status
521
  return gr.update(value="Caption saved successfully!")
522
  else:
 
224
  self._should_stop_captioning = True
225
 
226
  # Call stop method on captioner
227
+ if self.app.captioning:
228
+ self.app.captioning.stop_captioning()
229
 
230
  # Get updated file list
231
  updated_list = self.list_training_files_to_caption()
 
286
  file_statuses = {}
287
 
288
  # Start the actual captioning process
289
+ async for rows in self.app.captioning.start_caption_generation(captioning_bot_instructions, prompt_prefix):
290
  # Update our tracking of file statuses
291
  for name, status in rows:
292
  file_statuses[name] = status
 
516
  # Use the original file path stored during selection instead of the temporary preview paths
517
  if original_file_path:
518
  file_path = Path(original_file_path)
519
+ self.app.captioning.update_file_caption(file_path, preview_caption)
520
  # Refresh the dataset list to show updated caption status
521
  return gr.update(value="Caption saved successfully!")
522
  else:
vms/tabs/import_tab/hub_tab.py CHANGED
@@ -168,7 +168,7 @@ class HubTab(BaseTab):
168
  """Search datasets on the Hub matching the query"""
169
  try:
170
  logger.info(f"Searching for datasets with query: '{query}'")
171
- results_full = self.app.importer.search_datasets(query)
172
 
173
  # Extract just the first column (dataset IDs) for display
174
  results = [[row[0]] for row in results_full]
@@ -199,7 +199,7 @@ class HubTab(BaseTab):
199
  logger.info(f"Getting dataset info for: {dataset_id}")
200
 
201
  # Use the importer service to get dataset info
202
- info_text, file_counts, _ = self.app.importer.get_dataset_info(dataset_id)
203
 
204
  # Get counts of each file type
205
  video_count = file_counts.get("video", 0)
@@ -247,7 +247,7 @@ class HubTab(BaseTab):
247
  progress_callback(fraction, desc=desc)
248
 
249
  # Call the actual download function with our adapter
250
- result = await self.app.importer.download_file_group(
251
  dataset_id,
252
  file_type,
253
  enable_splitting,
 
168
  """Search datasets on the Hub matching the query"""
169
  try:
170
  logger.info(f"Searching for datasets with query: '{query}'")
171
+ results_full = self.app.importing.search_datasets(query)
172
 
173
  # Extract just the first column (dataset IDs) for display
174
  results = [[row[0]] for row in results_full]
 
199
  logger.info(f"Getting dataset info for: {dataset_id}")
200
 
201
  # Use the importer service to get dataset info
202
+ info_text, file_counts, _ = self.app.importing.get_dataset_info(dataset_id)
203
 
204
  # Get counts of each file type
205
  video_count = file_counts.get("video", 0)
 
247
  progress_callback(fraction, desc=desc)
248
 
249
  # Call the actual download function with our adapter
250
+ result = await self.app.importing.download_file_group(
251
  dataset_id,
252
  file_type,
253
  enable_splitting,
vms/tabs/import_tab/import_tab.py CHANGED
@@ -89,7 +89,7 @@ class ImportTab(BaseTab):
89
 
90
  # If scene detection isn't already running and there are videos to process,
91
  # and auto-splitting is enabled, start the detection
92
- if videos and not self.app.splitter.is_processing() and enable_splitting:
93
  # Start the scene detection in a separate thread
94
  self._start_scene_detection_bg(enable_splitting)
95
  msg = "Starting automatic scene detection..."
@@ -133,7 +133,7 @@ class ImportTab(BaseTab):
133
  try:
134
  async def copy_files():
135
  for video_file in VIDEOS_TO_SPLIT_PATH.glob("*.mp4"):
136
- await self.app.splitter.process_video(video_file, enable_splitting=False)
137
 
138
  loop.run_until_complete(copy_files())
139
  except Exception as e:
 
89
 
90
  # If scene detection isn't already running and there are videos to process,
91
  # and auto-splitting is enabled, start the detection
92
+ if videos and not self.app.splitting.is_processing() and enable_splitting:
93
  # Start the scene detection in a separate thread
94
  self._start_scene_detection_bg(enable_splitting)
95
  msg = "Starting automatic scene detection..."
 
133
  try:
134
  async def copy_files():
135
  for video_file in VIDEOS_TO_SPLIT_PATH.glob("*.mp4"):
136
+ await self.app.splitting.process_video(video_file, enable_splitting=False)
137
 
138
  loop.run_until_complete(copy_files())
139
  except Exception as e:
vms/tabs/import_tab/upload_tab.py CHANGED
@@ -53,7 +53,7 @@ class UploadTab(BaseTab):
53
  """Connect event handlers to UI components"""
54
  # File upload event
55
  self.components["files"].upload(
56
- fn=lambda x: self.app.importer.process_uploaded_files(x),
57
  inputs=[self.components["files"]],
58
  outputs=[self.components["import_status"]] # This comes from parent tab
59
  ).success(
 
53
  """Connect event handlers to UI components"""
54
  # File upload event
55
  self.components["files"].upload(
56
+ fn=lambda x: self.app.importing.process_uploaded_files(x),
57
  inputs=[self.components["files"]],
58
  outputs=[self.components["import_status"]] # This comes from parent tab
59
  ).success(
vms/tabs/import_tab/youtube_tab.py CHANGED
@@ -49,7 +49,7 @@ class YouTubeTab(BaseTab):
49
  """Connect event handlers to UI components"""
50
  # YouTube download event
51
  self.components["youtube_download_btn"].click(
52
- fn=self.app.importer.download_youtube_video,
53
  inputs=[self.components["youtube_url"]],
54
  outputs=[self.components["import_status"]] # This comes from parent tab
55
  ).success(
 
49
  """Connect event handlers to UI components"""
50
  # YouTube download event
51
  self.components["youtube_download_btn"].click(
52
+ fn=self.app.importing.download_youtube_video,
53
  inputs=[self.components["youtube_url"]],
54
  outputs=[self.components["import_status"]] # This comes from parent tab
55
  ).success(
vms/tabs/manage_tab.py CHANGED
@@ -23,7 +23,7 @@ class ManageTab(BaseTab):
23
  def __init__(self, app_state):
24
  super().__init__(app_state)
25
  self.id = "manage_tab"
26
- self.title = "6️⃣ Manage"
27
 
28
  def create(self, parent=None) -> gr.TabItem:
29
  """Create the Manage tab UI components"""
@@ -90,12 +90,12 @@ class ManageTab(BaseTab):
90
 
91
  # Download buttons
92
  self.components["download_dataset_btn"].click(
93
- fn=self.app.trainer.create_training_dataset_zip,
94
  outputs=[self.components["download_dataset_btn"]]
95
  )
96
 
97
  self.components["download_model_btn"].click(
98
- fn=self.app.trainer.get_model_output_safetensors,
99
  outputs=[self.components["download_model_btn"]]
100
  )
101
 
@@ -139,11 +139,11 @@ class ManageTab(BaseTab):
139
  return f"Error: {validation['error']}"
140
 
141
  # Check if we have a model to upload
142
- if not self.app.trainer.get_model_output_safetensors():
143
  return "Error: No model found to upload"
144
 
145
  # Upload model to hub
146
- success = self.app.trainer.upload_to_hub(OUTPUT_PATH, repo_id)
147
 
148
  if success:
149
  return f"Successfully uploaded model to {repo_id}"
@@ -184,25 +184,25 @@ class ManageTab(BaseTab):
184
 
185
  try:
186
  # Stop training if running
187
- if self.app.trainer.is_training_running():
188
- training_result = self.app.trainer.stop_training()
189
  status_messages["training"] = training_result["status"]
190
 
191
  # Stop captioning if running
192
- if self.app.captioner:
193
- self.app.captioner.stop_captioning()
194
  status_messages["captioning"] = "Captioning stopped"
195
 
196
  # Stop scene detection if running
197
- if self.app.splitter.is_processing():
198
- self.app.splitter.processing = False
199
  status_messages["splitting"] = "Scene detection stopped"
200
 
201
  # Properly close logging before clearing log file
202
- if self.app.trainer.file_handler:
203
- self.app.trainer.file_handler.close()
204
- logger.removeHandler(self.app.trainer.file_handler)
205
- self.app.trainer.file_handler = None
206
 
207
  if LOG_FILE_PATH.exists():
208
  LOG_FILE_PATH.unlink()
@@ -221,10 +221,10 @@ class ManageTab(BaseTab):
221
 
222
  # Reset any persistent state
223
  self.app.tabs["caption_tab"]._should_stop_captioning = True
224
- self.app.splitter.processing = False
225
 
226
  # Recreate logging setup
227
- self.app.trainer.setup_logging()
228
 
229
  return {
230
  "status": "All processes stopped and data cleared",
 
23
  def __init__(self, app_state):
24
  super().__init__(app_state)
25
  self.id = "manage_tab"
26
+ self.title = "7️⃣ Manage"
27
 
28
  def create(self, parent=None) -> gr.TabItem:
29
  """Create the Manage tab UI components"""
 
90
 
91
  # Download buttons
92
  self.components["download_dataset_btn"].click(
93
+ fn=self.app.training.create_training_dataset_zip,
94
  outputs=[self.components["download_dataset_btn"]]
95
  )
96
 
97
  self.components["download_model_btn"].click(
98
+ fn=self.app.training.get_model_output_safetensors,
99
  outputs=[self.components["download_model_btn"]]
100
  )
101
 
 
139
  return f"Error: {validation['error']}"
140
 
141
  # Check if we have a model to upload
142
+ if not self.app.training.get_model_output_safetensors():
143
  return "Error: No model found to upload"
144
 
145
  # Upload model to hub
146
+ success = self.app.training.upload_to_hub(OUTPUT_PATH, repo_id)
147
 
148
  if success:
149
  return f"Successfully uploaded model to {repo_id}"
 
184
 
185
  try:
186
  # Stop training if running
187
+ if self.app.training.is_training_running():
188
+ training_result = self.app.training.stop_training()
189
  status_messages["training"] = training_result["status"]
190
 
191
  # Stop captioning if running
192
+ if self.app.captioning:
193
+ self.app.captioning.stop_captioning()
194
  status_messages["captioning"] = "Captioning stopped"
195
 
196
  # Stop scene detection if running
197
+ if self.app.splitting.is_processing():
198
+ self.app.splitting.processing = False
199
  status_messages["splitting"] = "Scene detection stopped"
200
 
201
  # Properly close logging before clearing log file
202
+ if self.app.training.file_handler:
203
+ self.app.training.file_handler.close()
204
+ logger.removeHandler(self.app.training.file_handler)
205
+ self.app.training.file_handler = None
206
 
207
  if LOG_FILE_PATH.exists():
208
  LOG_FILE_PATH.unlink()
 
221
 
222
  # Reset any persistent state
223
  self.app.tabs["caption_tab"]._should_stop_captioning = True
224
+ self.app.splitting.processing = False
225
 
226
  # Recreate logging setup
227
+ self.app.training.setup_logging()
228
 
229
  return {
230
  "status": "All processes stopped and data cleared",
vms/tabs/monitor_tab.py CHANGED
@@ -140,8 +140,8 @@ class MonitorTab(BaseTab):
140
  def on_enter(self):
141
  """Called when the tab is selected"""
142
  # Start monitoring service if not already running
143
- if not self.app.monitor.is_running:
144
- self.app.monitor.start_monitoring()
145
 
146
  # Trigger initial refresh
147
  return self.refresh_all()
@@ -178,7 +178,7 @@ class MonitorTab(BaseTab):
178
  """
179
  try:
180
  # Get system info
181
- system_info = self.app.monitor.get_system_info()
182
 
183
  # Split system info into separate components
184
  system_info_html = self.format_system_info(system_info)
@@ -187,13 +187,13 @@ class MonitorTab(BaseTab):
187
  storage_info_html = self.format_storage_info()
188
 
189
  # Get current metrics
190
- # current_metrics = self.app.monitor.get_current_metrics()
191
  metrics_html = "" # self.format_current_metrics(current_metrics)
192
 
193
  # Generate plots
194
- cpu_plot = self.app.monitor.generate_cpu_plot()
195
- memory_plot = self.app.monitor.generate_memory_plot()
196
- #per_core_plot = self.app.monitor.generate_per_core_plot()
197
 
198
  return (
199
  system_info_html,
 
140
  def on_enter(self):
141
  """Called when the tab is selected"""
142
  # Start monitoring service if not already running
143
+ if not self.app.monitoring.is_running:
144
+ self.app.monitoring.start_monitoring()
145
 
146
  # Trigger initial refresh
147
  return self.refresh_all()
 
178
  """
179
  try:
180
  # Get system info
181
+ system_info = self.app.monitoring.get_system_info()
182
 
183
  # Split system info into separate components
184
  system_info_html = self.format_system_info(system_info)
 
187
  storage_info_html = self.format_storage_info()
188
 
189
  # Get current metrics
190
+ # current_metrics = self.app.monitoring.get_current_metrics()
191
  metrics_html = "" # self.format_current_metrics(current_metrics)
192
 
193
  # Generate plots
194
+ cpu_plot = self.app.monitoring.generate_cpu_plot()
195
+ memory_plot = self.app.monitoring.generate_memory_plot()
196
+ #per_core_plot = self.app.monitoring.generate_per_core_plot()
197
 
198
  return (
199
  system_info_html,
vms/tabs/preview_tab.py ADDED
@@ -0,0 +1,240 @@
1
+ """
2
+ Preview tab for Video Model Studio UI
3
+ """
4
+
5
+ import gradio as gr
6
+ import logging
7
+ from pathlib import Path
8
+ from typing import Dict, Any, List, Optional, Tuple
9
+
10
+ from vms.services.base_tab import BaseTab
11
+ from vms.config import (
12
+ MODEL_TYPES, DEFAULT_PROMPT_PREFIX
13
+ )
14
+
15
+ logger = logging.getLogger(__name__)
16
+
17
+ class PreviewTab(BaseTab):
18
+ """Preview tab for testing trained models"""
19
+
20
+ def __init__(self, app_state):
21
+ super().__init__(app_state)
22
+ self.id = "preview_tab"
23
+ self.title = "6️⃣ Preview"
24
+
25
+ # Get reference to the preview service from app_state
26
+ self.previewing_service = app_state.previewing
27
+
28
+ def create(self, parent=None) -> gr.TabItem:
29
+ """Create the Preview tab UI components"""
30
+ with gr.TabItem(self.title, id=self.id) as tab:
31
+ with gr.Row():
32
+ gr.Markdown("## Test Your Trained Model")
33
+
34
+ with gr.Row():
35
+ with gr.Column(scale=2):
36
+ self.components["prompt"] = gr.Textbox(
37
+ label="Prompt",
38
+ placeholder="Enter your prompt here...",
39
+ lines=3
40
+ )
41
+
42
+ self.components["negative_prompt"] = gr.Textbox(
43
+ label="Negative Prompt",
44
+ placeholder="Enter negative prompt here...",
45
+ lines=3,
46
+ value="worst quality, low quality, blurry, jittery, distorted, ugly, deformed, disfigured, messy background"
47
+ )
48
+
49
+ self.components["prompt_prefix"] = gr.Textbox(
50
+ label="Global Prompt Prefix",
51
+ placeholder="Prefix to add to all prompts",
52
+ value=DEFAULT_PROMPT_PREFIX
53
+ )
54
+
55
+ with gr.Row():
56
+ self.components["model_type"] = gr.Dropdown(
57
+ choices=list(MODEL_TYPES.keys()),
58
+ label="Model Type",
59
+ value=list(MODEL_TYPES.keys())[0]
60
+ )
61
+
62
+ self.components["resolution_preset"] = gr.Dropdown(
63
+ choices=["480p", "720p"],
64
+ label="Resolution Preset",
65
+ value="480p"
66
+ )
67
+
68
+ with gr.Row():
69
+ self.components["width"] = gr.Number(
70
+ label="Width",
71
+ value=832,
72
+ precision=0
73
+ )
74
+
75
+ self.components["height"] = gr.Number(
76
+ label="Height",
77
+ value=480,
78
+ precision=0
79
+ )
80
+
81
+ with gr.Row():
82
+ self.components["num_frames"] = gr.Slider(
83
+ label="Number of Frames",
84
+ minimum=1,
85
+ maximum=257,
86
+ step=8,
87
+ value=49
88
+ )
89
+
90
+ self.components["fps"] = gr.Slider(
91
+ label="FPS",
92
+ minimum=1,
93
+ maximum=60,
94
+ step=1,
95
+ value=16
96
+ )
97
+
98
+ with gr.Row():
99
+ self.components["guidance_scale"] = gr.Slider(
100
+ label="Guidance Scale",
101
+ minimum=1.0,
102
+ maximum=10.0,
103
+ step=0.1,
104
+ value=5.0
105
+ )
106
+
107
+ self.components["flow_shift"] = gr.Slider(
108
+ label="Flow Shift",
109
+ minimum=0.0,
110
+ maximum=10.0,
111
+ step=0.1,
112
+ value=3.0
113
+ )
114
+
115
+ with gr.Row():
116
+ self.components["lora_weight"] = gr.Slider(
117
+ label="LoRA Weight",
118
+ minimum=0.0,
119
+ maximum=1.0,
120
+ step=0.01,
121
+ value=0.7
122
+ )
123
+
124
+ self.components["inference_steps"] = gr.Slider(
125
+ label="Inference Steps",
126
+ minimum=1,
127
+ maximum=100,
128
+ step=1,
129
+ value=30
130
+ )
131
+
132
+ self.components["enable_cpu_offload"] = gr.Checkbox(
133
+ label="Enable Model CPU Offload (for low-VRAM GPUs)",
134
+ value=True
135
+ )
136
+
137
+ self.components["generate_btn"] = gr.Button(
138
+ "Generate Video",
139
+ variant="primary"
140
+ )
141
+
142
+ with gr.Column(scale=3):
143
+ self.components["preview_video"] = gr.Video(
144
+ label="Generated Video",
145
+ interactive=False
146
+ )
147
+
148
+ self.components["status"] = gr.Textbox(
149
+ label="Status",
150
+ interactive=False
151
+ )
152
+
153
+ with gr.Accordion("Log", open=False):
154
+ self.components["log"] = gr.TextArea(
155
+ label="Generation Log",
156
+ interactive=False,
157
+ lines=10
158
+ )
159
+
160
+ return tab
161
+
162
+ def connect_events(self) -> None:
163
+ """Connect event handlers to UI components"""
164
+ # Update resolution when preset changes
165
+ self.components["resolution_preset"].change(
166
+ fn=self.update_resolution,
167
+ inputs=[self.components["resolution_preset"]],
168
+ outputs=[
169
+ self.components["width"],
170
+ self.components["height"],
171
+ self.components["flow_shift"]
172
+ ]
173
+ )
174
+
175
+ # Generate button click
176
+ self.components["generate_btn"].click(
177
+ fn=self.generate_video,
178
+ inputs=[
179
+ self.components["model_type"],
180
+ self.components["prompt"],
181
+ self.components["negative_prompt"],
182
+ self.components["prompt_prefix"],
183
+ self.components["width"],
184
+ self.components["height"],
185
+ self.components["num_frames"],
186
+ self.components["guidance_scale"],
187
+ self.components["flow_shift"],
188
+ self.components["lora_weight"],
189
+ self.components["inference_steps"],
190
+ self.components["enable_cpu_offload"],
191
+ self.components["fps"]
192
+ ],
193
+ outputs=[
194
+ self.components["preview_video"],
195
+ self.components["status"],
196
+ self.components["log"]
197
+ ]
198
+ )
199
+
200
+ def update_resolution(self, preset: str) -> Tuple[int, int, float]:
201
+ """Update resolution and flow shift based on preset"""
202
+ if preset == "480p":
203
+ return 832, 480, 3.0
204
+ elif preset == "720p":
205
+ return 1280, 720, 5.0
206
+ else:
207
+ return 832, 480, 3.0
208
+
209
+ def generate_video(
210
+ self,
211
+ model_type: str,
212
+ prompt: str,
213
+ negative_prompt: str,
214
+ prompt_prefix: str,
215
+ width: int,
216
+ height: int,
217
+ num_frames: int,
218
+ guidance_scale: float,
219
+ flow_shift: float,
220
+ lora_weight: float,
221
+ inference_steps: int,
222
+ enable_cpu_offload: bool,
223
+ fps: int
224
+ ) -> Tuple[Optional[str], str, str]:
225
+ """Handler for generate button click, delegates to preview service"""
226
+ return self.previewing_service.generate_video(
227
+ model_type=model_type,
228
+ prompt=prompt,
229
+ negative_prompt=negative_prompt,
230
+ prompt_prefix=prompt_prefix,
231
+ width=width,
232
+ height=height,
233
+ num_frames=num_frames,
234
+ guidance_scale=guidance_scale,
235
+ flow_shift=flow_shift,
236
+ lora_weight=lora_weight,
237
+ inference_steps=inference_steps,
238
+ enable_cpu_offload=enable_cpu_offload,
239
+ fps=fps
240
+ )
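The resolution preset dropdown drives three components from a single handler: the tuple returned by `update_resolution` is unpacked positionally into the listed outputs. A self-contained sketch of that wiring (variable names are illustrative, not the tab's actual component keys):

import gradio as gr

def update_resolution(preset: str):
    # same mapping as PreviewTab.update_resolution above
    return (1280, 720, 5.0) if preset == "720p" else (832, 480, 3.0)

with gr.Blocks() as demo:
    preset = gr.Dropdown(choices=["480p", "720p"], value="480p", label="Resolution Preset")
    width = gr.Number(value=832, precision=0, label="Width")
    height = gr.Number(value=480, precision=0, label="Height")
    flow_shift = gr.Slider(0.0, 10.0, value=3.0, step=0.1, label="Flow Shift")

    # one change event updates all three outputs; the returned tuple is
    # filled into [width, height, flow_shift] in order
    preset.change(fn=update_resolution, inputs=[preset], outputs=[width, height, flow_shift])

demo.launch()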
vms/tabs/split_tab.py CHANGED
@@ -57,7 +57,7 @@ class SplitTab(BaseTab):
57
 
58
  def list_unprocessed_videos(self) -> gr.Dataframe:
59
  """Update list of unprocessed videos"""
60
- videos = self.app.splitter.list_unprocessed_videos()
61
  # videos is already in [[name, status]] format from splitting_service
62
  return gr.Dataframe(
63
  headers=["name", "status"],
@@ -71,11 +71,11 @@ class SplitTab(BaseTab):
71
  Args:
72
  enable_splitting: Whether to split videos into scenes
73
  """
74
- if self.app.splitter.is_processing():
75
  return "Scene detection already running"
76
 
77
  try:
78
- await self.app.splitter.start_processing(enable_splitting)
79
  return "Scene detection completed"
80
  except Exception as e:
81
  return f"Error during scene detection: {str(e)}"
 
57
 
58
  def list_unprocessed_videos(self) -> gr.Dataframe:
59
  """Update list of unprocessed videos"""
60
+ videos = self.app.splitting.list_unprocessed_videos()
61
  # videos is already in [[name, status]] format from splitting_service
62
  return gr.Dataframe(
63
  headers=["name", "status"],
 
71
  Args:
72
  enable_splitting: Whether to split videos into scenes
73
  """
74
+ if self.app.splitting.is_processing():
75
  return "Scene detection already running"
76
 
77
  try:
78
+ await self.app.splitting.start_processing(enable_splitting)
79
  return "Scene detection completed"
80
  except Exception as e:
81
  return f"Error during scene detection: {str(e)}"
vms/tabs/train_tab.py CHANGED
@@ -380,7 +380,7 @@ class TrainTab(BaseTab):
380
 
381
  # Add an event handler for delete_checkpoints_btn
382
  self.components["delete_checkpoints_btn"].click(
383
- fn=lambda: self.app.trainer.delete_all_checkpoints(),
384
  outputs=[self.components["status_box"]]
385
  )
386
 
@@ -437,7 +437,7 @@ class TrainTab(BaseTab):
437
 
438
  # Start training (it will automatically use the checkpoint if provided)
439
  try:
440
- return self.app.trainer.start_training(
441
  model_internal_type,
442
  lora_rank,
443
  lora_alpha,
@@ -620,13 +620,13 @@ class TrainTab(BaseTab):
620
 
621
  def get_latest_status_message_and_logs(self) -> Tuple[str, str, str]:
622
  """Get latest status message, log content, and status code in a safer way"""
623
- state = self.app.trainer.get_status()
624
- logs = self.app.trainer.get_logs()
625
 
626
  # Check if training process died unexpectedly
627
  training_died = False
628
 
629
- if state["status"] == "training" and not self.app.trainer.is_training_running():
630
  state["status"] = "error"
631
  state["message"] = "Training process terminated unexpectedly."
632
  training_died = True
@@ -769,16 +769,16 @@ class TrainTab(BaseTab):
769
  status, _, _ = self.get_latest_status_message_and_logs()
770
 
771
  if status == "paused":
772
- self.app.trainer.resume_training()
773
  else:
774
- self.app.trainer.pause_training()
775
 
776
  # Return the updates separately for text and buttons
777
  return (*self.get_status_updates(), *self.get_button_updates())
778
 
779
  def handle_stop(self):
780
  """Handle stop button click"""
781
- self.app.trainer.stop_training()
782
 
783
  # Return the updates separately for text and buttons
784
  return (*self.get_status_updates(), *self.get_button_updates())
 
380
 
381
  # Add an event handler for delete_checkpoints_btn
382
  self.components["delete_checkpoints_btn"].click(
383
+ fn=lambda: self.app.training.delete_all_checkpoints(),
384
  outputs=[self.components["status_box"]]
385
  )
386
 
 
437
 
438
  # Start training (it will automatically use the checkpoint if provided)
439
  try:
440
+ return self.app.training.start_training(
441
  model_internal_type,
442
  lora_rank,
443
  lora_alpha,
 
620
 
621
  def get_latest_status_message_and_logs(self) -> Tuple[str, str, str]:
622
  """Get latest status message, log content, and status code in a safer way"""
623
+ state = self.app.training.get_status()
624
+ logs = self.app.training.get_logs()
625
 
626
  # Check if training process died unexpectedly
627
  training_died = False
628
 
629
+ if state["status"] == "training" and not self.app.training.is_training_running():
630
  state["status"] = "error"
631
  state["message"] = "Training process terminated unexpectedly."
632
  training_died = True
 
769
  status, _, _ = self.get_latest_status_message_and_logs()
770
 
771
  if status == "paused":
772
+ self.app.training.resume_training()
773
  else:
774
+ self.app.training.pause_training()
775
 
776
  # Return the updates separately for text and buttons
777
  return (*self.get_status_updates(), *self.get_button_updates())
778
 
779
  def handle_stop(self):
780
  """Handle stop button click"""
781
+ self.app.training.stop_training()
782
 
783
  # Return the updates separately for text and buttons
784
  return (*self.get_status_updates(), *self.get_button_updates())
vms/ui/video_trainer_ui.py CHANGED
@@ -5,7 +5,7 @@ import logging
5
  import asyncio
6
  from typing import Any, Optional, Dict, List, Union, Tuple
7
 
8
- from ..services import TrainingService, CaptioningService, SplittingService, ImportService, MonitoringService
9
  from ..config import (
10
  STORAGE_PATH, VIDEOS_TO_SPLIT_PATH, STAGING_PATH, OUTPUT_PATH,
11
  TRAINING_PATH, LOG_FILE_PATH, TRAINING_PRESETS, TRAINING_VIDEOS_PATH, MODEL_PATH, OUTPUT_PATH,
@@ -40,17 +40,18 @@ class VideoTrainerUI:
40
  def __init__(self):
41
  """Initialize services and tabs"""
42
  # Initialize core services
43
- self.trainer = TrainingService(self)
44
- self.splitter = SplittingService()
45
- self.importer = ImportService()
46
- self.captioner = CaptioningService()
47
- self.monitor = MonitoringService()
 
48
 
49
  # Start the monitoring service on app creation
50
- self.monitor.start_monitoring()
51
 
52
  # Recovery status from any interrupted training
53
- recovery_result = self.trainer.recover_interrupted_training()
54
  # Add null check for recovery_result
55
  if recovery_result is None:
56
  recovery_result = {"status": "unknown", "ui_updates": {}}
@@ -267,7 +268,7 @@ class VideoTrainerUI:
267
  if ui_state:
268
  current_state = self.load_ui_values()
269
  current_state.update(ui_state)
270
- self.trainer.save_ui_state(current_state)
271
  logger.info(f"Updated UI state from recovery: {ui_state}")
272
 
273
  # Load values (potentially with recovery updates applied)
@@ -384,15 +385,15 @@ class VideoTrainerUI:
384
 
385
  def update_ui_state(self, **kwargs):
386
  """Update UI state with new values"""
387
- current_state = self.trainer.load_ui_state()
388
  current_state.update(kwargs)
389
- self.trainer.save_ui_state(current_state)
390
  # Don't return anything to avoid Gradio warnings
391
  return None
392
 
393
  def load_ui_values(self):
394
  """Load UI state values for initializing form fields"""
395
- ui_state = self.trainer.load_ui_state()
396
 
397
  # Ensure proper type conversion for numeric values
398
  ui_state["lora_rank"] = ui_state.get("lora_rank", DEFAULT_LORA_RANK_STR)
@@ -407,7 +408,7 @@ class VideoTrainerUI:
407
  # Add this new method to get initial button states:
408
  def get_initial_button_states(self):
409
  """Get the initial states for training buttons based on recovery status"""
410
- recovery_result = self.state.get("recovery_result") or self.trainer.recover_interrupted_training()
411
  ui_updates = recovery_result.get("ui_updates", {})
412
 
413
  # Check for checkpoints to determine start button text
@@ -415,7 +416,7 @@ class VideoTrainerUI:
415
 
416
  # Default button states if recovery didn't provide any
417
  if not ui_updates or not ui_updates.get("start_btn"):
418
- is_training = self.trainer.is_training_running()
419
 
420
  if is_training:
421
  # Active training detected
 
5
  import asyncio
6
  from typing import Any, Optional, Dict, List, Union, Tuple
7
 
8
+ from ..services import TrainingService, CaptioningService, SplittingService, ImportingService, PreviewingService, MonitoringService
9
  from ..config import (
10
  STORAGE_PATH, VIDEOS_TO_SPLIT_PATH, STAGING_PATH, OUTPUT_PATH,
11
  TRAINING_PATH, LOG_FILE_PATH, TRAINING_PRESETS, TRAINING_VIDEOS_PATH, MODEL_PATH, OUTPUT_PATH,
 
40
  def __init__(self):
41
  """Initialize services and tabs"""
42
  # Initialize core services
43
+ self.training = TrainingService(self)
44
+ self.splitting = SplittingService()
45
+ self.importing = ImportingService()
46
+ self.captioning = CaptioningService()
47
+ self.monitoring = MonitoringService()
48
+ self.previewing = PreviewingService()
49
 
50
  # Start the monitoring service on app creation
51
+ self.monitoring.start_monitoring()
52
 
53
  # Recovery status from any interrupted training
54
+ recovery_result = self.training.recover_interrupted_training()
55
  # Add null check for recovery_result
56
  if recovery_result is None:
57
  recovery_result = {"status": "unknown", "ui_updates": {}}
 
268
  if ui_state:
269
  current_state = self.load_ui_values()
270
  current_state.update(ui_state)
271
+ self.training.save_ui_state(current_state)
272
  logger.info(f"Updated UI state from recovery: {ui_state}")
273
 
274
  # Load values (potentially with recovery updates applied)
 
385
 
386
  def update_ui_state(self, **kwargs):
387
  """Update UI state with new values"""
388
+ current_state = self.training.load_ui_state()
389
  current_state.update(kwargs)
390
+ self.training.save_ui_state(current_state)
391
  # Don't return anything to avoid Gradio warnings
392
  return None
393
 
394
  def load_ui_values(self):
395
  """Load UI state values for initializing form fields"""
396
+ ui_state = self.training.load_ui_state()
397
 
398
  # Ensure proper type conversion for numeric values
399
  ui_state["lora_rank"] = ui_state.get("lora_rank", DEFAULT_LORA_RANK_STR)
 
408
  # Add this new method to get initial button states:
409
  def get_initial_button_states(self):
410
  """Get the initial states for training buttons based on recovery status"""
411
+ recovery_result = self.state.get("recovery_result") or self.training.recover_interrupted_training()
412
  ui_updates = recovery_result.get("ui_updates", {})
413
 
414
  # Check for checkpoints to determine start button text
 
416
 
417
  # Default button states if recovery didn't provide any
418
  if not ui_updates or not ui_updates.get("start_btn"):
419
+ is_training = self.training.is_training_running()
420
 
421
  if is_training:
422
  # Active training detected