Diffusers documentation


Models

Diffusers contains pretrained models for popular algorithms and modules for creating the next set of diffusion models. The primary function of these models is to denoise an input sample by modeling the distribution p_θ(x_{t−1}|x_t). The models are built on the base class ModelMixin, which is a torch.nn.Module with basic functionality for saving and loading models both locally and from the Hugging Face Hub.

ModelMixin

class diffusers.ModelMixin


( )

Base class for all models.

ModelMixin takes care of storing the configuration of the models and handles methods for loading, downloading and saving models.

  • config_name (str) — A filename under which the model should be stored when calling save_pretrained().

disable_gradient_checkpointing


( )

Deactivates gradient checkpointing for the current model.

Note that in other frameworks this feature can be referred to as “activation checkpointing” or “checkpoint activations”.

disable_xformers_memory_efficient_attention


( )

Disable memory efficient attention as implemented in xformers.

enable_gradient_checkpointing


( )

Activates gradient checkpointing for the current model.

Note that in other frameworks this feature can be referred to as “activation checkpointing” or “checkpoint activations”.

enable_xformers_memory_efficient_attention


( attention_op: typing.Optional[typing.Callable] = None )

Parameters

Enable memory efficient attention as implemented in xformers.

When this option is enabled, you should observe lower GPU memory usage and a potential speed up at inference time. Speed up at training time is not guaranteed.

Warning: when memory efficient attention and sliced attention are both enabled, memory efficient attention is used.

Examples:
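
A minimal sketch of typical usage, assuming the xformers package is installed; the checkpoint name is illustrative:

```python
import torch
from diffusers import UNet2DConditionModel

# Illustrative checkpoint; any ModelMixin subclass exposes the same method.
model = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet", torch_dtype=torch.float16
)
model = model.to("cuda")

# Requires xformers; pass attention_op to override the default attention operator.
model.enable_xformers_memory_efficient_attention()
```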

from_pretrained

( pretrained_model_name_or_path: typing.Union[str, os.PathLike, NoneType], **kwargs )

Parameters

Instantiate a pretrained PyTorch model from a pre-trained model configuration.

The model is set in evaluation mode by default using model.eval() (Dropout modules are deactivated). To train the model, you should first set it back in training mode with model.train().

The warning Weights from XXX not initialized from pretrained model means that the weights of XXX do not come pretrained with the rest of the model. It is up to you to train those weights with a downstream fine-tuning task.

The warning Weights from XXX not used in YYY means that the layer XXX is not used by YYY, therefore those weights are discarded.

You must be logged in (huggingface-cli login) to use private or gated models.

Activate the special “offline-mode” to use this method in a firewalled environment.
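
As a sketch (the repository name and subfolder are illustrative, not part of the API):

```python
from diffusers import UNet2DConditionModel

# Download (or load from the local cache) the UNet weights of a hosted checkpoint.
# The model is returned in evaluation mode; call .train() before fine-tuning.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
```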

num_parameters

( only_trainable: bool = False, exclude_embeddings: bool = False )

Parameters

Returns

int

Get number of (optionally, trainable or non-embeddings) parameters in the module.
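
A quick sketch:

```python
from diffusers import UNet2DModel

model = UNet2DModel()  # randomly initialized with the default configuration
print(model.num_parameters())                     # all parameters
print(model.num_parameters(only_trainable=True))  # trainable parameters only
```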

save_pretrained

( save_directory: typing.Union[str, os.PathLike], is_main_process: bool = True, save_function: typing.Callable = None, safe_serialization: bool = False, variant: typing.Optional[str] = None )

Parameters

Save a model and its configuration file to a directory, so that it can be re-loaded using the [from_pretrained()](/docs/diffusers/main/en/api/models#diffusers.ModelMixin.from_pretrained) class method.
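
A sketch of saving and reloading a model locally (the directory path is arbitrary):

```python
from diffusers import UNet2DModel

model = UNet2DModel()  # any ModelMixin subclass; randomly initialized here for illustration

# Writes the config and the weights (as safetensors here) to the directory.
model.save_pretrained("./my-unet", safe_serialization=True)

# The directory can then be passed back to from_pretrained().
reloaded = UNet2DModel.from_pretrained("./my-unet")
```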

UNet2DOutput

class diffusers.models.unet_2d.UNet2DOutput


( sample: FloatTensor )

Parameters

UNet2DModel

class diffusers.UNet2DModel

( sample_size: typing.Union[int, typing.Tuple[int, int], NoneType] = None, in_channels: int = 3, out_channels: int = 3, center_input_sample: bool = False, time_embedding_type: str = 'positional', freq_shift: int = 0, flip_sin_to_cos: bool = True, down_block_types: typing.Tuple[str] = ('DownBlock2D', 'AttnDownBlock2D', 'AttnDownBlock2D', 'AttnDownBlock2D'), up_block_types: typing.Tuple[str] = ('AttnUpBlock2D', 'AttnUpBlock2D', 'AttnUpBlock2D', 'UpBlock2D'), block_out_channels: typing.Tuple[int] = (224, 448, 672, 896), layers_per_block: int = 2, mid_block_scale_factor: float = 1, downsample_padding: int = 1, act_fn: str = 'silu', attention_head_dim: typing.Optional[int] = 8, norm_num_groups: int = 32, norm_eps: float = 1e-05, resnet_time_scale_shift: str = 'default', add_attention: bool = True, class_embed_type: typing.Optional[str] = None, num_class_embeds: typing.Optional[int] = None )

Parameters

UNet2DModel is a 2D UNet model that takes a noisy sample and a timestep and returns a sample-shaped output.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all models (such as downloading or saving, etc.)

forward

( sample: FloatTensor, timestep: typing.Union[torch.Tensor, float, int], class_labels: typing.Optional[torch.Tensor] = None, return_dict: bool = True )

Parameters

Returns

UNet2DOutput or tuple
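
A sketch of a single denoising call on random data; the configuration and shapes are illustrative:

```python
import torch
from diffusers import UNet2DModel

model = UNet2DModel(sample_size=64, in_channels=3, out_channels=3)

noisy_sample = torch.randn(1, 3, 64, 64)  # (batch, channels, height, width)
timestep = torch.tensor([10])

with torch.no_grad():
    out = model(noisy_sample, timestep)   # UNet2DOutput when return_dict=True
print(out.sample.shape)                    # torch.Size([1, 3, 64, 64])
```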

UNet1DOutput

class diffusers.models.unet_1d.UNet1DOutput


( sample: FloatTensor )

Parameters

UNet1DModel

class diffusers.UNet1DModel

( sample_size: int = 65536, sample_rate: typing.Optional[int] = None, in_channels: int = 2, out_channels: int = 2, extra_in_channels: int = 0, time_embedding_type: str = 'fourier', flip_sin_to_cos: bool = True, use_timestep_embedding: bool = False, freq_shift: float = 0.0, down_block_types: typing.Tuple[str] = ('DownBlock1DNoSkip', 'DownBlock1D', 'AttnDownBlock1D'), up_block_types: typing.Tuple[str] = ('AttnUpBlock1D', 'UpBlock1D', 'UpBlock1DNoSkip'), mid_block_type: typing.Tuple[str] = 'UNetMidBlock1D', out_block_type: str = None, block_out_channels: typing.Tuple[int] = (32, 32, 64), act_fn: str = None, norm_num_groups: int = 8, layers_per_block: int = 1, downsample_each_block: bool = False )

Parameters

UNet1DModel is a 1D UNet model that takes a noisy sample and a timestep and returns a sample-shaped output.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all models (such as downloading or saving, etc.)

forward

( sample: FloatTensor, timestep: typing.Union[torch.Tensor, float, int], return_dict: bool = True )

Parameters

Returns

UNet1DOutput or tuple

UNet2DConditionOutput

class diffusers.models.unet_2d_condition.UNet2DConditionOutput


( sample: FloatTensor )

Parameters

UNet2DConditionModel

class diffusers.UNet2DConditionModel

( sample_size: typing.Optional[int] = None, in_channels: int = 4, out_channels: int = 4, center_input_sample: bool = False, flip_sin_to_cos: bool = True, freq_shift: int = 0, down_block_types: typing.Tuple[str] = ('CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'DownBlock2D'), mid_block_type: typing.Optional[str] = 'UNetMidBlock2DCrossAttn', up_block_types: typing.Tuple[str] = ('UpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D'), only_cross_attention: typing.Union[bool, typing.Tuple[bool]] = False, block_out_channels: typing.Tuple[int] = (320, 640, 1280, 1280), layers_per_block: typing.Union[int, typing.Tuple[int]] = 2, downsample_padding: int = 1, mid_block_scale_factor: float = 1, act_fn: str = 'silu', norm_num_groups: typing.Optional[int] = 32, norm_eps: float = 1e-05, cross_attention_dim: typing.Union[int, typing.Tuple[int]] = 1280, encoder_hid_dim: typing.Optional[int] = None, encoder_hid_dim_type: typing.Optional[str] = None, attention_head_dim: typing.Union[int, typing.Tuple[int]] = 8, num_attention_heads: typing.Union[int, typing.Tuple[int], NoneType] = None, dual_cross_attention: bool = False, use_linear_projection: bool = False, class_embed_type: typing.Optional[str] = None, addition_embed_type: typing.Optional[str] = None, num_class_embeds: typing.Optional[int] = None, upcast_attention: bool = False, resnet_time_scale_shift: str = 'default', resnet_skip_time_act: bool = False, resnet_out_scale_factor: int = 1.0, time_embedding_type: str = 'positional', time_embedding_dim: typing.Optional[int] = None, time_embedding_act_fn: typing.Optional[str] = None, timestep_post_act: typing.Optional[str] = None, time_cond_proj_dim: typing.Optional[int] = None, conv_in_kernel: int = 3, conv_out_kernel: int = 3, projection_class_embeddings_input_dim: typing.Optional[int] = None, class_embeddings_concat: bool = False, mid_block_only_cross_attention: typing.Optional[bool] = None, cross_attention_norm: typing.Optional[str] = None, addition_embed_type_num_heads = 64 )

Parameters

UNet2DConditionModel is a conditional 2D UNet model that takes a noisy sample, conditional state, and a timestep and returns a sample-shaped output.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all the models (such as downloading or saving, etc.)

forward

( sample: FloatTensor, timestep: typing.Union[torch.Tensor, float, int], encoder_hidden_states: Tensor, class_labels: typing.Optional[torch.Tensor] = None, timestep_cond: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None, added_cond_kwargs: typing.Union[typing.Dict[str, torch.Tensor], NoneType] = None, down_block_additional_residuals: typing.Optional[typing.Tuple[torch.Tensor]] = None, mid_block_additional_residual: typing.Optional[torch.Tensor] = None, encoder_attention_mask: typing.Optional[torch.Tensor] = None, return_dict: bool = True )

Parameters

Returns

UNet2DConditionOutput or tuple
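
A sketch of a single conditioned forward pass; the checkpoint name and the dummy text-embedding shape (1, 77, 768), which matches the SD 1.x convention, are illustrative:

```python
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

latents = torch.randn(1, 4, 64, 64)              # noisy latent sample
timestep = torch.tensor([10])
encoder_hidden_states = torch.randn(1, 77, 768)  # stand-in for CLIP text embeddings

with torch.no_grad():
    out = unet(latents, timestep, encoder_hidden_states)
print(out.sample.shape)  # torch.Size([1, 4, 64, 64])
```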

set_attention_slice


( slice_size )

Parameters

Enable sliced attention computation.

When this option is enabled, the attention module splits the input tensor into slices to compute attention in several steps. This is useful for saving some memory in exchange for a small speed decrease.
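
For example (a sketch; the checkpoint is illustrative):

```python
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

# "auto" picks a slice size automatically; an integer slice size can be passed instead.
unet.set_attention_slice("auto")
```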

set_attn_processor


( processor: typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor, typing.Dict[str, typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor]]] )

Parameters

set_default_attn_processor


( )

Disables custom attention processors and sets the default attention implementation.
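
A sketch of swapping the attention processor and then restoring the default; the checkpoint is illustrative and AttnProcessor2_0 requires PyTorch 2.0:

```python
from diffusers import UNet2DConditionModel
from diffusers.models.attention_processor import AttnProcessor2_0

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

# Route every attention layer through PyTorch 2.0 scaled-dot-product attention.
unet.set_attn_processor(AttnProcessor2_0())

# Later, drop the custom processors and restore the library default.
unet.set_default_attn_processor()
```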

UNet3DConditionOutput

class diffusers.models.unet_3d_condition.UNet3DConditionOutput


( sample: FloatTensor )

Parameters

UNet3DConditionModel

class diffusers.UNet3DConditionModel

( sample_size: typing.Optional[int] = None, in_channels: int = 4, out_channels: int = 4, down_block_types: typing.Tuple[str] = ('CrossAttnDownBlock3D', 'CrossAttnDownBlock3D', 'CrossAttnDownBlock3D', 'DownBlock3D'), up_block_types: typing.Tuple[str] = ('UpBlock3D', 'CrossAttnUpBlock3D', 'CrossAttnUpBlock3D', 'CrossAttnUpBlock3D'), block_out_channels: typing.Tuple[int] = (320, 640, 1280, 1280), layers_per_block: int = 2, downsample_padding: int = 1, mid_block_scale_factor: float = 1, act_fn: str = 'silu', norm_num_groups: typing.Optional[int] = 32, norm_eps: float = 1e-05, cross_attention_dim: int = 1024, attention_head_dim: typing.Union[int, typing.Tuple[int]] = 64, num_attention_heads: typing.Union[int, typing.Tuple[int], NoneType] = None )

Parameters

UNet3DConditionModel is a conditional 3D UNet model that takes a noisy sample, conditional state, and a timestep and returns a sample-shaped output.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all the models (such as downloading or saving, etc.)

forward

( sample: FloatTensor, timestep: typing.Union[torch.Tensor, float, int], encoder_hidden_states: Tensor, class_labels: typing.Optional[torch.Tensor] = None, timestep_cond: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None, down_block_additional_residuals: typing.Optional[typing.Tuple[torch.Tensor]] = None, mid_block_additional_residual: typing.Optional[torch.Tensor] = None, return_dict: bool = True )

Parameters

Returns

~models.unet_3d_condition.UNet3DConditionOutput or tuple

set_attention_slice


( slice_size )

Parameters

Enable sliced attention computation.

When this option is enabled, the attention module splits the input tensor into slices to compute attention in several steps. This is useful for saving some memory in exchange for a small speed decrease.

set_attn_processor


( processor: typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor, typing.Dict[str, typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor]]] )

Parameters

set_default_attn_processor


( )

Disables custom attention processors and sets the default attention implementation.

DecoderOutput

class diffusers.models.vae.DecoderOutput


( sample: FloatTensor )

Parameters

Output of decoding method.

VQEncoderOutput

class diffusers.models.vq_model.VQEncoderOutput


( latents: FloatTensor )

Parameters

Output of VQModel encoding method.

VQModel

class diffusers.VQModel

( in_channels: int = 3, out_channels: int = 3, down_block_types: typing.Tuple[str] = ('DownEncoderBlock2D',), up_block_types: typing.Tuple[str] = ('UpDecoderBlock2D',), block_out_channels: typing.Tuple[int] = (64,), layers_per_block: int = 1, act_fn: str = 'silu', latent_channels: int = 3, sample_size: int = 32, num_vq_embeddings: int = 256, norm_num_groups: int = 32, vq_embed_dim: typing.Optional[int] = None, scaling_factor: float = 0.18215, norm_type: str = 'group' )

Parameters

VQ-VAE model from the paper Neural Discrete Representation Learning by Aaron van den Oord, Oriol Vinyals and Koray Kavukcuoglu.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all models (such as downloading or saving, etc.)

forward

( sample: FloatTensor, return_dict: bool = True )

Parameters

AutoencoderKLOutput

class diffusers.models.autoencoder_kl.AutoencoderKLOutput


( latent_dist: DiagonalGaussianDistribution )

Parameters

Output of AutoencoderKL encoding method.

AutoencoderKL

class diffusers.AutoencoderKL

( in_channels: int = 3, out_channels: int = 3, down_block_types: typing.Tuple[str] = ('DownEncoderBlock2D',), up_block_types: typing.Tuple[str] = ('UpDecoderBlock2D',), block_out_channels: typing.Tuple[int] = (64,), layers_per_block: int = 1, act_fn: str = 'silu', latent_channels: int = 4, norm_num_groups: int = 32, sample_size: int = 32, scaling_factor: float = 0.18215 )

Parameters

Variational Autoencoder (VAE) model with KL loss from the paper Auto-Encoding Variational Bayes by Diederik P. Kingma and Max Welling.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all models (such as downloading or saving, etc.)
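
A sketch of encoding an image tensor to latents and decoding it back; the checkpoint name follows the Stable Diffusion layout and is illustrative:

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.randn(1, 3, 512, 512)  # stand-in for a preprocessed image in [-1, 1]

with torch.no_grad():
    posterior = vae.encode(image).latent_dist        # AutoencoderKLOutput.latent_dist
    latents = posterior.sample() * vae.config.scaling_factor
    decoded = vae.decode(latents / vae.config.scaling_factor).sample  # DecoderOutput.sample
print(latents.shape, decoded.shape)  # (1, 4, 64, 64) and (1, 3, 512, 512)
```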

disable_slicing


( )

Disable sliced VAE decoding. If enable_slicing was previously invoked, this method will go back to computing decoding in one step.

disable_tiling


( )

Disable tiled VAE decoding. If enable_tiling was previously invoked, this method will go back to computing decoding in one step.

enable_slicing


( )

Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor into slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.

enable_tiling


( use_tiling: bool = True )

Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in several steps. This is useful to save a large amount of memory and to allow the processing of larger images.
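
A sketch of the memory-saving toggles on a loaded VAE (checkpoint illustrative):

```python
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

vae.enable_slicing()   # decode the batch one image at a time
vae.enable_tiling()    # process large images tile by tile

# ... run vae.encode() / vae.decode() as usual ...

vae.disable_tiling()
vae.disable_slicing()
```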

forward

( sample: FloatTensor, sample_posterior: bool = False, return_dict: bool = True, generator: typing.Optional[torch._C.Generator] = None )

Parameters

set_attn_processor


( processor: typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor, typing.Dict[str, typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor]]] )

Parameters

set_default_attn_processor


( )

Disables custom attention processors and sets the default attention implementation.

tiled_decode

( z: FloatTensor, return_dict: bool = True )

Parameters

Decode a batch of images using a tiled decoder.

tiled_encode

( x: FloatTensor, return_dict: bool = True )

Parameters

Encode a batch of images using a tiled encoder.

Transformer2DModel

class diffusers.Transformer2DModel

( num_attention_heads: int = 16, attention_head_dim: int = 88, in_channels: typing.Optional[int] = None, out_channels: typing.Optional[int] = None, num_layers: int = 1, dropout: float = 0.0, norm_num_groups: int = 32, cross_attention_dim: typing.Optional[int] = None, attention_bias: bool = False, sample_size: typing.Optional[int] = None, num_vector_embeds: typing.Optional[int] = None, patch_size: typing.Optional[int] = None, activation_fn: str = 'geglu', num_embeds_ada_norm: typing.Optional[int] = None, use_linear_projection: bool = False, only_cross_attention: bool = False, upcast_attention: bool = False, norm_type: str = 'layer_norm', norm_elementwise_affine: bool = True )

Parameters

Transformer model for image-like data. Takes either discrete (classes of vector embeddings) or continuous (actual embeddings) inputs.

When input is continuous: First, project the input (aka embedding) and reshape to b, t, d. Then apply standard transformer action. Finally, reshape to image.

When input is discrete: First, input (classes of latent pixels) is converted to embeddings and has positional embeddings applied, see ImagePositionalEmbeddings. Then apply standard transformer action. Finally, predict classes of unnoised image.

Note that it is assumed one of the input classes is the masked latent pixel. The predicted classes of the unnoised image do not contain a prediction for the masked pixel as the unnoised image cannot be masked.
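
A sketch of the continuous-input path on random image-like hidden states; the sizes are illustrative:

```python
import torch
from diffusers import Transformer2DModel

# Continuous input: set in_channels; 8 heads of 64 dims give a 512-dim inner size.
model = Transformer2DModel(
    num_attention_heads=8,
    attention_head_dim=64,
    in_channels=64,
    num_layers=1,
)

hidden_states = torch.randn(1, 64, 32, 32)  # (batch, channels, height, width)

with torch.no_grad():
    out = model(hidden_states)              # Transformer2DModelOutput when return_dict=True
print(out.sample.shape)                      # torch.Size([1, 64, 32, 32])
```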

forward

( hidden_states: Tensor, encoder_hidden_states: typing.Optional[torch.Tensor] = None, timestep: typing.Optional[torch.LongTensor] = None, class_labels: typing.Optional[torch.LongTensor] = None, cross_attention_kwargs: typing.Dict[str, typing.Any] = None, attention_mask: typing.Optional[torch.Tensor] = None, encoder_attention_mask: typing.Optional[torch.Tensor] = None, return_dict: bool = True )

Parameters

Returns

Transformer2DModelOutput or tuple

Transformer2DModelOutput

class diffusers.models.transformer_2d.Transformer2DModelOutput


( sample: FloatTensor )

Parameters

TransformerTemporalModel

class diffusers.models.transformer_temporal.TransformerTemporalModel

( num_attention_heads: int = 16, attention_head_dim: int = 88, in_channels: typing.Optional[int] = None, out_channels: typing.Optional[int] = None, num_layers: int = 1, dropout: float = 0.0, norm_num_groups: int = 32, cross_attention_dim: typing.Optional[int] = None, attention_bias: bool = False, sample_size: typing.Optional[int] = None, activation_fn: str = 'geglu', norm_elementwise_affine: bool = True, double_self_attention: bool = True )

Parameters

Transformer model for video-like data.

forward

( hidden_states, encoder_hidden_states = None, timestep = None, class_labels = None, num_frames = 1, cross_attention_kwargs = None, return_dict: bool = True )

Parameters

Returns

~models.transformer_temporal.TransformerTemporalModelOutput or tuple

TransformerTemporalModelOutput

class diffusers.models.transformer_temporal.TransformerTemporalModelOutput


( sample: FloatTensor )

Parameters

PriorTransformer

class diffusers.PriorTransformer

( num_attention_heads: int = 32, attention_head_dim: int = 64, num_layers: int = 20, embedding_dim: int = 768, num_embeddings = 77, additional_embeddings = 4, dropout: float = 0.0 )

Parameters

The prior transformer from unCLIP is used to predict CLIP image embeddings from CLIP text embeddings. Note that the transformer predicts the image embeddings through a denoising diffusion process.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all the models (such as downloading or saving, etc.)

For more details, see the original paper: https://arxiv.org/abs/2204.06125

forward

( hidden_states, timestep: typing.Union[torch.Tensor, float, int], proj_embedding: FloatTensor, encoder_hidden_states: FloatTensor, attention_mask: typing.Optional[torch.BoolTensor] = None, return_dict: bool = True )

Parameters

Returns

PriorTransformerOutput or tuple

set_attn_processor


( processor: typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor, typing.Dict[str, typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor]]] )

Parameters

set_default_attn_processor


( )

Disables custom attention processors and sets the default attention implementation.

PriorTransformerOutput

class diffusers.models.prior_transformer.PriorTransformerOutput


( predicted_image_embedding: FloatTensor )

Parameters

ControlNetOutput

class diffusers.models.controlnet.ControlNetOutput

( down_block_res_samples: typing.Tuple[torch.Tensor], mid_block_res_sample: Tensor )

ControlNetModel

class diffusers.ControlNetModel

( in_channels: int = 4, conditioning_channels: int = 3, flip_sin_to_cos: bool = True, freq_shift: int = 0, down_block_types: typing.Tuple[str] = ('CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'DownBlock2D'), only_cross_attention: typing.Union[bool, typing.Tuple[bool]] = False, block_out_channels: typing.Tuple[int] = (320, 640, 1280, 1280), layers_per_block: int = 2, downsample_padding: int = 1, mid_block_scale_factor: float = 1, act_fn: str = 'silu', norm_num_groups: typing.Optional[int] = 32, norm_eps: float = 1e-05, cross_attention_dim: int = 1280, attention_head_dim: typing.Union[int, typing.Tuple[int]] = 8, num_attention_heads: typing.Union[int, typing.Tuple[int], NoneType] = None, use_linear_projection: bool = False, class_embed_type: typing.Optional[str] = None, num_class_embeds: typing.Optional[int] = None, upcast_attention: bool = False, resnet_time_scale_shift: str = 'default', projection_class_embeddings_input_dim: typing.Optional[int] = None, controlnet_conditioning_channel_order: str = 'rgb', conditioning_embedding_out_channels: typing.Optional[typing.Tuple[int]] = (16, 32, 96, 256), global_pool_conditions: bool = False )

from_unet

( unet: UNet2DConditionModel, controlnet_conditioning_channel_order: str = 'rgb', conditioning_embedding_out_channels: typing.Optional[typing.Tuple[int]] = (16, 32, 96, 256), load_weights_from_unet: bool = True )

Parameters

Instantiate a ControlNetModel from a UNet2DConditionModel.
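
For example (a sketch; the UNet checkpoint is illustrative):

```python
from diffusers import ControlNetModel, UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

# Create a ControlNet whose encoder blocks are initialized from the UNet weights.
controlnet = ControlNetModel.from_unet(unet, load_weights_from_unet=True)
```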

set_attention_slice


( slice_size )

Parameters

Enable sliced attention computation.

When this option is enabled, the attention module splits the input tensor into slices to compute attention in several steps. This is useful for saving some memory in exchange for a small speed decrease.

set_attn_processor


( processor: typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor, typing.Dict[str, typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor]]] )

Parameters

set_default_attn_processor


( )

Disables custom attention processors and sets the default attention implementation.

FlaxModelMixin

class diffusers.FlaxModelMixin


( )

Base class for all Flax models.

FlaxModelMixin takes care of storing the configuration of the models and handles methods for loading, downloading and saving models.

from_pretrained

( pretrained_model_name_or_path: typing.Union[str, os.PathLike], dtype: dtype = <class 'jax.numpy.float32'>, *model_args, **kwargs )

Parameters

Instantiate a pretrained Flax model from a pre-trained model configuration.

The warning Weights from XXX not initialized from pretrained model means that the weights of XXX do not come pretrained with the rest of the model. It is up to you to train those weights with a downstream fine-tuning task.

The warning Weights from XXX not used in YYY means that the layer XXX is not used by YYY, therefore those weights are discarded.

Examples:
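
A minimal sketch; the repository id is illustrative and a Flax-format checkpoint is assumed (pass from_pt=True to convert PyTorch weights):

```python
from diffusers import FlaxUNet2DConditionModel

# Flax models return the module definition and the weights (a params PyTree) separately.
model, params = FlaxUNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
```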

save_pretrained

( save_directory: typing.Union[str, os.PathLike], params: typing.Union[typing.Dict, flax.core.frozen_dict.FrozenDict], is_main_process: bool = True )

Parameters

Save a model and its configuration file to a directory, so that it can be re-loaded using the [from_pretrained()](/docs/diffusers/main/en/api/models#diffusers.FlaxModelMixin.from_pretrained) class method.

to_bf16

( params: typing.Union[typing.Dict, flax.core.frozen_dict.FrozenDict], mask: typing.Any = None )

Parameters

Cast the floating-point params to jax.numpy.bfloat16. This returns a new params tree and does not cast the params in place.

This method can be used on TPU to explicitly convert the model parameters to bfloat16 precision to do full half-precision training or to save weights in bfloat16 for inference in order to save memory and improve speed.

Examples:
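
A sketch (repository id illustrative):

```python
from diffusers import FlaxUNet2DConditionModel

model, params = FlaxUNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
# Cast the whole params tree to bfloat16, e.g. before running on TPU.
params = model.to_bf16(params)
```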

to_fp16

( params: typing.Union[typing.Dict, flax.core.frozen_dict.FrozenDict], mask: typing.Any = None )

Parameters

Cast the floating-point params to jax.numpy.float16. This returns a new params tree and does not cast the params in place.

This method can be used on GPU to explicitly convert the model parameters to float16 precision to do full half-precision training or to save weights in float16 for inference in order to save memory and improve speed.

Examples:
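
A sketch (repository id illustrative):

```python
from diffusers import FlaxUNet2DConditionModel

model, params = FlaxUNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
# Cast the whole params tree to float16, e.g. for half-precision GPU inference.
params = model.to_fp16(params)
```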

to_fp32

( params: typing.Union[typing.Dict, flax.core.frozen_dict.FrozenDict], mask: typing.Any = None )

Parameters

Cast the floating-point params to jax.numpy.float32. This method can be used to explicitly convert the model parameters to fp32 precision. This returns a new params tree and does not cast the params in place.

Examples:
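
A sketch (repository id illustrative):

```python
from diffusers import FlaxUNet2DConditionModel

model, params = FlaxUNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
params = model.to_fp16(params)  # e.g. previously cast down
params = model.to_fp32(params)  # restore full float32 precision
```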

FlaxUNet2DConditionOutput

class diffusers.models.unet_2d_condition_flax.FlaxUNet2DConditionOutput


( sample: ndarray )

Parameters

replace


( **updates )

Returns a new object replacing the specified fields with new values.

FlaxUNet2DConditionModel

class diffusers.FlaxUNet2DConditionModel

( sample_size: int = 32, in_channels: int = 4, out_channels: int = 4, down_block_types: typing.Tuple[str] = ('CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'DownBlock2D'), up_block_types: typing.Tuple[str] = ('UpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D'), only_cross_attention: typing.Union[bool, typing.Tuple[bool]] = False, block_out_channels: typing.Tuple[int] = (320, 640, 1280, 1280), layers_per_block: int = 2, attention_head_dim: typing.Union[int, typing.Tuple[int]] = 8, num_attention_heads: typing.Union[int, typing.Tuple[int], NoneType] = None, cross_attention_dim: int = 1280, dropout: float = 0.0, use_linear_projection: bool = False, dtype: dtype = <class 'jax.numpy.float32'>, flip_sin_to_cos: bool = True, freq_shift: int = 0, use_memory_efficient_attention: bool = False, parent: typing.Union[typing.Type[flax.linen.module.Module], typing.Type[flax.core.scope.Scope], typing.Type[flax.linen.module._Sentinel], NoneType] = <flax.linen.module._Sentinel object at 0x7fa57076a310>, name: str = None )

Parameters

FlaxUNet2DConditionModel is a conditional 2D UNet model that takes a noisy sample, conditional state, and a timestep and returns a sample-shaped output.

This model inherits from FlaxModelMixin. Check the superclass documentation for the generic methods the library implements for all the models (such as downloading or saving, etc.)

Also, this model is a Flax Linen flax.linen.Module subclass. Use it as a regular Flax Linen module and refer to the Flax documentation for all matters related to general usage and behavior.

Finally, this model supports inherent JAX features such as Just-In-Time (JIT) compilation, automatic differentiation, vectorization, and parallelization.

FlaxDecoderOutput

class diffusers.models.vae_flax.FlaxDecoderOutput


( sample: ndarray )

Parameters

Output of decoding method.

replace


( **updates )

Returns a new object replacing the specified fields with new values.

FlaxAutoencoderKLOutput

class diffusers.models.vae_flax.FlaxAutoencoderKLOutput


( latent_dist: FlaxDiagonalGaussianDistribution )

Parameters

Output of AutoencoderKL encoding method.

replace


( **updates )

Returns a new object replacing the specified fields with new values.

FlaxAutoencoderKL

class diffusers.FlaxAutoencoderKL

( in_channels: int = 3, out_channels: int = 3, down_block_types: typing.Tuple[str] = ('DownEncoderBlock2D',), up_block_types: typing.Tuple[str] = ('UpDecoderBlock2D',), block_out_channels: typing.Tuple[int] = (64,), layers_per_block: int = 1, act_fn: str = 'silu', latent_channels: int = 4, norm_num_groups: int = 32, sample_size: int = 32, scaling_factor: float = 0.18215, dtype: dtype = <class 'jax.numpy.float32'>, parent: typing.Union[typing.Type[flax.linen.module.Module], typing.Type[flax.core.scope.Scope], typing.Type[flax.linen.module._Sentinel], NoneType] = <flax.linen.module._Sentinel object at 0x7fa57076a310>, name: str = None )

Parameters

Flax Implementation of Variational Autoencoder (VAE) model with KL loss from the paper Auto-Encoding Variational Bayes by Diederik P. Kingma and Max Welling.

This model is a Flax Linen flax.linen.Module subclass. Use it as a regular Flax Linen module and refer to the Flax documentation for all matters related to general usage and behavior.

Finally, this model supports inherent JAX features such as Just-In-Time (JIT) compilation, automatic differentiation, vectorization, and parallelization.

FlaxControlNetOutput

class diffusers.models.controlnet_flax.FlaxControlNetOutput

( down_block_res_samples: ndarray, mid_block_res_sample: ndarray )

replace


( **updates )

Returns a new object replacing the specified fields with new values.

FlaxControlNetModel

class diffusers.FlaxControlNetModel

( sample_size: int = 32, in_channels: int = 4, down_block_types: typing.Tuple[str] = ('CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'DownBlock2D'), only_cross_attention: typing.Union[bool, typing.Tuple[bool]] = False, block_out_channels: typing.Tuple[int] = (320, 640, 1280, 1280), layers_per_block: int = 2, attention_head_dim: typing.Union[int, typing.Tuple[int]] = 8, num_attention_heads: typing.Union[int, typing.Tuple[int], NoneType] = None, cross_attention_dim: int = 1280, dropout: float = 0.0, use_linear_projection: bool = False, dtype: dtype = <class 'jax.numpy.float32'>, flip_sin_to_cos: bool = True, freq_shift: int = 0, controlnet_conditioning_channel_order: str = 'rgb', conditioning_embedding_out_channels: typing.Tuple[int] = (16, 32, 96, 256), parent: typing.Union[typing.Type[flax.linen.module.Module], typing.Type[flax.core.scope.Scope], typing.Type[flax.linen.module._Sentinel], NoneType] = <flax.linen.module._Sentinel object at 0x7fa57076a310>, name: str = None )

Parameters

Quoting from https://arxiv.org/abs/2302.05543: “Stable Diffusion uses a pre-processing method similar to VQ-GAN [11] to convert the entire dataset of 512 × 512 images into smaller 64 × 64 “latent images” for stabilized training. This requires ControlNets to convert image-based conditions to 64 × 64 feature space to match the convolution size. We use a tiny network E(·) of four convolution layers with 4 × 4 kernels and 2 × 2 strides (activated by ReLU, channels are 16, 32, 64, 128, initialized with Gaussian weights, trained jointly with the full model) to encode image-space conditions … into feature maps …”

This model inherits from FlaxModelMixin. Check the superclass documentation for the generic methods the library implements for all the models (such as downloading or saving, etc.)

Also, this model is a Flax Linen flax.linen.Module subclass. Use it as a regular Flax Linen module and refer to the Flax documentation for all matters related to general usage and behavior.

Finally, this model supports inherent JAX features such as Just-In-Time (JIT) compilation, automatic differentiation, vectorization, and parallelization.