Diffusers documentation
Models
Diffusers contains pretrained models for popular algorithms and modules for creating the next set of diffusion models.
The primary function of these models is to denoise an input sample by modeling the distribution p_θ(x_{t-1} | x_t).
The models are built on the base class ModelMixin, a torch.nn.Module that provides basic functionality for saving and loading models, both locally and from the Hugging Face Hub.
ModelMixin
Base class for all models.
ModelMixin takes care of storing the configuration of the models and handles methods for loading, downloading and saving models.
- config_name (str) — A filename under which the model should be stored when calling save_pretrained().
disable_gradient_checkpointing
Deactivates gradient checkpointing for the current model.
Note that in other frameworks this feature can be referred to as “activation checkpointing” or “checkpoint activations”.
disable_xformers_memory_efficient_attention
Disable memory efficient attention as implemented in xformers.
enable_gradient_checkpointing
Activates gradient checkpointing for the current model.
Note that in other frameworks this feature can be referred to as “activation checkpointing” or “checkpoint activations”.
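A minimal sketch of toggling gradient checkpointing on a ModelMixin subclass; the repo id below is only an illustration, and any diffusers model repository would work:

```python
from diffusers import UNet2DConditionModel

# Illustrative repo id; every ModelMixin subclass exposes the same methods.
model = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

model.enable_gradient_checkpointing()   # trade extra compute for lower activation memory during training
# ... run training steps ...
model.disable_gradient_checkpointing()  # restore the default behaviour, e.g. before fast inference
```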
enable_xformers_memory_efficient_attention
< source >( attention_op: typing.Optional[typing.Callable] = None )
Enable memory efficient attention as implemented in xformers.
When this option is enabled, you should observe lower GPU memory usage and a potential speed up at inference time. Speed up at training time is not guaranteed.
Warning: when memory efficient attention and sliced attention are both enabled, memory efficient attention takes precedence.
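A short sketch of how these toggles are typically used, assuming the xformers package is installed and a CUDA device is available (the repo id is illustrative):

```python
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet", torch_dtype=torch.float16
).to("cuda")

unet.enable_xformers_memory_efficient_attention()
# Optionally pass a specific xformers attention op, e.g.:
# from xformers.ops import MemoryEfficientAttentionFlashAttentionOp
# unet.enable_xformers_memory_efficient_attention(attention_op=MemoryEfficientAttentionFlashAttentionOp)

unet.disable_xformers_memory_efficient_attention()  # revert to the default attention implementation
```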
from_pretrained
< source >( pretrained_model_name_or_path: typing.Union[str, os.PathLike, NoneType], **kwargs )
Instantiate a pretrained PyTorch model from a pretrained model configuration.
The model is set in evaluation mode by default using model.eval() (dropout modules are deactivated). To train the model, you should first set it back in training mode with model.train().
The warning Weights from XXX not initialized from pretrained model means that the weights of XXX do not come pretrained with the rest of the model. It is up to you to train those weights with a downstream fine-tuning task.
The warning Weights from XXX not used in YYY means that the layer XXX is not used by YYY, therefore those weights are discarded.
You need to be logged in (huggingface-cli login) to use private or gated models.
Activate the special “offline-mode” to use this method in a firewalled environment.
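A minimal usage sketch; the repo ids are illustrative examples of public model repositories on the Hub:

```python
from diffusers import UNet2DModel

# Download (or load from cache) the weights and configuration; the model is returned in eval mode.
model = UNet2DModel.from_pretrained("google/ddpm-cat-256")

# Sub-models stored inside a pipeline repository can be loaded via `subfolder`, e.g.:
# unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")

model.train()  # switch back to training mode before fine-tuning
```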
num_parameters
< source >( only_trainable: bool = False, exclude_embeddings: bool = False ) → int
Get the number of (optionally, trainable or non-embedding) parameters in the module.
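For example (repo id illustrative):

```python
from diffusers import UNet2DModel

model = UNet2DModel.from_pretrained("google/ddpm-cat-256")  # illustrative repo id
print(model.num_parameters())                     # total number of parameters
print(model.num_parameters(only_trainable=True))  # parameters with requires_grad=True only
```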
save_pretrained
< source >( save_directory: typing.Union[str, os.PathLike], is_main_process: bool = True, save_function: typing.Callable = None, safe_serialization: bool = False, variant: typing.Optional[str] = None )
Save a model and its configuration file to a directory, so that it can be re-loaded using the
[from_pretrained()](/docs/diffusers/main/en/api/models#diffusers.ModelMixin.from_pretrained)
class method.
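A round-trip sketch of saving a model locally and reloading it (the directory name is arbitrary, the repo id illustrative):

```python
from diffusers import UNet2DModel

model = UNet2DModel.from_pretrained("google/ddpm-cat-256")   # illustrative repo id
model.save_pretrained("./my-unet", safe_serialization=True)  # writes config.json and safetensors weights
reloaded = UNet2DModel.from_pretrained("./my-unet")
```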
UNet2DOutput
UNet2DModel
class diffusers.UNet2DModel
< source >( sample_size: typing.Union[int, typing.Tuple[int, int], NoneType] = None, in_channels: int = 3, out_channels: int = 3, center_input_sample: bool = False, time_embedding_type: str = 'positional', freq_shift: int = 0, flip_sin_to_cos: bool = True, down_block_types: typing.Tuple[str] = ('DownBlock2D', 'AttnDownBlock2D', 'AttnDownBlock2D', 'AttnDownBlock2D'), up_block_types: typing.Tuple[str] = ('AttnUpBlock2D', 'AttnUpBlock2D', 'AttnUpBlock2D', 'UpBlock2D'), block_out_channels: typing.Tuple[int] = (224, 448, 672, 896), layers_per_block: int = 2, mid_block_scale_factor: float = 1, downsample_padding: int = 1, act_fn: str = 'silu', attention_head_dim: typing.Optional[int] = 8, norm_num_groups: int = 32, norm_eps: float = 1e-05, resnet_time_scale_shift: str = 'default', add_attention: bool = True, class_embed_type: typing.Optional[str] = None, num_class_embeds: typing.Optional[int] = None )
UNet2DModel is a 2D UNet model that takes in a noisy sample and a timestep and returns a sample-shaped output.
This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all models (such as downloading or saving).
forward
< source >( sample: FloatTensor, timestep: typing.Union[torch.Tensor, float, int], class_labels: typing.Optional[torch.Tensor] = None, return_dict: bool = True ) → UNet2DOutput or tuple
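A minimal sketch of a denoising forward pass with a small, randomly initialised UNet2DModel; the block types and sizes below are arbitrary and only chosen to keep the example fast:

```python
import torch
from diffusers import UNet2DModel

model = UNet2DModel(
    sample_size=32,
    in_channels=3,
    out_channels=3,
    block_out_channels=(32, 64),
    down_block_types=("DownBlock2D", "AttnDownBlock2D"),
    up_block_types=("AttnUpBlock2D", "UpBlock2D"),
)

noisy_sample = torch.randn(1, 3, 32, 32)   # (batch, channels, height, width)
timestep = torch.tensor([10])
output = model(noisy_sample, timestep).sample
print(output.shape)                        # torch.Size([1, 3, 32, 32]) — same shape as the input
```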
UNet1DOutput
UNet1DModel
class diffusers.UNet1DModel
< source >( sample_size: int = 65536, sample_rate: typing.Optional[int] = None, in_channels: int = 2, out_channels: int = 2, extra_in_channels: int = 0, time_embedding_type: str = 'fourier', flip_sin_to_cos: bool = True, use_timestep_embedding: bool = False, freq_shift: float = 0.0, down_block_types: typing.Tuple[str] = ('DownBlock1DNoSkip', 'DownBlock1D', 'AttnDownBlock1D'), up_block_types: typing.Tuple[str] = ('AttnUpBlock1D', 'UpBlock1D', 'UpBlock1DNoSkip'), mid_block_type: typing.Tuple[str] = 'UNetMidBlock1D', out_block_type: str = None, block_out_channels: typing.Tuple[int] = (32, 32, 64), act_fn: str = None, norm_num_groups: int = 8, layers_per_block: int = 1, downsample_each_block: bool = False )
UNet1DModel is a 1D UNet model that takes in a noisy sample and a timestep and returns a sample-shaped output.
This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all models (such as downloading or saving).
forward
< source >( sample: FloatTensor, timestep: typing.Union[torch.Tensor, float, int], return_dict: bool = True ) → UNet1DOutput or tuple
UNet2DConditionOutput
class diffusers.models.unet_2d_condition.UNet2DConditionOutput
< source >( sample: FloatTensor )
UNet2DConditionModel
class diffusers.UNet2DConditionModel
< source >( sample_size: typing.Optional[int] = None, in_channels: int = 4, out_channels: int = 4, center_input_sample: bool = False, flip_sin_to_cos: bool = True, freq_shift: int = 0, down_block_types: typing.Tuple[str] = ('CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'DownBlock2D'), mid_block_type: typing.Optional[str] = 'UNetMidBlock2DCrossAttn', up_block_types: typing.Tuple[str] = ('UpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D'), only_cross_attention: typing.Union[bool, typing.Tuple[bool]] = False, block_out_channels: typing.Tuple[int] = (320, 640, 1280, 1280), layers_per_block: typing.Union[int, typing.Tuple[int]] = 2, downsample_padding: int = 1, mid_block_scale_factor: float = 1, act_fn: str = 'silu', norm_num_groups: typing.Optional[int] = 32, norm_eps: float = 1e-05, cross_attention_dim: typing.Union[int, typing.Tuple[int]] = 1280, encoder_hid_dim: typing.Optional[int] = None, encoder_hid_dim_type: typing.Optional[str] = None, attention_head_dim: typing.Union[int, typing.Tuple[int]] = 8, num_attention_heads: typing.Union[int, typing.Tuple[int], NoneType] = None, dual_cross_attention: bool = False, use_linear_projection: bool = False, class_embed_type: typing.Optional[str] = None, addition_embed_type: typing.Optional[str] = None, num_class_embeds: typing.Optional[int] = None, upcast_attention: bool = False, resnet_time_scale_shift: str = 'default', resnet_skip_time_act: bool = False, resnet_out_scale_factor: int = 1.0, time_embedding_type: str = 'positional', time_embedding_dim: typing.Optional[int] = None, time_embedding_act_fn: typing.Optional[str] = None, timestep_post_act: typing.Optional[str] = None, time_cond_proj_dim: typing.Optional[int] = None, conv_in_kernel: int = 3, conv_out_kernel: int = 3, projection_class_embeddings_input_dim: typing.Optional[int] = None, class_embeddings_concat: bool = False, mid_block_only_cross_attention: typing.Optional[bool] = None, cross_attention_norm: typing.Optional[str] = None, addition_embed_type_num_heads = 64 )
UNet2DConditionModel is a conditional 2D UNet model that takes in a noisy sample, conditional state, and a timestep and returns a sample-shaped output.
This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all the models (such as downloading or saving, etc.)
forward
< source >( sample: FloatTensor, timestep: typing.Union[torch.Tensor, float, int], encoder_hidden_states: Tensor, class_labels: typing.Optional[torch.Tensor] = None, timestep_cond: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None, added_cond_kwargs: typing.Union[typing.Dict[str, torch.Tensor], NoneType] = None, down_block_additional_residuals: typing.Optional[typing.Tuple[torch.Tensor]] = None, mid_block_additional_residual: typing.Optional[torch.Tensor] = None, encoder_attention_mask: typing.Optional[torch.Tensor] = None, return_dict: bool = True ) → UNet2DConditionOutput or tuple
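A sketch of a conditioned forward pass, using the Stable Diffusion UNet as an example (repo id illustrative; the random text embeddings stand in for real CLIP text-encoder output):

```python
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")

latents = torch.randn(1, 4, 64, 64)          # noisy latent sample
timestep = torch.tensor([50])
text_embeddings = torch.randn(1, 77, 768)    # stand-in for CLIP text embeddings (cross_attention_dim=768)

with torch.no_grad():
    noise_pred = unet(latents, timestep, encoder_hidden_states=text_embeddings).sample
print(noise_pred.shape)                      # torch.Size([1, 4, 64, 64])
```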
set_attention_slice
Enable sliced attention computation.
When this option is enabled, the attention module will split the input tensor in slices, to compute attention in several steps. This is useful to save some memory in exchange for a small speed decrease.
set_attn_processor
< source >( processor: typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor, typing.Dict[str, typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor]]] )
set_default_attn_processor
Disables custom attention processors and sets the default attention implementation.
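A sketch of swapping attention processors; AttnProcessor2_0 relies on PyTorch 2.0's scaled_dot_product_attention, and the repo id is illustrative:

```python
from diffusers import UNet2DConditionModel
from diffusers.models.attention_processor import AttnProcessor2_0

unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")

unet.set_attn_processor(AttnProcessor2_0())  # use the same processor for every attention layer
unet.set_default_attn_processor()            # revert to the library default implementation
```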
UNet3DConditionOutput
class diffusers.models.unet_3d_condition.UNet3DConditionOutput
< source >( sample: FloatTensor )
UNet3DConditionModel
class diffusers.UNet3DConditionModel
< source >( sample_size: typing.Optional[int] = None, in_channels: int = 4, out_channels: int = 4, down_block_types: typing.Tuple[str] = ('CrossAttnDownBlock3D', 'CrossAttnDownBlock3D', 'CrossAttnDownBlock3D', 'DownBlock3D'), up_block_types: typing.Tuple[str] = ('UpBlock3D', 'CrossAttnUpBlock3D', 'CrossAttnUpBlock3D', 'CrossAttnUpBlock3D'), block_out_channels: typing.Tuple[int] = (320, 640, 1280, 1280), layers_per_block: int = 2, downsample_padding: int = 1, mid_block_scale_factor: float = 1, act_fn: str = 'silu', norm_num_groups: typing.Optional[int] = 32, norm_eps: float = 1e-05, cross_attention_dim: int = 1024, attention_head_dim: typing.Union[int, typing.Tuple[int]] = 64, num_attention_heads: typing.Union[int, typing.Tuple[int], NoneType] = None )
UNet3DConditionModel is a conditional 3D UNet model that takes in a noisy sample, conditional state, and a timestep and returns a sample-shaped output.
This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all the models (such as downloading or saving, etc.)
forward
< source >( sample: FloatTensor, timestep: typing.Union[torch.Tensor, float, int], encoder_hidden_states: Tensor, class_labels: typing.Optional[torch.Tensor] = None, timestep_cond: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, cross_attention_kwargs: typing.Union[typing.Dict[str, typing.Any], NoneType] = None, down_block_additional_residuals: typing.Optional[typing.Tuple[torch.Tensor]] = None, mid_block_additional_residual: typing.Optional[torch.Tensor] = None, return_dict: bool = True ) → UNet3DConditionOutput or tuple
set_attention_slice
Enable sliced attention computation.
When this option is enabled, the attention module will split the input tensor in slices, to compute attention in several steps. This is useful to save some memory in exchange for a small speed decrease.
set_attn_processor
< source >( processor: typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor, typing.Dict[str, typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor]]] )
set_default_attn_processor
Disables custom attention processors and sets the default attention implementation.
DecoderOutput
Output of decoding method.
VQEncoderOutput
Output of VQModel encoding method.
VQModel
class diffusers.VQModel
< source >( in_channels: int = 3, out_channels: int = 3, down_block_types: typing.Tuple[str] = ('DownEncoderBlock2D',), up_block_types: typing.Tuple[str] = ('UpDecoderBlock2D',), block_out_channels: typing.Tuple[int] = (64,), layers_per_block: int = 1, act_fn: str = 'silu', latent_channels: int = 3, sample_size: int = 32, num_vq_embeddings: int = 256, norm_num_groups: int = 32, vq_embed_dim: typing.Optional[int] = None, scaling_factor: float = 0.18215, norm_type: str = 'group' )
VQ-VAE model from the paper Neural Discrete Representation Learning by Aaron van den Oord, Oriol Vinyals and Koray Kavukcuoglu.
This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all models (such as downloading or saving).
AutoencoderKLOutput
class diffusers.models.autoencoder_kl.AutoencoderKLOutput
< source >( latent_dist: DiagonalGaussianDistribution )
Output of AutoencoderKL encoding method.
AutoencoderKL
class diffusers.AutoencoderKL
< source >( in_channels: int = 3, out_channels: int = 3, down_block_types: typing.Tuple[str] = ('DownEncoderBlock2D',), up_block_types: typing.Tuple[str] = ('UpDecoderBlock2D',), block_out_channels: typing.Tuple[int] = (64,), layers_per_block: int = 1, act_fn: str = 'silu', latent_channels: int = 4, norm_num_groups: int = 32, sample_size: int = 32, scaling_factor: float = 0.18215 )
Variational Autoencoder (VAE) model with KL loss from the paper Auto-Encoding Variational Bayes by Diederik P. Kingma and Max Welling.
This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all models (such as downloading or saving).
disable_slicing
Disable sliced VAE decoding. If enable_slicing was previously invoked, this method will go back to computing decoding in one step.
disable_tiling
Disable tiled VAE decoding. If enable_tiling was previously invoked, this method will go back to computing decoding in one step.
enable_slicing
Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes.
enable_tiling
Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in several steps. This is useful to save a large amount of memory and to allow the processing of larger images.
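A minimal sketch of toggling both memory-saving modes on an AutoencoderKL (repo id illustrative):

```python
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

vae.enable_slicing()   # decode the batch one sample at a time to lower peak memory
vae.enable_tiling()    # encode/decode large images tile by tile
# ... run inference ...
vae.disable_slicing()  # back to single-step decoding
vae.disable_tiling()
```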
forward
< source >( sample: FloatTensor, sample_posterior: bool = False, return_dict: bool = True, generator: typing.Optional[torch._C.Generator] = None )
set_attn_processor
< source >( processor: typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor, typing.Dict[str, typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor]]] )
set_default_attn_processor
Disables custom attention processors and sets the default attention implementation.
tiled_decode
Decode a batch of images using a tiled decoder.
tiled_encode
Encode a batch of images using a tiled encoder.
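A sketch of a full encode/decode round trip with an AutoencoderKL; the repo id is illustrative and the random tensor stands in for an image normalised to [-1, 1]:

```python
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

image = torch.randn(1, 3, 512, 512)                        # stand-in for a normalised input image
with torch.no_grad():
    posterior = vae.encode(image).latent_dist              # DiagonalGaussianDistribution
    latents = posterior.sample() * vae.config.scaling_factor
    decoded = vae.decode(latents / vae.config.scaling_factor).sample
print(latents.shape, decoded.shape)                        # (1, 4, 64, 64) and (1, 3, 512, 512)
```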
Transformer2DModel
class diffusers.Transformer2DModel
< source >( num_attention_heads: int = 16, attention_head_dim: int = 88, in_channels: typing.Optional[int] = None, out_channels: typing.Optional[int] = None, num_layers: int = 1, dropout: float = 0.0, norm_num_groups: int = 32, cross_attention_dim: typing.Optional[int] = None, attention_bias: bool = False, sample_size: typing.Optional[int] = None, num_vector_embeds: typing.Optional[int] = None, patch_size: typing.Optional[int] = None, activation_fn: str = 'geglu', num_embeds_ada_norm: typing.Optional[int] = None, use_linear_projection: bool = False, only_cross_attention: bool = False, upcast_attention: bool = False, norm_type: str = 'layer_norm', norm_elementwise_affine: bool = True )
Transformer model for image-like data. Takes either discrete (classes of vector embeddings) or continuous (actual embeddings) inputs.
When input is continuous: First, project the input (aka embedding) and reshape to b, t, d. Then apply standard transformer action. Finally, reshape to image.
When input is discrete: First, input (classes of latent pixels) is converted to embeddings and has positional embeddings applied, see ImagePositionalEmbeddings. Then apply standard transformer action. Finally, predict classes of unnoised image.
Note that it is assumed one of the input classes is the masked latent pixel. The predicted classes of the unnoised image do not contain a prediction for the masked pixel as the unnoised image cannot be masked.
forward
< source >( hidden_states: Tensor, encoder_hidden_states: typing.Optional[torch.Tensor] = None, timestep: typing.Optional[torch.LongTensor] = None, class_labels: typing.Optional[torch.LongTensor] = None, cross_attention_kwargs: typing.Dict[str, typing.Any] = None, attention_mask: typing.Optional[torch.Tensor] = None, encoder_attention_mask: typing.Optional[torch.Tensor] = None, return_dict: bool = True ) → Transformer2DModelOutput or tuple
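A small sketch of the continuous-input path, with arbitrary sizes chosen so the randomly initialised module runs quickly (in_channels must be divisible by norm_num_groups, which defaults to 32):

```python
import torch
from diffusers import Transformer2DModel

model = Transformer2DModel(
    num_attention_heads=2,
    attention_head_dim=16,
    in_channels=32,                       # continuous input: (batch, channels, height, width)
)

hidden_states = torch.randn(1, 32, 16, 16)
output = model(hidden_states).sample      # same shape as the input: (1, 32, 16, 16)
```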
Transformer2DModelOutput
class diffusers.models.transformer_2d.Transformer2DModelOutput
< source >( sample: FloatTensor )
TransformerTemporalModel
class diffusers.models.transformer_temporal.TransformerTemporalModel
< source >( num_attention_heads: int = 16, attention_head_dim: int = 88, in_channels: typing.Optional[int] = None, out_channels: typing.Optional[int] = None, num_layers: int = 1, dropout: float = 0.0, norm_num_groups: int = 32, cross_attention_dim: typing.Optional[int] = None, attention_bias: bool = False, sample_size: typing.Optional[int] = None, activation_fn: str = 'geglu', norm_elementwise_affine: bool = True, double_self_attention: bool = True )
Transformer model for video-like data.
forward
< source >( hidden_states, encoder_hidden_states = None, timestep = None, class_labels = None, num_frames = 1, cross_attention_kwargs = None, return_dict: bool = True ) → TransformerTemporalModelOutput or tuple
TransformerTemporalModelOutput
class diffusers.models.transformer_temporal.TransformerTemporalModelOutput
< source >( sample: FloatTensor )
PriorTransformer
class diffusers.PriorTransformer
< source >( num_attention_heads: int = 32, attention_head_dim: int = 64, num_layers: int = 20, embedding_dim: int = 768, num_embeddings = 77, additional_embeddings = 4, dropout: float = 0.0 )
The prior transformer from unCLIP is used to predict CLIP image embeddings from CLIP text embeddings. Note that the transformer predicts the image embeddings through a denoising diffusion process.
This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all the models (such as downloading or saving, etc.)
For more details, see the original paper: https://arxiv.org/abs/2204.06125
forward
< source >( hidden_states, timestep: typing.Union[torch.Tensor, float, int], proj_embedding: FloatTensor, encoder_hidden_states: FloatTensor, attention_mask: typing.Optional[torch.BoolTensor] = None, return_dict: bool = True ) → PriorTransformerOutput or tuple
set_attn_processor
< source >( processor: typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor, typing.Dict[str, typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor]]] )
set_default_attn_processor
Disables custom attention processors and sets the default attention implementation.
PriorTransformerOutput
class diffusers.models.prior_transformer.PriorTransformerOutput
< source >( predicted_image_embedding: FloatTensor )
ControlNetOutput
class diffusers.models.controlnet.ControlNetOutput
< source >( down_block_res_samples: typing.Tuple[torch.Tensor], mid_block_res_sample: Tensor )
ControlNetModel
class diffusers.ControlNetModel
< source >( in_channels: int = 4, conditioning_channels: int = 3, flip_sin_to_cos: bool = True, freq_shift: int = 0, down_block_types: typing.Tuple[str] = ('CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'DownBlock2D'), only_cross_attention: typing.Union[bool, typing.Tuple[bool]] = False, block_out_channels: typing.Tuple[int] = (320, 640, 1280, 1280), layers_per_block: int = 2, downsample_padding: int = 1, mid_block_scale_factor: float = 1, act_fn: str = 'silu', norm_num_groups: typing.Optional[int] = 32, norm_eps: float = 1e-05, cross_attention_dim: int = 1280, attention_head_dim: typing.Union[int, typing.Tuple[int]] = 8, num_attention_heads: typing.Union[int, typing.Tuple[int], NoneType] = None, use_linear_projection: bool = False, class_embed_type: typing.Optional[str] = None, num_class_embeds: typing.Optional[int] = None, upcast_attention: bool = False, resnet_time_scale_shift: str = 'default', projection_class_embeddings_input_dim: typing.Optional[int] = None, controlnet_conditioning_channel_order: str = 'rgb', conditioning_embedding_out_channels: typing.Optional[typing.Tuple[int]] = (16, 32, 96, 256), global_pool_conditions: bool = False )
from_unet
< source >( unet: UNet2DConditionModel, controlnet_conditioning_channel_order: str = 'rgb', conditioning_embedding_out_channels: typing.Optional[typing.Tuple[int]] = (16, 32, 96, 256), load_weights_from_unet: bool = True )
Instantiate a ControlNetModel from a UNet2DConditionModel.
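A sketch of initialising a ControlNet from an existing UNet so its encoder weights start from the UNet's (the repo id and output directory are illustrative):

```python
from diffusers import ControlNetModel, UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")

controlnet = ControlNetModel.from_unet(unet, load_weights_from_unet=True)
controlnet.save_pretrained("./my-controlnet")   # ready to be fine-tuned on image-conditioning data
```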
set_attention_slice
Enable sliced attention computation.
When this option is enabled, the attention module will split the input tensor in slices, to compute attention in several steps. This is useful to save some memory in exchange for a small speed decrease.
set_attn_processor
< source >( processor: typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor, typing.Dict[str, typing.Union[diffusers.models.attention_processor.AttnProcessor, diffusers.models.attention_processor.AttnProcessor2_0, diffusers.models.attention_processor.XFormersAttnProcessor, diffusers.models.attention_processor.SlicedAttnProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor, diffusers.models.attention_processor.SlicedAttnAddedKVProcessor, diffusers.models.attention_processor.AttnAddedKVProcessor2_0, diffusers.models.attention_processor.XFormersAttnAddedKVProcessor, diffusers.models.attention_processor.LoRAAttnProcessor, diffusers.models.attention_processor.LoRAXFormersAttnProcessor, diffusers.models.attention_processor.LoRAAttnProcessor2_0, diffusers.models.attention_processor.LoRAAttnAddedKVProcessor, diffusers.models.attention_processor.CustomDiffusionAttnProcessor, diffusers.models.attention_processor.CustomDiffusionXFormersAttnProcessor]]] )
set_default_attn_processor
Disables custom attention processors and sets the default attention implementation.
FlaxModelMixin
Base class for all Flax models.
FlaxModelMixin takes care of storing the configuration of the models and handles methods for loading, downloading and saving models.
from_pretrained
< source >( pretrained_model_name_or_path: typing.Union[str, os.PathLike], dtype: dtype = <class 'jax.numpy.float32'>, *model_args, **kwargs )
Instantiate a pretrained Flax model from a pretrained model configuration.
The warning Weights from XXX not initialized from pretrained model means that the weights of XXX do not come pretrained with the rest of the model. It is up to you to train those weights with a downstream fine-tuning task.
The warning Weights from XXX not used in YYY means that the layer XXX is not used by YYY, therefore those weights are discarded.
save_pretrained
< source >( save_directory: typing.Union[str, os.PathLike], params: typing.Union[typing.Dict, flax.core.frozen_dict.FrozenDict], is_main_process: bool = True )
Save a model and its configuration file to a directory, so that it can be re-loaded using the
[from_pretrained()](/docs/diffusers/main/en/api/models#diffusers.FlaxModelMixin.from_pretrained)
class method.
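A sketch of the Flax loading/saving round trip. Unlike the PyTorch ModelMixin, from_pretrained here returns the module definition and its parameter tree separately; the repo id is illustrative and assumes Flax (.msgpack) weights are available in the repository:

```python
from diffusers import FlaxUNet2DConditionModel

# Returns (model, params): the Flax module and its parameters as a separate tree.
model, params = FlaxUNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

model.save_pretrained("./my-flax-unet", params=params)
```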
to_bf16
< source >( params: typing.Union[typing.Dict, flax.core.frozen_dict.FrozenDict], mask: typing.Any = None )
Cast the floating-point params to jax.numpy.bfloat16. This returns a new params tree and does not cast the params in place.
This method can be used on TPU to explicitly convert the model parameters to bfloat16 precision to do full half-precision training or to save weights in bfloat16 for inference in order to save memory and improve speed.
to_fp16
< source >( params: typing.Union[typing.Dict, flax.core.frozen_dict.FrozenDict], mask: typing.Any = None )
Cast the floating-point params to jax.numpy.float16. This returns a new params tree and does not cast the params in place.
This method can be used on GPU to explicitly convert the model parameters to float16 precision to do full half-precision training or to save weights in float16 for inference in order to save memory and improve speed.
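A sketch of casting a parameter tree between precisions; the methods return new trees, so the result must be reassigned (repo id illustrative, Flax weights assumed to be available):

```python
from diffusers import FlaxUNet2DConditionModel

model, params = FlaxUNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

params = model.to_fp16(params)   # jnp.float16, e.g. for GPU half-precision inference
params = model.to_bf16(params)   # jnp.bfloat16, typically for TPUs
params = model.to_fp32(params)   # back to full precision
```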
to_fp32
< source >( params: typing.Union[typing.Dict, flax.core.frozen_dict.FrozenDict], mask: typing.Any = None )
Cast the floating-point params to jax.numpy.float32. This method can be used to explicitly convert the model parameters to fp32 precision. This returns a new params tree and does not cast the params in place.
FlaxUNet2DConditionOutput
class diffusers.models.unet_2d_condition_flax.FlaxUNet2DConditionOutput
< source >( sample: ndarray )
Returns a new object replacing the specified fields with new values.
FlaxUNet2DConditionModel
class diffusers.FlaxUNet2DConditionModel
< source >( sample_size: int = 32, in_channels: int = 4, out_channels: int = 4, down_block_types: typing.Tuple[str] = ('CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'DownBlock2D'), up_block_types: typing.Tuple[str] = ('UpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D'), only_cross_attention: typing.Union[bool, typing.Tuple[bool]] = False, block_out_channels: typing.Tuple[int] = (320, 640, 1280, 1280), layers_per_block: int = 2, attention_head_dim: typing.Union[int, typing.Tuple[int]] = 8, num_attention_heads: typing.Union[int, typing.Tuple[int], NoneType] = None, cross_attention_dim: int = 1280, dropout: float = 0.0, use_linear_projection: bool = False, dtype: dtype = <class 'jax.numpy.float32'>, flip_sin_to_cos: bool = True, freq_shift: int = 0, use_memory_efficient_attention: bool = False, parent: typing.Union[typing.Type[flax.linen.module.Module], typing.Type[flax.core.scope.Scope], typing.Type[flax.linen.module._Sentinel], NoneType] = <flax.linen.module._Sentinel object at 0x7fa57076a310>, name: str = None )
FlaxUNet2DConditionModel is a conditional 2D UNet model that takes in a noisy sample, conditional state, and a timestep and returns a sample-shaped output.
This model inherits from FlaxModelMixin. Check the superclass documentation for the generic methods the library implements for all the models (such as downloading or saving, etc.)
Also, this model is a Flax Linen flax.linen.Module subclass. Use it as a regular Flax Linen module and refer to the Flax documentation for all matters related to general usage and behavior.
Finally, this model supports inherent JAX features such as:
- Just-In-Time (JIT) compilation
- Automatic Differentiation
- Vectorization
- Parallelization
FlaxDecoderOutput
Output of decoding method.
Returns a new object replacing the specified fields with new values.
FlaxAutoencoderKLOutput
class diffusers.models.vae_flax.FlaxAutoencoderKLOutput
< source >( latent_dist: FlaxDiagonalGaussianDistribution )
Output of AutoencoderKL encoding method.
Returns a new object replacing the specified fields with new values.
FlaxAutoencoderKL
class diffusers.FlaxAutoencoderKL
< source >( in_channels: int = 3, out_channels: int = 3, down_block_types: typing.Tuple[str] = ('DownEncoderBlock2D',), up_block_types: typing.Tuple[str] = ('UpDecoderBlock2D',), block_out_channels: typing.Tuple[int] = (64,), layers_per_block: int = 1, act_fn: str = 'silu', latent_channels: int = 4, norm_num_groups: int = 32, sample_size: int = 32, scaling_factor: float = 0.18215, dtype: dtype = <class 'jax.numpy.float32'>, parent: typing.Union[typing.Type[flax.linen.module.Module], typing.Type[flax.core.scope.Scope], typing.Type[flax.linen.module._Sentinel], NoneType] = <flax.linen.module._Sentinel object at 0x7fa57076a310>, name: str = None )
Flax Implementation of Variational Autoencoder (VAE) model with KL loss from the paper Auto-Encoding Variational Bayes by Diederik P. Kingma and Max Welling.
This model is a Flax Linen flax.linen.Module subclass. Use it as a regular Flax Linen module and refer to the Flax documentation for all matters related to general usage and behavior.
Finally, this model supports inherent JAX features such as:
- Just-In-Time (JIT) compilation
- Automatic Differentiation
- Vectorization
- Parallelization
FlaxControlNetOutput
class diffusers.models.controlnet_flax.FlaxControlNetOutput
< source >( down_block_res_samples: ndarray, mid_block_res_sample: ndarray )
Returns a new object replacing the specified fields with new values.
FlaxControlNetModel
class diffusers.FlaxControlNetModel
< source >( sample_size: int = 32, in_channels: int = 4, down_block_types: typing.Tuple[str] = ('CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'DownBlock2D'), only_cross_attention: typing.Union[bool, typing.Tuple[bool]] = False, block_out_channels: typing.Tuple[int] = (320, 640, 1280, 1280), layers_per_block: int = 2, attention_head_dim: typing.Union[int, typing.Tuple[int]] = 8, num_attention_heads: typing.Union[int, typing.Tuple[int], NoneType] = None, cross_attention_dim: int = 1280, dropout: float = 0.0, use_linear_projection: bool = False, dtype: dtype = <class 'jax.numpy.float32'>, flip_sin_to_cos: bool = True, freq_shift: int = 0, controlnet_conditioning_channel_order: str = 'rgb', conditioning_embedding_out_channels: typing.Tuple[int] = (16, 32, 96, 256), parent: typing.Union[typing.Type[flax.linen.module.Module], typing.Type[flax.core.scope.Scope], typing.Type[flax.linen.module._Sentinel], NoneType] = <flax.linen.module._Sentinel object at 0x7fa57076a310>, name: str = None )
Quoting from https://arxiv.org/abs/2302.05543: “Stable Diffusion uses a pre-processing method similar to VQ-GAN [11] to convert the entire dataset of 512 × 512 images into smaller 64 × 64 “latent images” for stabilized training. This requires ControlNets to convert image-based conditions to 64 × 64 feature space to match the convolution size. We use a tiny network E(·) of four convolution layers with 4 × 4 kernels and 2 × 2 strides (activated by ReLU, channels are 16, 32, 64, 128, initialized with Gaussian weights, trained jointly with the full model) to encode image-space conditions … into feature maps …”
This model inherits from FlaxModelMixin. Check the superclass documentation for the generic methods the library implements for all the models (such as downloading or saving, etc.)
Also, this model is a Flax Linen flax.linen.Module subclass. Use it as a regular Flax Linen module and refer to the Flax documentation for all matters related to general usage and behavior.
Finally, this model supports inherent JAX features such as:
- Just-In-Time (JIT) compilation
- Automatic Differentiation
- Vectorization
- Parallelization