Fashion MNIST Text-to-Image Diffusion Model

A transformer-based diffusion model trained on Fashion MNIST latent representations for text-to-image generation.

Model Information

  • Architecture: Transformer-based diffusion model
  • Input: 8×8×4 VAE latents
  • Conditioning: Text embeddings (class labels)
  • Training Steps: 8,500
  • Dataset: Fashion MNIST 8×8 Latents
  • Framework: PyTorch

Checkpoints

  • model-1000.safetensors: Early training (1k steps)
  • model-3000.safetensors: Mid training (3k steps)
  • model-5000.safetensors: Advanced training (5k steps)
  • model-8500.safetensors: Final model (8.5k steps)
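To compare training stages, any of the checkpoints above can be fetched from the Hub and loaded into the model. The snippet below is a minimal sketch, assuming the model object is constructed as in the Usage section below and that each checkpoint file holds a plain state dict:

from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

# Download one of the intermediate checkpoints listed above
ckpt_path = hf_hub_download(
    repo_id="shreenithi20/fmnist-t2i-diffusion",
    filename="model-5000.safetensors",
)

# Load its weights into a model built as in the Usage section
state_dict = load_file(ckpt_path)
model.load_state_dict(state_dict)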

Usage

from transformers import AutoModel
import torch

# Load the model from the Hub
model = AutoModel.from_pretrained("shreenithi20/fmnist-t2i-diffusion")
model.eval()

# Class-label conditioning (example: class 0 = "T-shirt/top")
class_labels = torch.tensor([0])

# Generate 8x8x4 latents conditioned on the class labels
with torch.no_grad():
    generated_latents = model.generate(
        text_embeddings=class_labels,
        num_inference_steps=25,
        guidance_scale=7.5
    )
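Conditioning is by class label, so a batch can be built from the standard Fashion MNIST class indices. The mapping below is the dataset's usual class order; that the model takes these raw indices (rather than precomputed text embeddings) is an assumption based on the snippet above.

# Standard Fashion MNIST class order (indices 0-9)
FASHION_MNIST_CLASSES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]

# Example batch: one sneaker and one bag (assumes the model takes raw class indices)
class_labels = torch.tensor([
    FASHION_MNIST_CLASSES.index("Sneaker"),
    FASHION_MNIST_CLASSES.index("Bag"),
])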

Model Architecture

  • Patch Size: 1×1
  • Embedding Dimension: 384
  • Transformer Layers: 12
  • Attention Heads: 6
  • Cross Attention Heads: 4
  • MLP Multiplier: 4
  • Timesteps: Continuous (beta distribution)
  • Beta Distribution: a=1.0, b=2.5 (see the sampling sketch below)
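
Since timesteps are continuous rather than drawn from a fixed discrete grid, each training example's t is sampled from a Beta distribution on [0, 1]. A minimal sketch of that sampling step, assuming a and b map to the Beta distribution's alpha and beta concentration parameters:

import torch

# Continuous timesteps t in (0, 1) from Beta(a=1.0, b=2.5); with these parameters
# the density is highest near t = 0 and the mean is roughly 0.29.
timestep_dist = torch.distributions.Beta(concentration1=1.0, concentration0=2.5)

batch_size = 128
t = timestep_dist.sample((batch_size,))  # shape: (128,)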

Training Details

  • Learning Rate: 1e-3 (Constant)
  • Batch Size: 128
  • Optimizer: AdamW
  • Mixed Precision: Yes
  • Gradient Accumulation: 1
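
Put together, this is a plain constant-rate AdamW loop. The sketch below reflects the hyperparameters listed above; the use of torch.cuda.amp for mixed precision and the model(**batch).loss interface are assumptions, since the card does not specify either.

import torch

LEARNING_RATE = 1e-3      # constant, no scheduler
BATCH_SIZE = 128          # used when building the DataLoader
GRAD_ACCUM_STEPS = 1      # no accumulation, so each batch is one optimizer step

optimizer = torch.optim.AdamW(model.parameters(), lr=LEARNING_RATE)
scaler = torch.cuda.amp.GradScaler()  # mixed precision (assumed: torch AMP)

def training_step(batch):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(**batch).loss    # assumed interface: model returns an object with .loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()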

Results

The model generates high-quality Fashion MNIST images conditioned on class labels; the 8×8 latents it produces can be decoded to 64×64 pixel images.
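
The card does not name the VAE used to produce the 8×8×4 latents, so the decoder below is a placeholder sketch assuming a diffusers AutoencoderKL with the usual 8× spatial upsampling that turns 8×8 latents into 64×64 pixels; substitute the actual autoencoder.

import torch
from diffusers import AutoencoderKL  # assumed decoder class; the card does not name the VAE

# Placeholder path: replace with the VAE that produced the 8x8x4 training latents
vae = AutoencoderKL.from_pretrained("path/to/fmnist-vae")
vae.eval()

# generated_latents from the Usage section, shaped (batch, 4, 8, 8)
with torch.no_grad():
    images = vae.decode(generated_latents).sample  # (batch, channels, 64, 64), roughly in [-1, 1]

# Rescale to [0, 1] for viewing or saving
images = (images.clamp(-1, 1) + 1) / 2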
