Fashion MNIST Text-to-Image Diffusion Model
A transformer-based diffusion model trained on Fashion MNIST latent representations for text-to-image generation.
Model Information
- Architecture: Transformer-based diffusion model
- Input: 8×8×4 VAE latents
- Conditioning: Text embeddings (class labels)
- Training Steps: 8,500
- Dataset: Fashion MNIST 8×8 Latents
- Framework: PyTorch
Checkpoints
- model-1000.safetensors: Early training (1k steps)
- model-3000.safetensors: Mid training (3k steps)
- model-5000.safetensors: Advanced training (5k steps)
- model-8500.safetensors: Final model (8.5k steps)
Usage
```python
from transformers import AutoModel
import torch

# Load model
model = AutoModel.from_pretrained("shreenithi20/fmnist-t2i-diffusion")
model.eval()

# Generate latents conditioned on class-label text embeddings
with torch.no_grad():
    generated_latents = model.generate(
        text_embeddings=class_labels,  # precomputed embeddings of the class labels
        num_inference_steps=25,
        guidance_scale=7.5,
    )
```
Model Architecture
- Patch Size: 1×1
- Embedding Dimension: 384
- Transformer Layers: 12
- Attention Heads: 6
- Cross Attention Heads: 4
- MLP Multiplier: 4
- Timesteps: Continuous (beta distribution)
- Beta Distribution: a=1.0, b=2.5
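With continuous timesteps drawn from Beta(a=1.0, b=2.5), training-time timestep sampling can be sketched as follows (the function name is illustrative; `torch.distributions.Beta` takes the two concentration parameters):

```python
import torch


def sample_timesteps(batch_size, a=1.0, b=2.5):
    """Sample continuous timesteps in [0, 1] from Beta(a, b).

    With a=1.0 and b=2.5 the distribution is skewed toward t=0,
    so lower-noise timesteps are sampled more often.
    """
    beta = torch.distributions.Beta(a, b)
    return beta.sample((batch_size,))


t = sample_timesteps(128)  # one timestep per example in a batch of 128
```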
Training Details
- Learning Rate: 1e-3 (Constant)
- Batch Size: 128
- Optimizer: AdamW
- Mixed Precision: Yes
- Gradient Accumulation: 1
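The hyperparameters above correspond to a training step roughly like the following sketch; the `Linear` layer stands in for the actual diffusion transformer, and the MSE-on-noise loss is illustrative:

```python
import torch

# Stand-in for the diffusion transformer, operating on flattened 8x8x4 latents.
model = torch.nn.Linear(4 * 8 * 8, 4 * 8 * 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)  # constant LR, no scheduler

use_cuda = torch.cuda.is_available()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)  # mixed-precision loss scaling

latents = torch.randn(128, 4 * 8 * 8)  # batch size 128
noise = torch.randn_like(latents)

# Mixed precision: fp16 autocast on GPU, bf16 autocast on CPU.
with torch.autocast("cuda" if use_cuda else "cpu",
                    dtype=torch.float16 if use_cuda else torch.bfloat16):
    pred = model(latents)
    loss = torch.nn.functional.mse_loss(pred.float(), noise)

scaler.scale(loss).backward()
scaler.step(optimizer)  # gradient accumulation of 1: step on every batch
scaler.update()
optimizer.zero_grad()
```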
Results
The model generates high-quality Fashion MNIST images conditioned on class labels, with 8×8 latent resolution that can be decoded to 64×64 pixel images.