The presentation slides are available at the link below.

Presentation

How to use

Caution: a prefix must be added to the prompt

Because of the captions in the dataset we fine-tuned on, the style transfers only when the prompt starts with a suitable prefix.

Example prefixes:

Still from the Anime, ...

Still from Anime Series, ...

Anime Series still of ...

We encourage you to try different prompts with different prefixes.

from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0")
pipe.load_lora_weights("yifei28/sdxl-base-1.0-Conan-lora")

prompt = "Still from the Anime, Elon Musk"
image = pipe(prompt).images[0]
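
To compare prefixes on the same subject, a small helper can build one prompt per prefix. This is only a sketch: `PREFIXES` and `build_prompts` are illustrative names, not part of this repository or of diffusers, and the prefixes are the three examples listed above.

```python
# Illustrative helper (hypothetical names): prepend each example prefix to a subject.
PREFIXES = [
    "Still from the Anime",
    "Still from Anime Series",
    "Anime Series still of",
]

def build_prompts(subject, prefixes=PREFIXES):
    """Return one prompt per prefix, e.g. 'Still from the Anime, Elon Musk'."""
    prompts = []
    for prefix in prefixes:
        # A prefix ending in "of" reads naturally without a comma before the subject.
        separator = " " if prefix.endswith(" of") else ", "
        prompts.append(f"{prefix}{separator}{subject}")
    return prompts

prompts = build_prompts("Elon Musk")
```

Each resulting string can then be passed to `pipe(prompt)` in the same way as the single prompt above.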

Background

Focus: Adapt Stable Diffusion XL (SDXL) to generate images in Conan (Anime) style
Dataset: high-resolution Conan images with GPT-4-generated descriptions (Hugging Face)
Framework: SDXL is a latent diffusion model built on Denoising Diffusion Probabilistic Models (DDPM), leveraging iterative denoising to progressively refine image generation

Dataset Overview

Shape: 1,790 rows × 2 columns (image and text)
Images: high-resolution labeled images, ranging from about 890 to 1,300 px
Text: GPT-4-generated captions, which make the dataset well suited to refining SDXL’s text-image alignment in this style.
No extra data transformation needed

Model Selection & Metrics

Pretrained Model: Stable Diffusion is a text-to-image generative model based on diffusion processes. SDXL (“Stable Diffusion XL”), a more advanced version of Stable Diffusion, offers improved image quality at higher resolution and stronger text-image alignment.
Fine-tuning: LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that adapts SDXL to our specific domain of Conan-style image generation. It preserves pretrained knowledge and fine-tunes faster, making it efficient and cost-effective.
Metrics: FID Score is a measure of similarity between two datasets of images. It is often used to evaluate the quality of samples of Generative Adversarial Networks. FID is calculated by computing the Fréchet distance between two Gaussians fitted to feature representations of the Inception network.
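
For reference, the standard FID definition can be written as below. The symbols are our notation rather than something taken from the slides: $\mu_r, \Sigma_r$ are the mean and covariance of the Inception features of the reference images, and $\mu_g, \Sigma_g$ those of the generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values mean the two feature distributions are closer; identical distributions give a score of 0.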

Model Engineering and Fine-tuning Logic

Baseline Installation with Default Parameters

Learning Rate = 1e-4
LoRA Rank = 4
Batch Size = 1
Mixed Precision = fp16
LR Warm-up Steps = 0
Results:
- Successful style transfer
- Remaining issues: twisted faces and low-quality facial features.

Adjustments for More Stable Gradients
1. Lower learning rate to 1e-5
2. Change mixed precision to bf16: offers a larger numerical range at the same memory usage
3. Increase batch size to 2: allows more data per training iteration
4. Increase LoRA rank to 16: increases the number of learnable parameters
5. Set warm-up steps to 500: helps stabilize the gradients by starting with a smaller learning rate
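
The effect of the rank change can be estimated with simple arithmetic. LoRA represents the update to a weight matrix of shape d_out × d_in with two low-rank factors of shapes d_out × r and r × d_in, so each adapted layer gets r · (d_in + d_out) trainable parameters. The sketch below uses a hypothetical 1280 × 1280 projection layer purely for illustration; the sizes of the layers actually adapted in SDXL are not stated here.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    # Factor A has shape (rank, d_in); factor B has shape (d_out, rank).
    return rank * d_in + d_out * rank

# Hypothetical layer size, chosen only for illustration.
D_IN, D_OUT = 1280, 1280

baseline = lora_param_count(D_IN, D_OUT, rank=4)   # baseline rank
adjusted = lora_param_count(D_IN, D_OUT, rank=16)  # adjusted rank
full = D_IN * D_OUT                                # full fine-tuning of the same layer

print(baseline, adjusted, full)
```

Going from rank 4 to rank 16 quadruples the trainable parameters per layer while staying far below the size of the full weight matrix.
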

Metrics Results in FID Score

LoRA Fine-tuned: 99.7 (-15%)
Base model: 117.4
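
The -15% figure is the relative change of the fine-tuned FID with respect to the base model. A quick check, using only the two scores reported above:

```python
base_fid = 117.4  # base SDXL model
lora_fid = 99.7   # LoRA fine-tuned model

# Relative change with respect to the base model (negative means improvement).
relative_change = (lora_fid - base_fid) / base_fid
print(f"{relative_change:.1%}")  # prints -15.1%
```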

Current Model Problems and Solutions

Current Problems:
1. Generated images still show abnormal human features: twisted faces and low-quality body and facial features.
2. The current text captions were generated from the images by a GPT model, so caption quality cannot be guaranteed
  - Most of them start with “Still from animates”, “Animated characters”, etc.
3. In our experiments, simple prompts yield better results than prompts with additional predicates
  - For example, prompts that include only nouns and prepositions
  - The CLIP tokenizer currently in use accepts at most 77 tokens per prompt
Solutions for Future Improvement:
1. Change the loss function to align better with human visual perception
  - Currently using MSE (standard for diffusion models)
  - Incorporate a perceptual loss, which captures more abstract information such as texture, structure, and style
2. Increase the hardware and resource budget to handle more complex prompts
  - Upgrade to a better GPU, increase the batch size, and increase the LoRA rank to 64
3. Write higher-quality text captions
  - Increase linguistic diversity by using more varied syntactic patterns and structures

Future Opportunities

Anime & Game Industry
- AI-assisted character design → faster concept art creation.
- Style transfer for anime production & comics.
Personalized Content Creation
- Custom AI-generated avatars, wallpapers, and artworks.
- AI-powered manga generation based on text input.
Enhancing Model Performance
- Transformer-based Diffusion (DiT) for better generation quality.
- 3D model adaptation for AI-generated anime in gaming/metaverse.

Limitations and bias

Generated images can show abnormal human features, such as twisted faces and low-quality body and facial features. Simple prompts give the best results; the model does not handle long prompts well.

Training details

--validation_epochs=1
--resolution=1024
--train_text_encoder
--train_batch_size=1
--num_train_epochs=10
--checkpointing_steps=1000
--gradient_accumulation_steps=8
--learning_rate=1e-05
--lr_warmup_steps=500
--dataloader_num_workers=8
--allow_tf32
--mixed_precision="bf16"
--rank=16
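
These flags imply an effective batch size and a step budget that can be worked out by hand. The sketch below assumes a single GPU and that all 1,790 dataset rows are used for training; neither is stated in the training details. It also assumes the usual convention of one optimizer step per `gradient_accumulation_steps` batches.

```python
import math

num_examples = 1790     # dataset rows (assumed: all used for training)
train_batch_size = 1    # --train_batch_size
grad_accum_steps = 8    # --gradient_accumulation_steps
num_train_epochs = 10   # --num_train_epochs
lr_warmup_steps = 500   # --lr_warmup_steps

# Samples contributing to each optimizer update.
effective_batch_size = train_batch_size * grad_accum_steps

# Optimizer steps per epoch and over the whole run (single process assumed).
batches_per_epoch = math.ceil(num_examples / train_batch_size)
steps_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
total_steps = steps_per_epoch * num_train_epochs

# Share of the run that a 500-step warm-up would span.
warmup_fraction = lr_warmup_steps / total_steps
print(effective_batch_size, steps_per_epoch, total_steps, round(warmup_fraction, 3))
```

Under these assumptions the run takes about 2,240 optimizer steps with an effective batch size of 8, so a 500-step warm-up would span roughly 22% of training.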

LoRA text2image fine-tuning - yifei28/sdxl-base-1.0-Conan-lora

These are LoRA adaptation weights for stabilityai/stable-diffusion-xl-base-1.0. The weights were fine-tuned on the jugg1024/conan-gpt4-captions dataset. Some example images and their prompts are shown below.

Still from the Anime Character, Mark Zuckerberg

Still from the Anime Character, Robert Downey Jr.

Still from the Anime, Draco Malfoy

Still from the Anime, Severus Snape

LoRA for the text encoder was enabled: True.

VAE used for training: madebyollin/sdxl-vae-fp16-fix.
