|
""" |
|
Curated HuggingFace Diffusers optimization knowledge base |
|
Manually extracted and organized for reliable prompt injection |
|
""" |
|
|
|
OPTIMIZATION_GUIDE = """
# DIFFUSERS OPTIMIZATION TECHNIQUES
|
|
|
## Memory Optimization Techniques
|
|
|
### 1. Model CPU Offloading
Use `enable_model_cpu_offload()` to move whole models between GPU and CPU automatically:
```python
pipe.enable_model_cpu_offload()
```
- Saves significant VRAM by keeping only the active model on the GPU
- Automatic management, no manual intervention needed
- Compatible with most pipelines (requires the `accelerate` library)
|
|
|
### 2. Sequential CPU Offloading
Use `enable_sequential_cpu_offload()` for more aggressive memory savings:
```python
pipe.enable_sequential_cpu_offload()
```
- More memory efficient than model offloading, but much slower
- Moves submodules to the CPU after each forward pass
- Best for very limited VRAM scenarios
|
|
|
### 3. Attention Slicing
Use `enable_attention_slicing()` to reduce memory during attention computation:
```python
pipe.enable_attention_slicing()
# or specify the slice size
pipe.enable_attention_slicing("max")  # maximum slicing (lowest memory)
pipe.enable_attention_slicing(1)      # slice_size = 1
```
- Trades compute time for memory
- Most effective for high-resolution images
- Can be combined with other techniques
|
|
|
### 4. VAE Slicing
Use `enable_vae_slicing()` for large batch processing:
```python
pipe.enable_vae_slicing()
```
- Decodes images one at a time instead of all at once
- Essential for batch sizes > 4
- Minimal performance impact on single images
|
|
|
### 5. VAE Tiling
Use `enable_vae_tiling()` for high-resolution image generation:
```python
pipe.enable_vae_tiling()
```
- Enables 4K+ image generation on 8GB VRAM
- Splits images into overlapping tiles for VAE decoding
- Has no effect on 512x512 or smaller images, which fit in a single tile
|
|
|
### 6. Memory Efficient Attention (xFormers)
Use `enable_xformers_memory_efficient_attention()` if xFormers is installed:
```python
pipe.enable_xformers_memory_efficient_attention()
```
- Significantly reduces memory usage and improves speed
- Requires the xformers library to be installed (a fallback sketch follows this list)
- Compatible with most models
- Largely unnecessary on PyTorch 2.0+, which already uses memory-efficient scaled dot-product attention by default
|
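As a defensive pattern (a minimal sketch that is not part of the original guide, assuming only that `pipe` is an already loaded pipeline), the call can be wrapped so generation still works when xformers is missing:
```python
try:
    pipe.enable_xformers_memory_efficient_attention()
except Exception as err:  # exact exception type depends on the diffusers version
    # Fall back to the default attention implementation
    print(f"xformers unavailable, using default attention: {err}")
```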
|
|
## Performance Optimization Techniques
|
|
|
### 1. Half Precision (FP16/BF16)
Use lower precision to reduce memory use and improve speed:
```python
# FP16 (widely supported)
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)

# BF16 (better numerical stability, newer hardware)
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
```
- FP16: halves memory usage, widely supported
- BF16: better numerical stability, requires newer GPUs (e.g. NVIDIA Ampere or later)
- Essential for most optimization scenarios
|
|
|
### 2. Torch Compile (PyTorch 2.0+)
Use `torch.compile()` for significant speed improvements:
```python
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
# Transformer-based pipelines (e.g. FLUX, SD3) expose pipe.transformer instead of pipe.unet
# For some models, compiling the VAE decoder helps too:
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="reduce-overhead", fullgraph=True)
```
- Typically 5-50% faster, depending on model and hardware
- Requires PyTorch 2.0+
- The first run is slower because of compilation
|
|
|
### 3. Fast Schedulers
Use faster schedulers that need fewer inference steps:
```python
from diffusers import LMSDiscreteScheduler, UniPCMultistepScheduler

# LMS scheduler (good quality, fast)
pipe.scheduler = LMSDiscreteScheduler.from_config(pipe.scheduler.config)

# UniPC scheduler (fastest)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
```
|
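After swapping the scheduler, pass a smaller `num_inference_steps` when calling the pipeline. A minimal usage sketch, assuming `pipe` and `prompt` are already defined; the step count is illustrative, not a tuned value:
```python
image = pipe(prompt, num_inference_steps=20).images[0]
```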
|
|
## Hardware-Specific Optimizations
|
|
|
### NVIDIA GPU Optimizations
```python
# Enable cuDNN autotuning for convolution kernels
torch.backends.cudnn.benchmark = True

# Allow TF32 matmuls on Ampere and newer GPUs (uses Tensor Cores for FP32 work)
torch.backends.cuda.matmul.allow_tf32 = True

# Optimal data type for NVIDIA
torch_dtype = torch.float16  # or torch.bfloat16 for RTX 30/40 series
```
|
|
|
### Apple Silicon (MPS) Optimizations
```python
# Use the MPS device when available
device = "mps" if torch.backends.mps.is_available() else "cpu"
pipe = pipe.to(device)

# Recommended dtype for Apple Silicon (bfloat16 needs a recent PyTorch/macOS; float16 also works)
torch_dtype = torch.bfloat16

# Attention slicing often helps on MPS
pipe.enable_attention_slicing()
```
|
|
|
### CPU Optimizations
```python
# Use float32 on CPU (half precision is slow or unsupported on most CPUs)
torch_dtype = torch.float32

# Attention slicing reduces peak memory during attention
pipe.enable_attention_slicing()
```
|
|
|
## Model-Specific Guidelines
|
|
|
### FLUX Models
- Do NOT rely on classifier-free guidance: FLUX.1-schnell ignores `guidance_scale`, and FLUX.1-dev uses an embedded guidance value instead
- Use ~4 inference steps for FLUX.1-schnell (FLUX.1-dev needs substantially more)
- BF16 dtype recommended
- Enable attention slicing or CPU offloading for memory optimization (a loading sketch follows this list)
|
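A minimal loading sketch along these lines, assuming the `black-forest-labs/FLUX.1-schnell` checkpoint from the Hugging Face Hub and enough system RAM for CPU offloading; the prompt, step count, and output path are placeholders:
```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # keeps VRAM usage manageable

# schnell is a distilled model: very few steps, guidance_scale=0.0
image = pipe(
    "a photo of a mountain lake at sunrise",
    num_inference_steps=4,
    guidance_scale=0.0,
).images[0]
image.save("flux_schnell.png")
```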
|
|
### Stable Diffusion XL
- Enable attention slicing for high resolutions
- Use the refiner model sparingly to save memory
- Consider VAE tiling for >1024px images (a loading sketch follows this list)
|
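A minimal sketch of these guidelines, assuming the public `stabilityai/stable-diffusion-xl-base-1.0` checkpoint and a CUDA device; the 1536px resolution is only an example of a case where VAE tiling pays off:
```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.enable_attention_slicing()  # helps at high resolutions
pipe.enable_vae_tiling()         # for outputs larger than ~1024px

image = pipe(
    "a detailed painting of a lighthouse in a storm",
    height=1536,
    width=1536,
).images[0]
```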
|
|
### Stable Diffusion 1.5/2.1
- Very memory efficient base models
- Can often run without optimizations on 8GB+ VRAM
- Enable VAE slicing for batch processing (a batching sketch follows this list)
|
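A minimal batching sketch, assuming the `stabilityai/stable-diffusion-2-1` checkpoint and a CUDA device; the batch size and step count are illustrative:
```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")
pipe.enable_vae_slicing()  # decode the batch one image at a time

prompts = ["a watercolor painting of a fox"] * 8  # batch of 8 images
images = pipe(prompts, num_inference_steps=30).images
for i, image in enumerate(images):
    image.save(f"fox_{i}.png")
```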
|
|
## Memory Usage Estimation
Approximate weight footprints; actual peak usage also depends on resolution, batch size, and activations (a measurement sketch follows this list):
- FLUX.1: ~24GB in BF16/FP16 for the ~12B-parameter transformer alone, more with the T5 text encoder
- SDXL: ~7GB for FP16, ~14GB for FP32
- SD 1.5: ~2GB for FP16, ~4GB for FP32
|
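To check actual peak VRAM on a specific setup rather than relying on these estimates, PyTorch's CUDA memory statistics can be queried around a generation call. A minimal sketch, assuming a CUDA device and an already constructed `pipe`:
```python
import torch

torch.cuda.reset_peak_memory_stats()
image = pipe("a test prompt", num_inference_steps=20).images[0]
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM allocated: {peak_gb:.1f} GB")
```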
|
|
## Optimization Combinations by VRAM
|
|
|
### 24GB+ VRAM (High-end)
```python
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.bfloat16)
pipe = pipe.to("cuda")
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
```
|
|
|
### 12-24GB VRAM (Mid-range)
```python
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
# Do not call pipe.to("cuda") here: model CPU offload manages device placement itself
pipe.enable_model_cpu_offload()
pipe.enable_xformers_memory_efficient_attention()
```
|
|
|
### 8-12GB VRAM (Entry-level)
```python
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing()
pipe.enable_vae_slicing()
pipe.enable_xformers_memory_efficient_attention()
```
|
|
|
### <8GB VRAM (Low-end)
```python
pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe.enable_sequential_cpu_offload()
pipe.enable_attention_slicing("max")
pipe.enable_vae_slicing()
pipe.enable_vae_tiling()
```
|
""" |
|
|
|
|
|
def get_optimization_guide():
    """Return the curated optimization guide."""
    return OPTIMIZATION_GUIDE
|
|
|
|
|
if __name__ == "__main__":
    print("Optimization guide loaded successfully!")
    print(f"Guide length: {len(OPTIMIZATION_GUIDE)} characters")