---
license: creativeml-openrail-m
tags:
- stable-diffusion
- stable-diffusion-diffusers
- text-to-image
widget:
- text: "A high tech solarpunk utopia in the Amazon rainforest"
  example_title: Amazon rainforest
- text: "A pikachu fine dining with a view to the Eiffel Tower"
  example_title: Pikachu in Paris
- text: "A mecha robot in a favela in expressionist style"
  example_title: Expressionist robot
- text: "an insect robot preparing a delicious meal"
  example_title: Insect robot
- text: "A small cabin on top of a snowy mountain in the style of Disney, artstation"
  example_title: Snowy disney cabin
extra_gated_prompt: |-
  This model is open access and available to all, with a CreativeML OpenRAIL-M license further specifying rights and usage.
  The CreativeML OpenRAIL License specifies:

  1. You can't use the model to deliberately produce nor share illegal or harmful outputs or content
  2. The authors claim no rights on the outputs you generate, you are free to use them and are accountable for their use which must not go against the provisions set in the license
  3. You may re-distribute the weights and use the model commercially and/or as a service. If you do, please be aware you have to include the same use restrictions as the ones in the license and share a copy of the CreativeML OpenRAIL-M to all your users (please read the license entirely and carefully)
  Please read the full license carefully here: https://huggingface.co/spaces/CompVis/stable-diffusion-license
extra_gated_heading: Please read the LICENSE to access this model
---

# Kanji Diffusion v1-4 Model Card

Kanji Diffusion is a latent text-to-image diffusion model capable of hallucinating Kanji characters given any prompt. The weights here are intended to be used with the 🧨 Diffusers library. If you are looking for the weights to be loaded into the CompVis Stable Diffusion codebase, [come here](https://huggingface.co/CompVis/stable-diffusion-v-1-4-original).

## Model Details

- **Developed by:** Yashpreet Voladoddi
- **Model type:** Diffusion-based text-to-image generation model
- **Language(s):** English
- **Model Description:** This is a model that can be used to generate and modify images based on text prompts. It is a [Latent Diffusion Model](https://arxiv.org/abs/2112.10752) that uses a fixed, pretrained text encoder ([CLIP ViT-L/14](https://arxiv.org/abs/2103.00020)) as suggested in the [Imagen paper](https://arxiv.org/abs/2205.11487).
- **Resources for more information:** [GitHub Repository](https://github.com/CompVis/stable-diffusion)

## Examples

We recommend using [🤗's Diffusers library](https://github.com/huggingface/diffusers) to run Stable Diffusion.

### Colab

To run the pipeline and see how the model generates Kanji characters, follow the code below on Colab (use a T4 GPU runtime, otherwise each image takes a long time to infer). Make sure you have your Hugging Face API key / access token for this.
```python
import os
from google.colab import drive

# Mount Google Drive and work from there so outputs persist between sessions
drive.mount('/content/drive')
os.chdir("/content/drive/MyDrive")

# Install Diffusers and log in with your Hugging Face access token
!pip install diffusers
!git clone https://github.com/huggingface/diffusers
!huggingface-cli login

from diffusers import StableDiffusionPipeline
import torch

torch.cuda.empty_cache()

# Load the Stable Diffusion v1-4 base pipeline, then apply the Kanji LoRA attention weights
model_path = "yashvoladoddi37/kanji-diffusion-v1-4"
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
    use_safetensors=True,
).to("cuda")
pipe.unet.load_attn_procs(model_path)

prompt = "A Kanji meaning baby robot"
image = pipe(prompt).images[0]
image.save("baby-robot-kanji-v1-4.png")
```

### Limitations

## Training

**Training Data**

**Training Procedure**

Stable Diffusion v1-4 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training:

- Images are encoded through an encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of f = 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4.
- Text prompts are encoded through a ViT-L/14 text encoder.
- The non-pooled output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
- The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet.

Fine-tuning setup:

- **Hardware:** Nvidia GTX 1650 (4 GB VRAM), 8 GB RAM
- **Learning rate:** 1e-04

The `accelerate launch` command used on Colab is:

```bash
!accelerate launch train_text_to_image_lora.py \
  --pretrained_model_name_or_path="CompVis/stable-diffusion-v1-4" \
  --dataset_name="yashvoladoddi37/kanjienglish" --caption_column="text" \
  --resolution=512 --random_flip \
  --train_batch_size=1 \
  --num_train_epochs=1 --checkpointing_steps=500 \
  --learning_rate=1e-04 --lr_scheduler="constant" --lr_warmup_steps=0 \
  --seed=42 \
  --output_dir="kanji_sakana_english" \
  --validation_prompt="A kanji meaning Elon Musk" \
  --push_to_hub
```
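For readers who want a concrete picture of what `train_text_to_image_lora.py` is doing, here is a minimal, simplified sketch of the core training step described in the bullets above: noise is added to the image latents and the UNet is trained to predict that noise. This is an illustration under assumptions, not the actual training script; `unet`, `noise_scheduler`, `latents`, and `encoder_hidden_states` are assumed to come from a Diffusers-style setup (a `UNet2DConditionModel`, a `DDPMScheduler`, VAE-encoded images, and the non-pooled CLIP text-encoder output, respectively).

```python
import torch
import torch.nn.functional as F

def lora_training_step(unet, noise_scheduler, latents, encoder_hidden_states):
    """One simplified denoising step: add noise to latents, predict it, take MSE loss."""
    # Sample Gaussian noise and a random timestep for each latent in the batch
    noise = torch.randn_like(latents)
    timesteps = torch.randint(
        0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],),
        device=latents.device,
    ).long()

    # Forward diffusion: noise the latents according to the sampled timesteps
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # The UNet predicts the added noise, conditioned on the text embeddings
    # via cross-attention
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample

    # Reconstruction objective: MSE between the added noise and the prediction
    return F.mse_loss(noise_pred.float(), noise.float())
```

After training, the LoRA attention weights saved to `--output_dir` (or pushed to the Hub with `--push_to_hub`) can be loaded exactly as in the inference example above, via `pipe.unet.load_attn_procs(...)`.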