# Direct Consistency Optimization for Compositional Text-to-Image Personalization

This is the official implementation of the paper 'Direct Consistency Optimization for Compositional Text-to-Image Personalization'.

- [paper](https://arxiv.org/abs/2402.12004)
- [project page](https://dco-t2i.github.io/)

Our code is based on [diffusers](https://github.com/huggingface/diffusers). We fine-tune [SDXL](https://huggingface.co/docs/diffusers/using-diffusers/sdxl) using LoRA from the [peft](https://github.com/huggingface/peft) library.

## Installation

We recommend installing the latest version of diffusers from source:

```bash
git clone https://github.com/huggingface/diffusers
cd diffusers
pip install -e .
```

Then go to this repository and install its requirements:

```bash
cd dco/
pip install -r requirements.txt
```

Then initialize an [🤗Accelerate](https://github.com/huggingface/accelerate/) environment with:

```bash
accelerate config
```

Or, for a default Accelerate configuration without answering questions about your environment:

```bash
accelerate config default
```

Or, if your environment doesn't support an interactive shell (e.g., a notebook):

```python
from accelerate.utils import write_basic_config
write_basic_config()
```

When running `accelerate config`, setting torch compile mode to True can yield dramatic speedups. Note also that we use the PEFT library as the backend for LoRA training, so make sure `peft>=0.6.0` is installed in your environment.

## Subject Personalization

### Data preparation

We encourage using **comprehensive captions** for text-to-image personalization, which provide descriptive visual details on the attributes, backgrounds, etc. We do not use a rare token identifier (e.g., 'sks'), which may inherit unfavorable semantics. We also train additional textual embeddings to enhance subject fidelity. See the paper for details.

In `dataset/dreambooth/config.json`, we provide an example of the comprehensive captions that we used:

```
"comprehensive": {
    "images": [
        "dataset/dreambooth/dog/00.jpg",
        "dataset/dreambooth/dog/01.jpg",
        "dataset/dreambooth/dog/02.jpg",
        "dataset/dreambooth/dog/03.jpg",
        "dataset/dreambooth/dog/04.jpg"
    ],
    "prompts": [
        "a closed-up photo of a <dog> in front of trees, macro style",
        "a low-angle photo of a <dog> sitting on a ledge in front of blossom trees, macro style",
        "a photo of a <dog> sitting on a ledge in front of red wall and tree, macro style",
        "a photo of side-view of a <dog> sitting on a ledge in front of red wall and tree, macro style",
        "a photo of a <dog> sitting on a street, in front of lush trees, macro style"
    ],
    "base_prompts": [
        "a closed-up photo of a dog in front of trees, macro style",
        "a low-angle photo of a dog sitting on a ledge in front of blossom trees, macro style",
        "a photo of a dog sitting on a ledge in front of red wall and tree, macro style",
        "a photo of side-view of a dog sitting on a ledge in front of red wall and tree, macro style",
        "a photo of a dog sitting on a street, in front of lush trees, macro style"
    ],
    "inserting_tokens": ["<dog>"],
    "initializer_tokens": ["dog"]
}
```

`images` is a list of paths to the training images, `prompts` is a list of training prompts that contain the new tokens (*e.g.,* `<dog>`), and `base_prompts` is a list of the same training prompts without the new tokens. `inserting_tokens` is a list of the tokens to be learned, and `initializer_tokens` is a list of the tokens used to initialize them. If you do not want an initializer token, put an empty string (*i.e.,* `""`) in `initializer_tokens`. Note that the norm of each token embedding is rescaled after every iteration to match its original norm.

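Because the fields above have to line up with one another, it can help to check a config entry before launching training. The following is a minimal sketch, not part of this repository: the helper `validate_config_entry` is hypothetical, the `<dog>` placeholder token is used only for illustration, and the rules it enforces (one prompt per image, one initializer per new token, new tokens present in `prompts` but absent from `base_prompts`) are assumptions inferred from the example rather than checks performed by `customize.py`.

```python
import json
from typing import List


def validate_config_entry(entry: dict) -> List[str]:
    """Return human-readable problems found in one config entry (empty if none)."""
    problems = []
    images = entry.get("images", [])
    prompts = entry.get("prompts", [])
    base_prompts = entry.get("base_prompts", [])  # optional, e.g. absent for style configs
    inserting = entry.get("inserting_tokens", [])
    initializer = entry.get("initializer_tokens", [])

    # Assumption: one training prompt per training image.
    if len(prompts) != len(images):
        problems.append(f"{len(images)} images but {len(prompts)} prompts")

    # Assumption: base prompts, when given, pair up one-to-one with prompts.
    if base_prompts and len(base_prompts) != len(prompts):
        problems.append(f"{len(prompts)} prompts but {len(base_prompts)} base prompts")

    # Assumption: each new token has exactly one initializer ("" means none).
    if len(initializer) != len(inserting):
        problems.append("inserting_tokens and initializer_tokens differ in length")

    # Assumption: every prompt mentions every new token, and no base prompt does.
    for token in filter(None, inserting):
        for i, prompt in enumerate(prompts):
            if token not in prompt:
                problems.append(f"prompt {i} is missing token {token!r}")
        for i, prompt in enumerate(base_prompts):
            if token in prompt:
                problems.append(f"base prompt {i} contains token {token!r}")
    return problems


example = json.loads("""
{
  "images": ["dataset/dreambooth/dog/00.jpg"],
  "prompts": ["a closed-up photo of a <dog> in front of trees, macro style"],
  "base_prompts": ["a closed-up photo of a dog in front of trees, macro style"],
  "inserting_tokens": ["<dog>"],
  "initializer_tokens": ["dog"]
}
""")
print(validate_config_entry(example))  # prints []
```

An empty list means the entry passed every check. An entry without `base_prompts` or token fields also passes, as long as it has one prompt per image.
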
### Training scripts

To train the model, run the following command:

```
accelerate launch customize.py \
  --config_dir="dataset/dreambooth/dog/config.json" \
  --config_name="comprehensive" \
  --output_dir="./output" \
  --learning_rate=5e-5 \
  --text_encoder_lr=5e-6 \
  --dcoloss_beta=1000 \
  --rank=32 \
  --max_train_steps=2000 \
  --checkpointing_steps=1000 \
  --seed="0" \
  --train_text_encoder_ti
```

Note that `--dcoloss_beta` is the hyperparameter of the DCO loss (values of 1000-2000 work fine in our experiments). `--train_text_encoder_ti` enables learning the textual embeddings.

### Inference

To run inference with reward guidance, import `RGPipe` from `reward_guidance.py`. Then load the LoRA weights and textual embeddings:

```
import torch
import os
from safetensors.torch import load_file
from reward_guidance import RGPipe

pipe = RGPipe.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

lora_dir = "OUTPUT_DIR"  # saved lora directory
pipe.load_lora_weights(lora_dir)

inserting_tokens = ["<dog>"]  # load new tokens
state_dict = load_file(lora_dir + "/learned_embeds.safetensors")
# SDXL has two text encoders, so the learned embeddings are loaded into both
pipe.load_textual_inversion(state_dict["clip_l"], token=inserting_tokens, text_encoder=pipe.text_encoder, tokenizer=pipe.tokenizer)
pipe.load_textual_inversion(state_dict["clip_g"], token=inserting_tokens, text_encoder=pipe.text_encoder_2, tokenizer=pipe.tokenizer_2)

prompt = "A <dog> playing saxophone in sticker style"  # prompt including new tokens
base_prompt = "A dog playing saxophone in sticker style"  # prompt without new tokens

seed = 42
generator = torch.Generator("cuda").manual_seed(seed)

rg_scale = 3.0  # rg scale. 0.0 for original CFG sampling
if rg_scale > 0.0:
    # sampling with reward guidance
    image = pipe.my_gen(
        prompt=base_prompt,
        prompt_ti=prompt,
        generator=generator,
        cross_attention_kwargs={"scale": 1.0},
        guidance_scale=7.5,
        guidance_scale_lora=rg_scale,
    ).images[0]
else:
    # original CFG sampling
    image = pipe(
        prompt=prompt,
        generator=generator,
        cross_attention_kwargs={"scale": 1.0},
        guidance_scale=7.5,
    ).images[0]
image
```

## Style Personalization

### Data preparation

We use the same format as before, but we do not train textual embeddings for style personalization. An example config is:

```
"style": {
    "images": ["dataset/styledrop/style.jpg"],
    "prompts": ["A person working on a laptop in flat cartoon illustration style"]
}
```

### Training scripts

```
accelerate launch customize.py \
  --config_dir="dataset/styledrop/config.json" \
  --config_name="style_1" \
  --output_dir="./output_style" \
  --learning_rate=5e-5 \
  --dcoloss_beta=1000 \
  --rank=64 \
  --max_train_steps=1000 \
  --seed="0" \
  --offset_noise=0.1
```

Note that we use `--offset_noise=0.1` to learn the solid colors of the style image. Inference is the same as above.

## My Subject in My Style

DCO fine-tuned models can be easily merged without any post-processing.
Simply add the following code during inference:

```
pipe.load_lora_weights(subject_lora_dir, adapter_name="subject")
# load the textual embeddings only if the subject model was trained with them
if args.text_encoder_ti:
    state_dict = load_file(subject_lora_dir + "/learned_embeds.safetensors")
    pipe.load_textual_inversion(state_dict["clip_l"], token=inserting_tokens, text_encoder=pipe.text_encoder, tokenizer=pipe.tokenizer)
    pipe.load_textual_inversion(state_dict["clip_g"], token=inserting_tokens, text_encoder=pipe.text_encoder_2, tokenizer=pipe.tokenizer_2)
pipe.load_lora_weights(style_lora_dir, adapter_name="style")
# activate both adapters with equal weight
pipe.set_adapters(["subject", "style"], adapter_weights=[1.0, 1.0])
```

## BibTeX

```
@article{lee2024direct,
  title={Direct Consistency Optimization for Compositional Text-to-Image Personalization},
  author={Lee, Kyungmin and Kwak, Sangkyung and Sohn, Kihyuk and Shin, Jinwoo},
  journal={arXiv preprint arXiv:2402.12004},
  year={2024}
}
```