Improve model card with metadata and links #1
by nielsr (HF Staff) - opened

README.md CHANGED
@@ -1,3 +1,124 @@
- ---
- license: cc-by-nc-4.0
- ---

---
license: cc-by-nc-4.0
pipeline_tag: image-to-image
library_name: diffusers
---

# ColorizeDiffusion: Adjustable Sketch Colorization with Reference Image and Text

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6303f37c3926de1f7ec42d3e/QXSDkTxhIT3HLC1e_2_mT.png)

This model, presented in [Image Referenced Sketch Colorization Based on Animation Creation Workflow](https://hf.co/papers/2502.19937), performs sketch colorization using a reference image and text prompts. It addresses limitations of existing methods by using a diffusion-based framework inspired by animation workflows.

The abstract of the paper is the following:

Sketch colorization plays an important role in animation and digital illustration production tasks. However, existing methods still meet problems in that text-guided methods fail to provide accurate color and style references, hint-guided methods still involve manual operation, and image-referenced methods are prone to causing artifacts. To address these limitations, we propose a diffusion-based framework inspired by real-world animation production workflows. Our approach leverages the sketch as the spatial guidance and an RGB image as the color reference, and separately extracts foreground and background from the reference image with spatial masks. In particular, we introduce a split cross-attention mechanism with LoRA (Low-Rank Adaptation) modules. They are trained separately with foreground and background regions to control the corresponding embeddings for keys and values in cross-attention. This design allows the diffusion model to integrate information from foreground and background independently, preventing interference and eliminating spatial artifacts. During inference, we design switchable inference modes for diverse use scenarios by changing the modules activated in the framework. Extensive qualitative and quantitative experiments, along with user studies, demonstrate our advantages over existing methods in generating high-quality, artifact-free results with geometrically mismatched references. Ablation studies further confirm the effectiveness of each component.
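
As a rough illustration of the split cross-attention idea described in the abstract, the toy PyTorch sketch below is not the released implementation: the class, tensor names, and masking details are assumptions. It only shows the core mechanism of sketch queries attending to foreground and background reference tokens through separate key/value projections (which the paper adapts with LoRA), with the two outputs recombined.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitCrossAttention(nn.Module):
    """Toy split cross-attention: foreground and background reference tokens use
    separate key/value projections, and each branch may only attend to tokens of
    the matching region, so the two sources do not interfere."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_kv_fg = nn.Linear(dim, 2 * dim)  # keys/values for foreground tokens
        self.to_kv_bg = nn.Linear(dim, 2 * dim)  # keys/values for background tokens
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, ref_tokens, fg_mask):
        # x:          (B, N_q, C) sketch/U-Net features (queries)
        # ref_tokens: (B, N_r, C) reference image tokens
        # fg_mask:    (B, N_r) boolean, True where a reference token is foreground
        q = self.to_q(x)

        def attend(kv_proj, keep):
            k, v = kv_proj(ref_tokens).chunk(2, dim=-1)
            # boolean attn_mask: queries may only attend to the selected tokens
            return F.scaled_dot_product_attention(q, k, v, attn_mask=keep[:, None, :])

        out = attend(self.to_kv_fg, fg_mask) + attend(self.to_kv_bg, ~fg_mask)
        return self.to_out(out)


# Smoke test with random tensors
attn = SplitCrossAttention(dim=64)
x = torch.randn(1, 16, 64)
ref = torch.randn(1, 32, 64)
fg = torch.zeros(1, 32, dtype=torch.bool)
fg[:, :20] = True
print(attn(x, ref, fg).shape)  # torch.Size([1, 16, 64])
```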

Code: https://github.com/tellurion-kanata/colorizeDiffusion

(July 2024)
Foundational paper for this repository: [ColorizeDiffusion](https://arxiv.org/abs/2401.01456).
- Version 1: WACV 2025 (basic reference-based training, to be published soon).
- Version 1.5: Solving spatial entanglement, [ColorizeDiffusion 1.5](https://arxiv.org/html/2502.19937v1). The corresponding code will be released in a month.
- Version 2 paper and code: Enhancing background and style transfer. Available soon.
- Version XL: Ongoing.

For details on why the training is organized this way, please refer to version 3 of the arXiv paper (more detailed than the WACV version).
Model weights are available at https://huggingface.co/tellurion/colorizer.
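
To fetch the released weights locally, one option is `huggingface_hub` (assumed installed via `pip install huggingface_hub`); the repo id is taken from the link above:

```python
# Download the checkpoint files from the Hugging Face Hub to a local cache directory.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="tellurion/colorizer")
print(local_dir)  # path to the downloaded files
```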

## Implementation Details
The repository offers the updated implementation of ColorizeDiffusion.
Currently, only the noisy model introduced in the paper, which utilizes the local tokens, is provided.

## Getting Started
To use the code in this repository, ensure that you have installed the required dependencies as specified in the requirements.

### To install and run:
```shell
conda env create -f environment.yaml
conda activate hf
```

## User Interface:
We also provide a Web UI built on Gradio. To run it, simply execute:
```shell
python -u gradio_ui.py
```
Then you can browse the UI at http://localhost:7860/.
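
If you prefer to script against the running UI instead of using the browser, the `gradio_client` package can connect to the same address. The snippet below only inspects the app's endpoints; their names and parameter lists depend on the UI layout and are not documented here, so check them before calling `predict`.

```python
# Minimal sketch: connect to the locally running Gradio app and list its callable API.
from gradio_client import Client

client = Client("http://localhost:7860/")
client.view_api()  # prints the available endpoints and their parameters
```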

### Inference:
-------------------------------------------------------------------------------------------
#### Inference options:
| Options                   | Description                                                                              |
|:--------------------------|:-----------------------------------------------------------------------------------------|
| Crossattn scale           | Used to diminish all kinds of artifacts caused by the distribution problem.               |
| Pad reference             | Activate to use "Pad reference with margin".                                              |
| Pad reference with margin | Used to diminish spatial entanglement; pads the reference to T times its current width.   |
| Reference guidance scale  | Classifier-free guidance scale for the reference image; 5 is suggested.                   |
| Sketch guidance scale     | Classifier-free guidance scale for the sketch image; 1 is suggested.                      |
| Attention injection       | Strengthens similarity with the reference.                                                |
| Visualize                 | Used for local manipulation. Visualizes the regions selected by each threshold.           |
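
Since the table above exposes two separate classifier-free guidance scales, the sketch below shows one plausible way they could compose, following the common two-condition CFG formulation (as in InstructPix2Pix). This is a hypothetical illustration; the repository's actual conditioning order and formulation may differ.

```python
import torch

def dual_cfg(eps_uncond: torch.Tensor,
             eps_sketch: torch.Tensor,
             eps_sketch_ref: torch.Tensor,
             sketch_scale: float = 1.0,
             ref_scale: float = 5.0) -> torch.Tensor:
    """Combine sketch and reference guidance (hypothetical composition).
    eps_uncond:     noise prediction with neither sketch nor reference
    eps_sketch:     noise prediction conditioned on the sketch only
    eps_sketch_ref: noise prediction conditioned on sketch + reference
    """
    return (eps_uncond
            + sketch_scale * (eps_sketch - eps_uncond)
            + ref_scale * (eps_sketch_ref - eps_sketch))
```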

For artifacts like the spatial entanglement (the distribution problem discussed in the paper) shown below:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6303f37c3926de1f7ec42d3e/jC9rMtCs8pCBEEA8Oa1DX.png)
**We have solved the spatial entanglement in the latest implementation, which will be released soon.**

For the current open-source version, try:
1. Activate **Pad reference** and increase **Pad reference with margin** to around 1.5 (a padding sketch follows this list), or
2. Reduce **Overall crossattn scale** to 0.4-0.8. (Best for handling all kinds of artifacts caused by the distribution problem, but it accordingly degrades the similarity with references.)
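
As a rough illustration of option 1: the snippet below pads a reference image to 1.5x its width on a white canvas. The function name, the centred placement, the white fill, and the example file name are assumptions, since the UI's exact padding behaviour is not documented here.

```python
from PIL import Image

def pad_reference(ref: Image.Image, margin: float = 1.5) -> Image.Image:
    """Pad the reference image to `margin` times its original width,
    centring it on a white canvas (fill colour is an assumption)."""
    canvas = Image.new("RGB", (int(ref.width * margin), ref.height), "white")
    canvas.paste(ref, ((canvas.width - ref.width) // 2, 0))
    return canvas

# Example usage with a hypothetical file name:
padded = pad_reference(Image.open("reference.png").convert("RGB"), margin=1.5)
```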

We offer precise control of the crossattn scales; check the **Accurate control** part.

A simple but effective solution to remove spatial entanglement might be to directly generate large amounts of semantically paired images as training data using image-variation methods, yet this semantic alignment still results in the distribution problem.


When using a stylized image like ***The Starry Night*** for style transfer, try **Attention injection** with a **Karras**-based sampler.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6303f37c3926de1f7ec42d3e/X6PGAYEmFD4dPi5hEPusZ.png)

### Manipulation:
The colorization results can be manipulated using text prompts.

For local manipulations, a visualization is provided to show the correlation between each prompt and the tokens in the reference image.

The manipulation result and correlation visualization for the following settings:

Target prompt: the girl's blonde hair
Anchor prompt: the girl's brown hair
Control prompt: the girl's brown hair
Target scale: 8
Enhanced: false
Thresholds: 0.5, 0.55, 0.65, 0.95

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6303f37c3926de1f7ec42d3e/KTWRo8NCZmCWPRIbCHBqM.png)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6303f37c3926de1f7ec42d3e/y7eHs5cRcEtUJQ5Wom36S.png)
As you can see, the manipulation unavoidably changes some unrelated regions, as it is applied to the reference embeddings.

#### Manipulation options:
| Options | Description |
| :----- |:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Group index | The index of the selected manipulation sequence's parameter group. |
| Target prompt | The prompt used to specify the desired visual attribute for the image after manipulation. |
| Anchor prompt | The prompt used to specify the anchored visual attribute for the image before manipulation. |
| Control prompt | Used for local manipulation (crossattn-based models). The prompt used to specify the target regions. |
| Enhance | Specifies whether this manipulation should be enhanced or not (more likely to influence unrelated attributes). |
| Target scale | The scale used to progressively control the manipulation. |
| Thresholds | Used for local manipulation (crossattn-based models). Four hyperparameters used to reduce the influence on irrelevant visual attributes, where 0.0 < threshold 0 < threshold 1 < threshold 2 < threshold 3 < 1.0. |
| \<Threshold0 | Selects regions most related to the control prompt. Indicated by deep blue. |
| Threshold0-Threshold1 | Selects regions related to the control prompt. Indicated by blue. |
| Threshold1-Threshold2 | Selects neighbouring but unrelated regions. Indicated by green. |
| Threshold2-Threshold3 | Selects unrelated regions. Indicated by orange. |
| \>Threshold3 | Selects most unrelated regions. Indicated by brown. |
| Add | Click **Add** to save the current manipulation in the sequence. |
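
To make the interplay of Target/Anchor/Control prompts, target scale, and thresholds above concrete, here is a toy sketch of one way such prompt-based manipulation of reference embeddings could work. It is an illustration, not the repository's code: the encoder, the relevance score, and the per-band weights are assumptions. Reference tokens that fall in the more-related threshold bands are shifted along the direction from the anchor embedding to the target embedding, scaled by the target scale.

```python
import torch
import torch.nn.functional as F

def manipulate_reference_tokens(ref_tokens,      # (N, C) reference image tokens
                                target_emb,      # (C,)   embedding of the target prompt
                                anchor_emb,      # (C,)   embedding of the anchor prompt
                                control_emb,     # (C,)   embedding of the control prompt
                                target_scale=8.0,
                                thresholds=(0.5, 0.55, 0.65, 0.95)):
    """Shift reference tokens along (target - anchor), weighting each token by how
    strongly it relates to the control prompt (toy illustration only)."""
    direction = target_emb - anchor_emb

    # Per-token relevance score w.r.t. the control prompt. The table's convention is
    # that LOWER values mean MORE related, so a cosine distance is used here.
    score = 1.0 - F.cosine_similarity(ref_tokens, control_emb[None, :], dim=-1)  # (N,)

    # Piecewise weights mirroring the threshold bands in the table above;
    # the 1.0 / 0.75 / 0.25 values are arbitrary illustration choices.
    t0, t1, t2, _ = thresholds
    weight = torch.zeros_like(score)
    weight[score < t0] = 1.0                       # most related to the control prompt
    weight[(score >= t0) & (score < t1)] = 0.75    # related
    weight[(score >= t1) & (score < t2)] = 0.25    # neighbouring but weakly related
    # tokens with score >= t2 are treated as unrelated and left unchanged

    return ref_tokens + target_scale * weight[:, None] * direction[None, :]
```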

## Code reference
1. [Stable Diffusion v2](https://github.com/Stability-AI/stablediffusion)
2. [Stable Diffusion XL](https://github.com/Stability-AI/generative-models)
3. [SD-webui-ControlNet](https://github.com/Mikubill/sd-webui-controlnet)
4. [Stable-Diffusion-webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui)
5. [K-diffusion](https://github.com/crowsonkb/k-diffusion)
6. [Deepspeed](https://github.com/microsoft/DeepSpeed)
7. [sketchKeras-PyTorch](https://github.com/higumax/sketchKeras-pytorch)