SahilCarterr committed on
Commit 703a7c0 · verified · 1 Parent(s): fd92c88

Upload 27 files
.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ ControlNetInpaint/output/baseline_grid.png filter=lfs diff=lfs merge=lfs -text
+ ControlNetInpaint/output/canny_cheeseburger_grid.png filter=lfs diff=lfs merge=lfs -text
+ ControlNetInpaint/output/canny_cheeseburger.png filter=lfs diff=lfs merge=lfs -text
ControlNetInpaint/.gitignore ADDED
@@ -0,0 +1,129 @@
+ # Byte-compiled / optimized / DLL files
+ __pycache__/
+ *.py[cod]
+ *$py.class
+
+ # C extensions
+ *.so
+
+ # Distribution / packaging
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ pip-wheel-metadata/
+ share/python-wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+
+ # PyInstaller
+ # Usually these files are written by a python script from a template
+ # before PyInstaller builds the exe, so as to inject date/other infos into it.
+ *.manifest
+ *.spec
+
+ # Installer logs
+ pip-log.txt
+ pip-delete-this-directory.txt
+
+ # Unit test / coverage reports
+ htmlcov/
+ .tox/
+ .nox/
+ .coverage
+ .coverage.*
+ .cache
+ nosetests.xml
+ coverage.xml
+ *.cover
+ *.py,cover
+ .hypothesis/
+ .pytest_cache/
+
+ # Translations
+ *.mo
+ *.pot
+
+ # Django stuff:
+ *.log
+ local_settings.py
+ db.sqlite3
+ db.sqlite3-journal
+
+ # Flask stuff:
+ instance/
+ .webassets-cache
+
+ # Scrapy stuff:
+ .scrapy
+
+ # Sphinx documentation
+ docs/_build/
+
+ # PyBuilder
+ target/
+
+ # Jupyter Notebook
+ .ipynb_checkpoints
+
+ # IPython
+ profile_default/
+ ipython_config.py
+
+ # pyenv
+ .python-version
+
+ # pipenv
+ # According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
+ # However, in case of collaboration, if having platform-specific dependencies or dependencies
+ # having no cross-platform support, pipenv may install dependencies that don't work, or not
+ # install all needed dependencies.
+ #Pipfile.lock
+
+ # PEP 582; used by e.g. github.com/David-OConnor/pyflow
+ __pypackages__/
+
+ # Celery stuff
+ celerybeat-schedule
+ celerybeat.pid
+
+ # SageMath parsed files
+ *.sage.py
+
+ # Environments
+ .env
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+
+ # Spyder project settings
+ .spyderproject
+ .spyproject
+
+ # Rope project settings
+ .ropeproject
+
+ # mkdocs documentation
+ /site
+
+ # mypy
+ .mypy_cache/
+ .dmypy.json
+ dmypy.json
+
+ # Pyre type checker
+ .pyre/
ControlNetInpaint/ControlNet-with-Inpaint-Demo-colab.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
ControlNetInpaint/ControlNet-with-Inpaint-Demo.ipynb ADDED
@@ -0,0 +1,1130 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "43976805",
+ "metadata": {},
+ "source": [
+ "# Inpainting with ControlNet\n",
+ "This notebook contains examples of using a new `StableDiffusionControlNetInpaintPipeline`.\n",
+ "\n",
+ "The two main parameters you can play with are the strength of text guidance and image guidance:\n",
+ "* Text guidance (`guidance_scale`) is set to `7.5` by default, and usually this value works quite well.\n",
+ "* Image guidance (`controlnet_conditioning_scale`) is set to `0.4` by default. This value is a good starting point, but can be lowered if there is a big misalignment between the text prompt and the control image (meaning that it is very hard to \"imagine\" an output image that both satisfies the text prompt and aligns with the control image).\n",
+ "\n",
+ "The naming of these parameters follows the other pipelines `StableDiffusionInpaintPipeline` and `StableDiffusionControlNetPipeline`, and the same convention has been preserved for consistency."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "33c2f672",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from diffusers import StableDiffusionInpaintPipeline, ControlNetModel, UniPCMultistepScheduler\n",
+ "from src.pipeline_stable_diffusion_controlnet_inpaint import *\n",
+ "from diffusers.utils import load_image\n",
+ "\n",
+ "import cv2\n",
+ "from PIL import Image\n",
+ "import numpy as np\n",
+ "import torch\n",
+ "from matplotlib import pyplot as plt"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cb869cff",
+ "metadata": {},
+ "source": [
+ "### Baseline: Stable Diffusion 1.5 Inpainting\n",
+ "The StableDiffusion1.5 Inpainting model is used as the core for ControlNet inpainting. For reference, you can also run the same examples on this core model alone:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f011126d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "pipe_sd = StableDiffusionInpaintPipeline.from_pretrained(\n",
+ " \"runwayml/stable-diffusion-inpainting\",\n",
+ " revision=\"fp16\",\n",
+ " torch_dtype=torch.float16,\n",
+ ")\n",
+ "# speed up diffusion process with faster scheduler and memory optimization\n",
+ "pipe_sd.scheduler = UniPCMultistepScheduler.from_config(pipe_sd.scheduler.config)\n",
+ "# remove following line if xformers is not installed\n",
+ "pipe_sd.enable_xformers_memory_efficient_attention()\n",
+ "\n",
+ "pipe_sd.to('cuda')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a4d89ea7",
+ "metadata": {},
+ "source": [
+ "### Task\n",
+ "Let's start by turning this dog into a red panda using various types of guidance!\n",
+ "\n",
+ "All we need is an `image`, a `mask`, and a `text_prompt` of **\"a red panda sitting on a bench\"**"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "517add62",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# download an image\n",
+ "image = load_image(\n",
+ " \"https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png\"\n",
+ ")\n",
+ "image = np.array(image)\n",
+ "mask_image = load_image(\n",
+ " \"https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png\"\n",
+ ")\n",
+ "mask_image = np.array(mask_image)\n",
+ "\n",
+ "text_prompt=\"a red panda sitting on a bench\"\n",
+ "\n",
+ "plt.figure(figsize=(12,4))\n",
+ "\n",
+ "plt.subplot(1,2,1)\n",
+ "plt.imshow(image)\n",
+ "plt.axis('off')\n",
+ "plt.title('Input')\n",
+ "plt.subplot(1,2,2)\n",
+ "plt.imshow((255-np.array(image))*(255-np.array(mask_image)))\n",
+ "plt.axis('off')\n",
+ "plt.title('Masked')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "489f2543",
+ "metadata": {},
+ "source": [
+ "## Canny Edge"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "906b2654",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# get canny image\n",
+ "canny_image = cv2.Canny(image, 100, 200)\n",
+ "canny_image = canny_image[:, :, None]\n",
+ "canny_image = np.concatenate([canny_image, canny_image, canny_image], axis=2)\n",
+ "\n",
+ "image=Image.fromarray(image)\n",
+ "mask_image=Image.fromarray(mask_image)\n",
+ "canny_image = Image.fromarray(canny_image)\n",
+ "\n",
+ "canny_image"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "41d35b98",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# load control net and stable diffusion v1-5\n",
+ "controlnet = ControlNetModel.from_pretrained(\"lllyasviel/sd-controlnet-canny\", torch_dtype=torch.float16)\n",
+ "pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(\n",
+ " \"runwayml/stable-diffusion-inpainting\", controlnet=controlnet, torch_dtype=torch.float16\n",
+ " )\n",
+ "\n",
+ "# speed up diffusion process with faster scheduler and memory optimization\n",
+ "pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)\n",
+ "# remove following line if xformers is not installed\n",
+ "pipe.enable_xformers_memory_efficient_attention()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d6146702",
+ "metadata": {},
+ "source": [
+ "### Scaling image control...\n",
+ "In this example, the `canny_image` input is actually quite hard to satisfy with our text prompt due to a lot of local noise. In this special case, we adjust `controlnet_conditioning_scale` to `0.5` to make this guidance more subtle.\n",
+ "\n",
+ "In all other examples, the default value of `controlnet_conditioning_scale` = `1.0` works rather well!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e5069621",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "pipe.to('cuda')\n",
+ "\n",
+ "# generate image\n",
+ "generator = torch.manual_seed(0)\n",
+ "new_image = pipe(\n",
+ " text_prompt,\n",
+ " num_inference_steps=20,\n",
+ " generator=generator,\n",
+ " image=image,\n",
+ " control_image=canny_image,\n",
+ " controlnet_conditioning_scale = 0.5,\n",
+ " mask_image=mask_image\n",
+ ").images[0]\n",
+ "\n",
+ "new_image.save('output/canny_result.png')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2f9c6ff6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "plt.figure(figsize=(12,4))\n",
+ "\n",
+ "plt.subplot(1,4,1)\n",
+ "plt.imshow(image)\n",
+ "plt.axis('off')\n",
+ "plt.title('Input')\n",
+ "plt.subplot(1,4,2)\n",
+ "plt.imshow((255-np.array(image))*(255-np.array(mask_image)))\n",
+ "plt.axis('off')\n",
+ "plt.title('Masked')\n",
+ "plt.subplot(1,4,3)\n",
+ "plt.imshow(canny_image)\n",
+ "plt.axis('off')\n",
+ "plt.title('Condition')\n",
+ "plt.subplot(1,4,4)\n",
+ "plt.imshow(new_image)\n",
+ "plt.title('Output')\n",
+ "plt.axis('off')\n",
+ "\n",
+ "plt.savefig('output/canny_grid.png',\n",
+ " dpi=200,\n",
+ " bbox_inches='tight',\n",
+ " pad_inches=0.0\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "87de0502",
+ "metadata": {},
+ "source": [
+ "### Comparison: vanilla inpainting from StableDiffusion1.5"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f2ef71fe",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# generate image\n",
+ "generator = torch.manual_seed(0)\n",
+ "new_image = pipe_sd(\n",
+ " text_prompt,\n",
+ " num_inference_steps=20,\n",
+ " generator=generator,\n",
+ " image=image,\n",
+ " mask_image=mask_image\n",
+ ").images[0]\n",
+ "\n",
+ "new_image.save('output/baseline_result.png')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e09513c8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "plt.figure(figsize=(12,4))\n",
+ "\n",
+ "plt.subplot(1,3,1)\n",
+ "plt.imshow(image)\n",
+ "plt.axis('off')\n",
+ "plt.title('Input')\n",
+ "plt.subplot(1,3,2)\n",
+ "plt.imshow((255-np.array(image))*(255-np.array(mask_image)))\n",
+ "plt.axis('off')\n",
+ "plt.title('Masked')\n",
+ "plt.subplot(1,3,3)\n",
+ "plt.imshow(new_image)\n",
+ "plt.title('Output')\n",
+ "plt.axis('off')\n",
+ "\n",
+ "plt.savefig('output/baseline_grid.png',\n",
+ " dpi=200,\n",
+ " bbox_inches='tight',\n",
+ " pad_inches=0.0\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "0569600e",
+ "metadata": {},
+ "source": [
+ "## Challenging Examples 🐕➡️🍔\n",
+ "Let's see how tuning the `controlnet_conditioning_scale` works out for a more challenging example of turning the dog into a cheeseburger!\n",
+ "\n",
+ "In this case, we **demand a large semantic leap** and that requires a more subtle guide from the control image!"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "69a352a4",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "difficult_text_prompt=\"a big cheeseburger sitting on a bench\""
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0803c982",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# First - StableDiffusion1.5 baseline (no ControlNet)\n",
+ "\n",
+ "# generate image\n",
+ "generator = torch.manual_seed(0)\n",
+ "new_image = pipe_sd(\n",
+ " difficult_text_prompt,\n",
+ " num_inference_steps=20,\n",
+ " generator=generator,\n",
+ " image=image,\n",
+ " mask_image=mask_image\n",
+ ").images[0]\n",
+ "\n",
+ "sd_output=new_image\n",
+ "sd_output"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "319b867e",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "89dbb557",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "d6d74fdd",
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ },
+ {
+ "cell_type": "markdown",
+ "id": "bdaa2483",
+ "metadata": {},
+ "source": [
+ "## HED"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1c5f1ead",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from controlnet_aux import HEDdetector\n",
+ "\n",
+ "hed = HEDdetector.from_pretrained('lllyasviel/ControlNet')\n",
+ "\n",
+ "hed_image = hed(image)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "192a9881",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "controlnet = ControlNetModel.from_pretrained(\n",
+ " \"fusing/stable-diffusion-v1-5-controlnet-hed\", torch_dtype=torch.float16\n",
+ ")\n",
+ "pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(\n",
+ " \"runwayml/stable-diffusion-inpainting\", controlnet=controlnet, torch_dtype=torch.float16\n",
+ " )\n",
+ "\n",
+ "pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)\n",
+ "\n",
+ "# Remove if you do not have xformers installed\n",
+ "# see https://huggingface.co/docs/diffusers/v0.13.0/en/optimization/xformers#installing-xformers\n",
+ "# for installation instructions\n",
+ "pipe.enable_xformers_memory_efficient_attention()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "aa054f4e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "pipe.to('cuda')\n",
+ "\n",
+ "# generate image\n",
+ "generator = torch.manual_seed(0)\n",
+ "new_image = pipe(\n",
+ " text_prompt,\n",
+ " num_inference_steps=20,\n",
+ " generator=generator,\n",
+ " image=image,\n",
+ " control_image=hed_image,\n",
+ " mask_image=mask_image\n",
+ ").images[0]\n",
+ "\n",
+ "new_image.save('output/hed_result.png')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "cc33ddfa",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "plt.figure(figsize=(12,4))\n",
+ "\n",
+ "plt.subplot(1,4,1)\n",
+ "plt.imshow(image)\n",
+ "plt.axis('off')\n",
+ "plt.title('Input')\n",
+ "plt.subplot(1,4,2)\n",
+ "plt.imshow((255-np.array(image))*(255-np.array(mask_image)))\n",
+ "plt.axis('off')\n",
+ "plt.title('Masked')\n",
+ "plt.subplot(1,4,3)\n",
+ "plt.imshow(hed_image)\n",
+ "plt.axis('off')\n",
+ "plt.title('Condition')\n",
+ "plt.subplot(1,4,4)\n",
+ "plt.imshow(new_image)\n",
+ "plt.title('Output')\n",
+ "plt.axis('off')\n",
+ "\n",
+ "plt.savefig('output/hed_grid.png',\n",
+ " dpi=200,\n",
+ " bbox_inches='tight',\n",
+ " pad_inches=0.0\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1be22a64",
+ "metadata": {},
+ "source": [
+ "### Scribble"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e4b376bb",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from controlnet_aux import HEDdetector\n",
+ "\n",
+ "hed = HEDdetector.from_pretrained('lllyasviel/ControlNet')\n",
+ "\n",
+ "scribble_image = hed(image,scribble=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b0c63b8f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "controlnet = ControlNetModel.from_pretrained(\n",
+ " \"fusing/stable-diffusion-v1-5-controlnet-scribble\", torch_dtype=torch.float16\n",
+ ")\n",
+ "pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(\n",
+ " \"runwayml/stable-diffusion-inpainting\", controlnet=controlnet, torch_dtype=torch.float16\n",
+ " )\n",
+ "\n",
+ "pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)\n",
+ "\n",
+ "# Remove if you do not have xformers installed\n",
+ "# see https://huggingface.co/docs/diffusers/v0.13.0/en/optimization/xformers#installing-xformers\n",
+ "# for installation instructions\n",
+ "pipe.enable_xformers_memory_efficient_attention()\n",
+ "\n",
+ "#pipe.enable_model_cpu_offload()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f30189e0",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "pipe.to('cuda')\n",
+ "\n",
+ "# generate image\n",
+ "generator = torch.manual_seed(0)\n",
+ "new_image = pipe(\n",
+ " text_prompt,\n",
+ " num_inference_steps=20,\n",
+ " generator=generator,\n",
+ " image=image,\n",
+ " control_image=scribble_image,\n",
+ " mask_image=mask_image\n",
+ ").images[0]\n",
+ "\n",
+ "new_image.save('output/scribble_result.png')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8de59fe6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "plt.figure(figsize=(12,4))\n",
+ "\n",
+ "plt.subplot(1,4,1)\n",
+ "plt.imshow(image)\n",
+ "plt.axis('off')\n",
+ "plt.title('Input')\n",
+ "plt.subplot(1,4,2)\n",
+ "plt.imshow((255-np.array(image))*(255-np.array(mask_image)))\n",
+ "plt.axis('off')\n",
+ "plt.title('Masked')\n",
+ "plt.subplot(1,4,3)\n",
+ "plt.imshow(scribble_image)\n",
+ "plt.axis('off')\n",
+ "plt.title('Condition')\n",
+ "plt.subplot(1,4,4)\n",
+ "plt.imshow(new_image)\n",
+ "plt.title('Output')\n",
+ "plt.axis('off')\n",
+ "\n",
+ "plt.savefig('output/scribble_grid.png',\n",
+ " dpi=200,\n",
+ " bbox_inches='tight',\n",
+ " pad_inches=0.0\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e30c6ce2",
+ "metadata": {},
+ "source": [
+ "### Depth"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f681c4d6",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from transformers import pipeline\n",
+ "\n",
+ "depth_estimator = pipeline('depth-estimation')\n",
+ "\n",
+ "depth_image = depth_estimator(image)['depth']\n",
+ "depth_image = np.array(depth_image)\n",
+ "depth_image = depth_image[:, :, None]\n",
+ "depth_image = np.concatenate(3*[depth_image], axis=2)\n",
+ "depth_image = Image.fromarray(depth_image)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3f8fdcf5",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "controlnet = ControlNetModel.from_pretrained(\n",
+ " \"fusing/stable-diffusion-v1-5-controlnet-depth\", torch_dtype=torch.float16\n",
+ ")\n",
+ "pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(\n",
+ " \"runwayml/stable-diffusion-inpainting\", controlnet=controlnet, torch_dtype=torch.float16\n",
+ " )\n",
+ "\n",
+ "pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)\n",
+ "\n",
+ "# Remove if you do not have xformers installed\n",
+ "# see https://huggingface.co/docs/diffusers/v0.13.0/en/optimization/xformers#installing-xformers\n",
+ "# for installation instructions\n",
+ "pipe.enable_xformers_memory_efficient_attention()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "58ab718d",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "pipe.to('cuda')\n",
+ "\n",
+ "# generate image\n",
+ "generator = torch.manual_seed(0)\n",
+ "new_image = pipe(\n",
+ " text_prompt,\n",
+ " num_inference_steps=20,\n",
+ " generator=generator,\n",
+ " image=image,\n",
+ " control_image=depth_image,\n",
+ " mask_image=mask_image\n",
+ ").images[0]\n",
+ "\n",
+ "new_image.save('output/depth_result.png')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "82ac435e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "plt.figure(figsize=(12,4))\n",
+ "\n",
+ "plt.subplot(1,4,1)\n",
+ "plt.imshow(image)\n",
+ "plt.axis('off')\n",
+ "plt.title('Input')\n",
+ "plt.subplot(1,4,2)\n",
+ "plt.imshow((255-np.array(image))*(255-np.array(mask_image)))\n",
+ "plt.axis('off')\n",
+ "plt.title('Masked')\n",
+ "plt.subplot(1,4,3)\n",
+ "plt.imshow(depth_image)\n",
+ "plt.axis('off')\n",
+ "plt.title('Condition')\n",
+ "plt.subplot(1,4,4)\n",
+ "plt.imshow(new_image)\n",
+ "plt.title('Output')\n",
+ "plt.axis('off')\n",
+ "\n",
+ "plt.savefig('output/depth_grid.png',\n",
+ " dpi=200,\n",
+ " bbox_inches='tight',\n",
+ " pad_inches=0.0\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "93db13cb",
+ "metadata": {},
+ "source": [
+ "### Normal Map"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "08ffd6da",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import cv2\n",
+ "\n",
+ "depth_estimator = pipeline(\"depth-estimation\", model =\"Intel/dpt-hybrid-midas\" )\n",
+ "\n",
+ "normal_image = depth_estimator(image)['predicted_depth'][0]\n",
+ "\n",
+ "normal_image = normal_image.numpy()\n",
+ "\n",
+ "image_depth = normal_image.copy()\n",
+ "image_depth -= np.min(image_depth)\n",
+ "image_depth /= np.max(image_depth)\n",
+ "\n",
+ "bg_threhold = 0.4\n",
+ "\n",
+ "x = cv2.Sobel(normal_image, cv2.CV_32F, 1, 0, ksize=3)\n",
+ "x[image_depth < bg_threhold] = 0\n",
+ "\n",
+ "y = cv2.Sobel(normal_image, cv2.CV_32F, 0, 1, ksize=3)\n",
+ "y[image_depth < bg_threhold] = 0\n",
+ "\n",
+ "z = np.ones_like(x) * np.pi * 2.0\n",
+ "\n",
+ "normal_image = np.stack([x, y, z], axis=2)\n",
+ "normal_image /= np.sum(normal_image ** 2.0, axis=2, keepdims=True) ** 0.5\n",
+ "normal_image = (normal_image * 127.5 + 127.5).clip(0, 255).astype(np.uint8)\n",
+ "normal_image = Image.fromarray(normal_image).resize((512,512))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c41bd52b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "controlnet = ControlNetModel.from_pretrained(\n",
+ " \"fusing/stable-diffusion-v1-5-controlnet-normal\", torch_dtype=torch.float16\n",
+ ")\n",
+ "pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(\n",
+ " \"runwayml/stable-diffusion-inpainting\", controlnet=controlnet, torch_dtype=torch.float16\n",
+ " )\n",
+ "\n",
+ "pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)\n",
+ "\n",
+ "# Remove if you do not have xformers installed\n",
+ "# see https://huggingface.co/docs/diffusers/v0.13.0/en/optimization/xformers#installing-xformers\n",
+ "# for installation instructions\n",
+ "pipe.enable_xformers_memory_efficient_attention()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c8b5a39e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "pipe.to('cuda')\n",
+ "\n",
+ "# generate image\n",
+ "generator = torch.manual_seed(0)\n",
+ "new_image = pipe(\n",
+ " text_prompt,\n",
+ " num_inference_steps=20,\n",
+ " generator=generator,\n",
+ " image=image,\n",
+ " control_image=normal_image,\n",
+ " mask_image=mask_image\n",
+ ").images[0]\n",
+ "\n",
+ "new_image.save('output/normal_result.png')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "2737d23f",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "plt.figure(figsize=(12,4))\n",
+ "\n",
+ "plt.subplot(1,4,1)\n",
+ "plt.imshow(image)\n",
+ "plt.axis('off')\n",
+ "plt.title('Input')\n",
+ "plt.subplot(1,4,2)\n",
+ "plt.imshow((255-np.array(image))*(255-np.array(mask_image)))\n",
+ "plt.axis('off')\n",
+ "plt.title('Masked')\n",
+ "plt.subplot(1,4,3)\n",
+ "plt.imshow(normal_image)\n",
+ "plt.axis('off')\n",
+ "plt.title('Condition')\n",
+ "plt.subplot(1,4,4)\n",
+ "plt.imshow(new_image)\n",
+ "plt.title('Output')\n",
+ "plt.axis('off')\n",
+ "\n",
+ "plt.savefig('output/normal_grid.png',\n",
+ " dpi=200,\n",
+ " bbox_inches='tight',\n",
+ " pad_inches=0.0\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "04683be6",
+ "metadata": {},
+ "source": [
+ "### More control input types\n",
+ "For these control input types we will use a different image, since an image of the dog on the bench is not appropriate in these cases!\n",
+ "\n",
+ "Let's start with a room photo..."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2d5c7d55",
+ "metadata": {},
+ "source": [
+ "### M-LSD"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9d2e3a7b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from controlnet_aux import MLSDdetector\n",
+ "\n",
+ "mlsd = MLSDdetector.from_pretrained('lllyasviel/ControlNet')\n",
+ "\n",
+ "room_image = load_image(\"https://huggingface.co/lllyasviel/sd-controlnet-mlsd/resolve/main/images/room.png\")\n",
+ "\n",
+ "mlsd_image = mlsd(room_image).resize(room_image.size)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "45629903",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "room_mask=np.zeros_like(np.array(room_image))\n",
+ "room_mask[120:420,220:,:]=255\n",
+ "room_mask=Image.fromarray(room_mask)\n",
+ "\n",
+ "\n",
+ "room_mask=room_mask.resize((512,512))\n",
+ "mlsd_image=mlsd_image.resize((512,512))\n",
+ "room_image=room_image.resize((512,512))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "e491ab22",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "controlnet = ControlNetModel.from_pretrained(\n",
+ " \"fusing/stable-diffusion-v1-5-controlnet-mlsd\", torch_dtype=torch.float16\n",
+ ")\n",
+ "pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(\n",
+ " \"runwayml/stable-diffusion-inpainting\", controlnet=controlnet, torch_dtype=torch.float16\n",
+ " )\n",
+ "\n",
+ "pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)\n",
+ "\n",
+ "# Remove if you do not have xformers installed\n",
+ "# see https://huggingface.co/docs/diffusers/v0.13.0/en/optimization/xformers#installing-xformers\n",
+ "# for installation instructions\n",
+ "pipe.enable_xformers_memory_efficient_attention()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b414f354",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "pipe.to('cuda')\n",
+ "\n",
+ "# generate image\n",
+ "generator = torch.manual_seed(0)\n",
+ "new_image = pipe(\n",
+ " \"an image of a room with a city skyline view\",\n",
+ " num_inference_steps=20,\n",
+ " generator=generator,\n",
+ " image=room_image,\n",
+ " control_image=mlsd_image,\n",
+ " mask_image=room_mask\n",
+ ").images[0]\n",
+ "\n",
+ "new_image.save('output/mlsd_result.png')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "326145e1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "plt.figure(figsize=(12,4))\n",
+ "\n",
+ "plt.subplot(1,4,1)\n",
+ "plt.imshow(room_image)\n",
+ "plt.axis('off')\n",
+ "plt.title('Input')\n",
+ "plt.subplot(1,4,2)\n",
+ "plt.imshow((255-np.array(room_image))*(255-np.array(room_mask)))\n",
+ "plt.axis('off')\n",
+ "plt.title('Masked')\n",
+ "plt.subplot(1,4,3)\n",
+ "plt.imshow(mlsd_image)\n",
+ "plt.axis('off')\n",
+ "plt.title('Condition')\n",
+ "plt.subplot(1,4,4)\n",
+ "plt.imshow(new_image)\n",
+ "plt.title('Output')\n",
+ "plt.axis('off')\n",
+ "\n",
+ "plt.savefig('output/mlsd_grid.png',\n",
+ " dpi=200,\n",
+ " bbox_inches='tight',\n",
+ " pad_inches=0.0\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1f68f30b",
+ "metadata": {},
+ "source": [
+ "### OpenPose"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "bbf9b00b",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "controlnet = ControlNetModel.from_pretrained(\n",
+ " \"fusing/stable-diffusion-v1-5-controlnet-openpose\", torch_dtype=torch.float16\n",
+ ")\n",
+ "pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(\n",
+ " \"runwayml/stable-diffusion-inpainting\", controlnet=controlnet, torch_dtype=torch.float16\n",
924
+ " )\n",
925
+ "\n",
926
+ "pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)\n",
927
+ "\n",
928
+ "# Remove if you do not have xformers installed\n",
929
+ "# see https://huggingface.co/docs/diffusers/v0.13.0/en/optimization/xformers#installing-xformers\n",
930
+ "# for installation instructions\n",
931
+ "pipe.enable_xformers_memory_efficient_attention()"
932
+ ]
933
+ },
934
+ {
935
+ "cell_type": "code",
936
+ "execution_count": null,
937
+ "id": "e819d17c",
938
+ "metadata": {},
939
+ "outputs": [],
940
+ "source": [
941
+ "from controlnet_aux import OpenposeDetector\n",
942
+ "\n",
943
+ "openpose = OpenposeDetector.from_pretrained('lllyasviel/ControlNet')\n",
944
+ "\n",
945
+ "pose_real_image = load_image(\"https://huggingface.co/lllyasviel/sd-controlnet-openpose/resolve/main/images/pose.png\")\n",
946
+ "\n",
947
+ "pose_image = openpose(pose_real_image)\n",
948
+ "pose_real_image=pose_real_image.resize(pose_image.size)\n",
949
+ "\n",
950
+ "pose_mask=np.zeros_like(np.array(pose_image))\n",
951
+ "pose_mask[250:700,:,:]=255\n",
952
+ "pose_mask=Image.fromarray(pose_mask)"
953
+ ]
954
+ },
955
+ {
956
+ "cell_type": "code",
957
+ "execution_count": null,
958
+ "id": "2b6faf93",
959
+ "metadata": {},
960
+ "outputs": [],
961
+ "source": [
962
+ "pipe.to('cuda')\n",
963
+ "\n",
964
+ "# generate image\n",
965
+ "generator = torch.manual_seed(0)\n",
966
+ "new_image = pipe(\n",
967
+ " \"a man in a knight armor\",\n",
968
+ " num_inference_steps=20,\n",
969
+ " generator=generator,\n",
970
+ " image=pose_real_image,\n",
971
+ " control_image=pose_image,\n",
972
+ " mask_image=pose_mask\n",
973
+ ").images[0]\n",
974
+ "\n",
975
+ "new_image.save('output/openpose_result.png')"
976
+ ]
977
+ },
978
+ {
979
+ "cell_type": "code",
980
+ "execution_count": null,
981
+ "id": "a665a931",
982
+ "metadata": {},
983
+ "outputs": [],
984
+ "source": [
985
+ "plt.figure(figsize=(12,4))\n",
986
+ "\n",
987
+ "plt.subplot(1,4,1)\n",
988
+ "plt.imshow(pose_real_image)\n",
989
+ "plt.axis('off')\n",
990
+ "plt.title('Input')\n",
991
+ "plt.subplot(1,4,2)\n",
992
+ "plt.imshow((255-np.array(pose_real_image))*(255-np.array(pose_mask)))\n",
993
+ "plt.axis('off')\n",
994
+ "plt.title('Masked')\n",
995
+ "plt.subplot(1,4,3)\n",
996
+ "plt.imshow(pose_image)\n",
997
+ "plt.axis('off')\n",
998
+ "plt.title('Condition')\n",
999
+ "plt.subplot(1,4,4)\n",
1000
+ "plt.imshow(new_image)\n",
1001
+ "plt.title('Output')\n",
1002
+ "plt.axis('off')\n",
1003
+ "\n",
1004
+ "\n",
1005
+ "plt.savefig('output/openpose_grid.png',\n",
1006
+ " dpi=200,\n",
1007
+ " bbox_inches='tight',\n",
1008
+ " pad_inches=0.0\n",
1009
+ " )"
1010
+ ]
1011
+ },
1012
+ {
1013
+ "cell_type": "markdown",
1014
+ "id": "b982380d",
1015
+ "metadata": {},
1016
+ "source": [
1017
+ "### Segmentation Mask"
1018
+ ]
1019
+ },
1020
+ {
1021
+ "cell_type": "code",
1022
+ "execution_count": null,
1023
+ "id": "f667b04a",
1024
+ "metadata": {},
1025
+ "outputs": [],
1026
+ "source": [
1027
+ "controlnet = ControlNetModel.from_pretrained(\n",
1028
+ " \"fusing/stable-diffusion-v1-5-controlnet-seg\", torch_dtype=torch.float16\n",
1029
+ ")\n",
1030
+ "pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(\n",
1031
+ " \"runwayml/stable-diffusion-inpainting\", controlnet=controlnet, torch_dtype=torch.float16\n",
1032
+ " )\n",
1033
+ "\n",
1034
+ "pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)\n",
1035
+ "\n",
1036
+ "# Remove if you do not have xformers installed\n",
1037
+ "# see https://huggingface.co/docs/diffusers/v0.13.0/en/optimization/xformers#installing-xformers\n",
1038
+ "# for installation instructions\n",
1039
+ "pipe.enable_xformers_memory_efficient_attention()"
1040
+ ]
1041
+ },
1042
+ {
1043
+ "cell_type": "code",
1044
+ "execution_count": null,
1045
+ "id": "cb27c72b",
1046
+ "metadata": {},
1047
+ "outputs": [],
1048
+ "source": [
1049
+ "house_real_image=load_image(\"https://huggingface.co/lllyasviel/sd-controlnet-seg/resolve/main/images/house.png\")\n",
1050
+ "seg_image=load_image(\"https://huggingface.co/lllyasviel/sd-controlnet-seg/resolve/main/images/house_seg.png\")\n",
1051
+ "\n",
1052
+ "house_mask=np.zeros((*seg_image.size,3),dtype='uint8')\n",
1053
+ "house_mask[50:400,-350:,:]=255\n",
1054
+ "house_mask=Image.fromarray(house_mask)"
1055
+ ]
1056
+ },
1057
+ {
1058
+ "cell_type": "code",
1059
+ "execution_count": null,
1060
+ "id": "1f81e50b",
1061
+ "metadata": {},
1062
+ "outputs": [],
1063
+ "source": [
1064
+ "pipe.to('cuda')\n",
1065
+ "\n",
1066
+ "# generate image\n",
1067
+ "generator = torch.manual_seed(0)\n",
1068
+ "new_image = pipe(\n",
1069
+ " \"a pink eerie scary house\",\n",
1070
+ " num_inference_steps=20,\n",
1071
+ " generator=generator,\n",
1072
+ " image=house_real_image,\n",
1073
+ " control_image=seg_image,\n",
1074
+ " mask_image=house_mask\n",
1075
+ ").images[0]\n",
1076
+ "\n",
1077
+ "new_image.save('output/seg_result.png')"
1078
+ ]
1079
+ },
1080
+ {
1081
+ "cell_type": "code",
1082
+ "execution_count": null,
1083
+ "id": "37c0d695",
1084
+ "metadata": {},
1085
+ "outputs": [],
1086
+ "source": [
1087
+ "plt.figure(figsize=(12,4))\n",
1088
+ "\n",
1089
+ "plt.subplot(1,4,1)\n",
1090
+ "plt.imshow(house_real_image)\n",
1091
+ "plt.axis('off')\n",
1092
+ "plt.title('Input')\n",
1093
+ "plt.subplot(1,4,2)\n",
1094
+ "plt.imshow((255-np.array(house_real_image))*(255-np.array(house_mask)))\n",
1095
+ "plt.axis('off')\n",
1096
+ "plt.title('Masked')\n",
1097
+ "plt.subplot(1,4,3)\n",
1098
+ "plt.imshow(seg_image)\n",
1099
+ "plt.axis('off')\n",
1100
+ "plt.title('Condition')\n",
1101
+ "plt.subplot(1,4,4)\n",
1102
+ "plt.imshow(new_image)\n",
1103
+ "plt.title('Output')\n",
1104
+ "plt.axis('off')\n",
1105
+ "\n",
1106
+ "plt.savefig('output/seg_grid.png',\n",
1107
+ " dpi=200,\n",
1108
+ " bbox_inches='tight',\n",
1109
+ " pad_inches=0.0\n",
1110
+ " )"
1111
+ ]
1112
+ },
1113
+ {
1114
+ "cell_type": "code",
1115
+ "execution_count": null,
1116
+ "id": "7b8346f7",
1117
+ "metadata": {},
1118
+ "outputs": [],
1119
+ "source": []
1120
+ }
1121
+ ],
1122
+ "metadata": {
1123
+ "language_info": {
1124
+ "name": "python",
1125
+ "pygments_lexer": "ipython3"
1126
+ }
1127
+ },
1128
+ "nbformat": 4,
1129
+ "nbformat_minor": 5
1130
+ }
ControlNetInpaint/LICENSE ADDED
@@ -0,0 +1,21 @@
+ MIT License
+
+ Copyright (c) 2023 mikonvergence
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy
+ of this software and associated documentation files (the "Software"), to deal
+ in the Software without restriction, including without limitation the rights
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ copies of the Software, and to permit persons to whom the Software is
+ furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all
+ copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ SOFTWARE.
ControlNetInpaint/README.md ADDED
@@ -0,0 +1,111 @@
+ # :recycle: ControlNetInpaint
+ [![Open In Collab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/mikonvergence/ControlNetInpaint/blob/main/ControlNet-with-Inpaint-Demo-colab.ipynb)
+
+ [ControlNet](https://github.com/lllyasviel/ControlNet) has proven to be a great tool for guiding StableDiffusion models with image-based hints! But what about **changing only a part of the image** based on that hint?
+
+ :crystal_ball: The initial set of ControlNet models was not trained to work with the StableDiffusion inpainting backbone, but it turns out that the results can be pretty good!
+
+ In this repository, you will find a basic example notebook that shows how this can work. **The key trick is to use the right value of the parameter** `controlnet_conditioning_scale`: while a value of `1.0` often works well, it is sometimes beneficial to bring it down a bit when the controlling image does not fit the selected text prompt very well.
+
+ ## Demos on 🤗 HuggingFace Using ControlNetInpaint
+ ### :pencil2: Mask and Sketch
+ Check out the [HuggingFace Space](https://huggingface.co/spaces/mikonvergence/mask-and-sketch), which allows you to scribble and describe how you want to recreate a part of an image:
+ [<img width="1518" alt="Screenshot 2023-04-16 at 11 56 29" src="https://user-images.githubusercontent.com/13435425/232302552-123744ba-4953-4972-9df8-ab19ee7b599b.png">](https://huggingface.co/spaces/mikonvergence/mask-and-sketch)
+
+ ### :performing_arts: theaTRON
+ Check out the [HuggingFace Space](https://huggingface.co/spaces/mikonvergence/theaTRON) that reimagines scenes with human subjects using a text prompt:
+ [<img width="1518" alt="theaTRON tool examples" src="https://huggingface.co/spaces/mikonvergence/theaTRON/resolve/main/data/image-only.png">](https://huggingface.co/spaces/mikonvergence/theaTRON)
+
+ ## Code Usage
+ > This code is currently compatible with `diffusers==0.14.0`. An upgrade to the latest version can be expected in the near future (currently, `0.15.0` contains some breaking changes that should ideally be fixed on the side of the diffusers interface).
+
+ Here's an example of how this new pipeline (`StableDiffusionControlNetInpaintPipeline`) is used with the core backbone of `"runwayml/stable-diffusion-inpainting"`:
+ ```python
+ # load control net and stable diffusion v1-5
+ controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
+ pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
+     "runwayml/stable-diffusion-inpainting", controlnet=controlnet, torch_dtype=torch.float16
+ )
+
+ # speed up diffusion process with faster scheduler and memory optimization
+ pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
+ # remove following line if xformers is not installed
+ pipe.enable_xformers_memory_efficient_attention()
+
+ pipe.to('cuda')
+
+ # generate image
+ generator = torch.manual_seed(0)
+ new_image = pipe(
+     text_prompt,
+     num_inference_steps=20,
+     generator=generator,
+     image=image,
+     control_image=canny_image,
+     mask_image=mask_image
+ ).images[0]
+ ```
+ (A full example of how to get the images and run the pipeline is available in the notebook!)
+
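As a minimal illustration of the mask convention used by the pipeline (white pixels mark the region to repaint, black pixels are preserved), a binary mask can be built with plain NumPy. The sizes and rectangle coordinates below are just an example, not values required by the pipeline:

```python
import numpy as np

# Hypothetical 512x512 RGB mask: the rectangle [120:420, 220:] is
# marked for inpainting (255 = repaint, 0 = keep).
height, width = 512, 512
mask = np.zeros((height, width, 3), dtype=np.uint8)
mask[120:420, 220:, :] = 255

# Fraction of pixels that will be repainted
white_fraction = float((mask[..., 0] == 255).mean())
print(round(white_fraction, 3))  # → 0.334
```

The resulting array can be wrapped with `PIL.Image.fromarray` and passed as `mask_image`.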
+ ## Results
+ All results below have been generated using the `ControlNet-with-Inpaint-Demo.ipynb` notebook.
+
+ Let's start with turning a dog into a red panda!
+ ### Canny Edge
+ **Prompt**: *"a red panda sitting on a bench"*
+
+ ![Canny Result](output/canny_grid.png)
+
+ ### HED
+ **Prompt**: *"a red panda sitting on a bench"*
+
+ ![HED Result](output/hed_grid.png)
+
+ ### Scribble
+ **Prompt**: *"a red panda sitting on a bench"*
+
+ ![Scribble Result](output/scribble_grid.png)
+
+ ### Depth
+ **Prompt**: *"a red panda sitting on a bench"*
+
+ ![Depth Result](output/depth_grid.png)
+
+ ### Normal
+ **Prompt**: *"a red panda sitting on a bench"*
+
+ ![Normal Result](output/normal_grid.png)
+
+ For the remaining modalities, the panda example doesn't make much sense, so we use different images and prompts to illustrate each capability!
+
+ ### M-LSD
+ **Prompt**: *"an image of a room with a city skyline view"*
+
+ ![MLSD Result](output/mlsd_grid.png)
+
+ ### OpenPose
+ **Prompt**: *"a man in a knight armor"*
+
+ ![OpenPose Result](output/openpose_grid.png)
+
+ ### Segmentation Mask
+ **Prompt**: *"a pink eerie scary house"*
+
+ ![Segmentation Result](output/seg_grid.png)
+
+ ## Challenging Example 🐕➡️🍔
+ Let's see how tuning `controlnet_conditioning_scale` works out for a more challenging example of turning the dog into a cheeseburger!
+
+ In this case, we **demand a large semantic leap**, and that requires a more subtle guide from the control image!
+
+ ![Cheeseburger Result](output/canny_cheeseburger_grid.png)
+
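One way to explore this trade-off is to sweep `controlnet_conditioning_scale` over a few values with a fixed seed. The sketch below assumes `pipe`, `image`, `canny_image` and `mask_image` are set up as in the Code Usage section; the actual `pipe(...)` call (and its output path) is hypothetical and left commented out:

```python
# Sweep a few guidance strengths: 1.0 keeps the dog's edges almost
# intact, while lower values give the model more freedom to invent
# new structure inside the mask.
scales = [1.0, 0.7, 0.4]

for scale in scales:
    out_path = f"output/cheeseburger_scale_{scale}.png"  # hypothetical path
    print(scale, out_path)
    # generator = torch.manual_seed(0)  # same seed for a fair comparison
    # pipe(
    #     "a delicious cheeseburger",
    #     num_inference_steps=20,
    #     generator=generator,
    #     image=image,
    #     control_image=canny_image,
    #     mask_image=mask_image,
    #     controlnet_conditioning_scale=scale,
    # ).images[0].save(out_path)
```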
+ ### :fast_forward: DiffusionFastForward: learn diffusion from the ground up! 🎻
+ If you want to learn more about the process of denoising diffusion for images, check out the **open-source course** [DiffusionFastForward](https://github.com/mikonvergence/DiffusionFastForward), with colab notebooks where networks are trained from scratch on high-resolution data! :beginner:
+
+ [![Logo](https://user-images.githubusercontent.com/13435425/222425743-213279f9-d0a1-413c-a16a-2c88b512f827.png)](https://github.com/mikonvergence/DiffusionFastForward)
+
+ ### Acknowledgement
+ There is a related excellent repository, [ControlNet-for-Any-Basemodel](https://github.com/haofanwang/ControlNet-for-Diffusers), which, among many other things, also shows similar examples of using ControlNet for inpainting. However, that pipeline is defined quite differently and, most importantly, does not allow for controlling `controlnet_conditioning_scale` as an input argument.
+
+ There are other differences, such as the fact that this implementation only needs to instantiate one pipeline (as opposed to two in the other repository), but **the key motivation for publishing this repository is to provide a space solely focused on the application of ControlNet for inpainting.**
ControlNetInpaint/output/baseline_grid.png ADDED
Git LFS Details
  • SHA256: 11ebb2475096f74e87cae92bf87379d3917d1391be0054bd04410c69b2f05658
  • Pointer size: 132 Bytes
  • Size of remote file: 1.03 MB
ControlNetInpaint/output/baseline_result.png ADDED
ControlNetInpaint/output/canny_cheeseburger.png ADDED
Git LFS Details
  • SHA256: ad0894cf88bd7f24abfc0509570638aa349d85700673a260493c8184b2c7073a
  • Pointer size: 132 Bytes
  • Size of remote file: 1.31 MB
ControlNetInpaint/output/canny_cheeseburger_grid.png ADDED
Git LFS Details
  • SHA256: 70b7be97ade344bf207512592d44c0c9ed4e3607ce1e69e7f9969bbe2fdeedbe
  • Pointer size: 132 Bytes
  • Size of remote file: 3.65 MB
ControlNetInpaint/output/canny_grid.png ADDED
ControlNetInpaint/output/canny_result.png ADDED
ControlNetInpaint/output/depth_grid.png ADDED
ControlNetInpaint/output/depth_result.png ADDED
ControlNetInpaint/output/hed_grid.png ADDED
ControlNetInpaint/output/hed_result.png ADDED
ControlNetInpaint/output/mlsd_grid.png ADDED
ControlNetInpaint/output/mlsd_result.png ADDED
ControlNetInpaint/output/normal_grid.png ADDED
ControlNetInpaint/output/normal_result.png ADDED
ControlNetInpaint/output/openpose_grid.png ADDED
ControlNetInpaint/output/openpose_result.png ADDED
ControlNetInpaint/output/scribble_grid.png ADDED
ControlNetInpaint/output/scribble_result.png ADDED
ControlNetInpaint/output/seg_grid.png ADDED
ControlNetInpaint/output/seg_result.png ADDED
ControlNetInpaint/setup.py ADDED
@@ -0,0 +1,25 @@
+ from setuptools import setup
+
+ setup(
+     name='controlnetinpaint',
+     version='0.1',
+     description='ControlNet Inpainting with StableDiffusion',
+     url='https://github.com/mikonvergence/ControlNetInpaint',
+     author='Mikolaj Czerkawski',
+     author_email="[email protected]",
+     package_dir={"controlnetinpaint": "src"},
+     install_requires=[
+         "torch>=1.10.0",
+         "torchvision",
+         "numpy",
+         "tqdm",
+         "pillow",
+         "diffusers==0.14.0",
+         "xformers",
+         "transformers",
+         "scipy",
+         "ftfy",
+         "accelerate",
+         "controlnet_aux"
+     ],
+ )
ControlNetInpaint/src/pipeline_stable_diffusion_controlnet_inpaint.py ADDED
@@ -0,0 +1,521 @@
+ # Copyright 2023 The HuggingFace Team. All rights reserved.
+ #
+ # Licensed under the Apache License, Version 2.0 (the "License");
+ # you may not use this file except in compliance with the License.
+ # You may obtain a copy of the License at
+ #
+ #     http://www.apache.org/licenses/LICENSE-2.0
+ #
+ # Unless required by applicable law or agreed to in writing, software
+ # distributed under the License is distributed on an "AS IS" BASIS,
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ # See the License for the specific language governing permissions and
+ # limitations under the License.
+
+ import torch
+ import PIL.Image
+ import numpy as np
+
+ from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_controlnet import *
+
+ EXAMPLE_DOC_STRING = """
+     Examples:
+         ```py
+         >>> # !pip install opencv-python transformers accelerate
+         >>> from diffusers import StableDiffusionControlNetInpaintPipeline, ControlNetModel, UniPCMultistepScheduler
+         >>> from diffusers.utils import load_image
+         >>> import numpy as np
+         >>> import torch
+
+         >>> import cv2
+         >>> from PIL import Image
+         >>> # download an image
+         >>> image = load_image(
+         ...     "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
+         ... )
+         >>> image = np.array(image)
+         >>> mask_image = load_image(
+         ...     "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
+         ... )
+         >>> mask_image = np.array(mask_image)
+         >>> # get canny image
+         >>> canny_image = cv2.Canny(image, 100, 200)
+         >>> canny_image = canny_image[:, :, None]
+         >>> canny_image = np.concatenate([canny_image, canny_image, canny_image], axis=2)
+         >>> canny_image = Image.fromarray(canny_image)
+
+         >>> # load control net and stable diffusion v1-5
+         >>> controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
+         >>> pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained(
+         ...     "runwayml/stable-diffusion-inpainting", controlnet=controlnet, torch_dtype=torch.float16
+         ... )
+
+         >>> # speed up diffusion process with faster scheduler and memory optimization
+         >>> pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
+         >>> # remove following line if xformers is not installed
+         >>> pipe.enable_xformers_memory_efficient_attention()
+
+         >>> pipe.enable_model_cpu_offload()
+
+         >>> # generate image
+         >>> generator = torch.manual_seed(0)
+         >>> image = pipe(
+         ...     "futuristic-looking doggo",
+         ...     num_inference_steps=20,
+         ...     generator=generator,
+         ...     image=image,
+         ...     control_image=canny_image,
+         ...     mask_image=mask_image
+         ... ).images[0]
+         ```
+ """
+
+
+ def prepare_mask_and_masked_image(image, mask):
+     """
+     Prepares a pair (image, mask) to be consumed by the Stable Diffusion pipeline. This means that those inputs will be
+     converted to ``torch.Tensor`` with shapes ``batch x channels x height x width`` where ``channels`` is ``3`` for the
+     ``image`` and ``1`` for the ``mask``.
+     The ``image`` will be converted to ``torch.float32`` and normalized to be in ``[-1, 1]``. The ``mask`` will be
+     binarized (``mask > 0.5``) and cast to ``torch.float32`` too.
+     Args:
+         image (Union[np.array, PIL.Image, torch.Tensor]): The image to inpaint.
+             It can be a ``PIL.Image``, or a ``height x width x 3`` ``np.array`` or a ``channels x height x width``
+             ``torch.Tensor`` or a ``batch x channels x height x width`` ``torch.Tensor``.
+         mask (_type_): The mask to apply to the image, i.e. regions to inpaint.
+             It can be a ``PIL.Image``, or a ``height x width`` ``np.array`` or a ``1 x height x width``
+             ``torch.Tensor`` or a ``batch x 1 x height x width`` ``torch.Tensor``.
+     Raises:
+         ValueError: ``torch.Tensor`` images should be in the ``[-1, 1]`` range. ValueError: ``torch.Tensor`` mask
+             should be in the ``[0, 1]`` range. ValueError: ``mask`` and ``image`` should have the same spatial dimensions.
+         TypeError: ``mask`` is a ``torch.Tensor`` but ``image`` is not
+             (or the other way around).
+     Returns:
+         tuple[torch.Tensor]: The pair (mask, masked_image) as ``torch.Tensor`` with 4
+         dimensions: ``batch x channels x height x width``.
+     """
+     if isinstance(image, torch.Tensor):
+         if not isinstance(mask, torch.Tensor):
+             raise TypeError(f"`image` is a torch.Tensor but `mask` (type: {type(mask)} is not")
+
+         # Batch single image
+         if image.ndim == 3:
+             assert image.shape[0] == 3, "Image outside a batch should be of shape (3, H, W)"
+             image = image.unsqueeze(0)
+
+         # Batch and add channel dim for single mask
+         if mask.ndim == 2:
+             mask = mask.unsqueeze(0).unsqueeze(0)
+
+         # Batch single mask or add channel dim
+         if mask.ndim == 3:
+             # Single batched mask, no channel dim or single mask not batched but channel dim
+             if mask.shape[0] == 1:
+                 mask = mask.unsqueeze(0)
+
+             # Batched masks no channel dim
+             else:
+                 mask = mask.unsqueeze(1)
+
+         assert image.ndim == 4 and mask.ndim == 4, "Image and Mask must have 4 dimensions"
+         assert image.shape[-2:] == mask.shape[-2:], "Image and Mask must have the same spatial dimensions"
+         assert image.shape[0] == mask.shape[0], "Image and Mask must have the same batch size"
+
+         # Check image is in [-1, 1]
+         if image.min() < -1 or image.max() > 1:
+             raise ValueError("Image should be in [-1, 1] range")
+
+         # Check mask is in [0, 1]
+         if mask.min() < 0 or mask.max() > 1:
+             raise ValueError("Mask should be in [0, 1] range")
+
+         # Binarize mask
+         mask[mask < 0.5] = 0
+         mask[mask >= 0.5] = 1
+
+         # Image as float32
+         image = image.to(dtype=torch.float32)
+     elif isinstance(mask, torch.Tensor):
+         raise TypeError(f"`mask` is a torch.Tensor but `image` (type: {type(image)} is not")
+     else:
+         # preprocess image
+         if isinstance(image, (PIL.Image.Image, np.ndarray)):
+             image = [image]
+
+         if isinstance(image, list) and isinstance(image[0], PIL.Image.Image):
+             image = [np.array(i.convert("RGB"))[None, :] for i in image]
+             image = np.concatenate(image, axis=0)
+         elif isinstance(image, list) and isinstance(image[0], np.ndarray):
+             image = np.concatenate([i[None, :] for i in image], axis=0)
+
+         image = image.transpose(0, 3, 1, 2)
+         image = torch.from_numpy(image).to(dtype=torch.float32) / 127.5 - 1.0
+
+         # preprocess mask
+         if isinstance(mask, (PIL.Image.Image, np.ndarray)):
+             mask = [mask]
+
+         if isinstance(mask, list) and isinstance(mask[0], PIL.Image.Image):
+             mask = np.concatenate([np.array(m.convert("L"))[None, None, :] for m in mask], axis=0)
+             mask = mask.astype(np.float32) / 255.0
+         elif isinstance(mask, list) and isinstance(mask[0], np.ndarray):
+             mask = np.concatenate([m[None, None, :] for m in mask], axis=0)
+
+         mask[mask < 0.5] = 0
+         mask[mask >= 0.5] = 1
+         mask = torch.from_numpy(mask)
+
+     masked_image = image * (mask < 0.5)
+
+     return mask, masked_image
+
+ class StableDiffusionControlNetInpaintPipeline(StableDiffusionControlNetPipeline):
+     r"""
+     Pipeline for text-guided image inpainting using Stable Diffusion with ControlNet guidance.
+
+     This model inherits from [`StableDiffusionControlNetPipeline`]. Check the superclass documentation for the generic methods the
+     library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)
+
+     Args:
+         vae ([`AutoencoderKL`]):
+             Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations.
+         text_encoder ([`CLIPTextModel`]):
+             Frozen text-encoder. Stable Diffusion uses the text portion of
+             [CLIP](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel), specifically
+             the [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant.
+         tokenizer (`CLIPTokenizer`):
+             Tokenizer of class
+             [CLIPTokenizer](https://huggingface.co/docs/transformers/v4.21.0/en/model_doc/clip#transformers.CLIPTokenizer).
+         unet ([`UNet2DConditionModel`]): Conditional U-Net architecture to denoise the encoded image latents.
+         controlnet ([`ControlNetModel`]):
+             Provides additional conditioning to the unet during the denoising process.
+         scheduler ([`SchedulerMixin`]):
+             A scheduler to be used in combination with `unet` to denoise the encoded image latents. Can be one of
+             [`DDIMScheduler`], [`LMSDiscreteScheduler`], or [`PNDMScheduler`].
+         safety_checker ([`StableDiffusionSafetyChecker`]):
+             Classification module that estimates whether generated images could be considered offensive or harmful.
+             Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for details.
+         feature_extractor ([`CLIPFeatureExtractor`]):
+             Model that extracts features from generated images to be used as inputs for the `safety_checker`.
+     """
+
+     def prepare_mask_latents(
+         self, mask, masked_image, batch_size, height, width, dtype, device, generator, do_classifier_free_guidance
+     ):
+         # resize the mask to latents shape as we concatenate the mask to the latents
+         # we do that before converting to dtype to avoid breaking in case we're using cpu_offload
+         # and half precision
+         mask = torch.nn.functional.interpolate(
+             mask, size=(height // self.vae_scale_factor, width // self.vae_scale_factor)
+         )
+         mask = mask.to(device=device, dtype=dtype)
+
+         masked_image = masked_image.to(device=device, dtype=dtype)
+
+         # encode the mask image into latent space so we can concatenate it to the latents
+         if isinstance(generator, list):
+             masked_image_latents = [
+                 self.vae.encode(masked_image[i : i + 1]).latent_dist.sample(generator=generator[i])
+                 for i in range(batch_size)
+             ]
+             masked_image_latents = torch.cat(masked_image_latents, dim=0)
+         else:
+             masked_image_latents = self.vae.encode(masked_image).latent_dist.sample(generator=generator)
+         masked_image_latents = self.vae.config.scaling_factor * masked_image_latents
+
+         # duplicate mask and masked_image_latents for each generation per prompt, using mps friendly method
+         if mask.shape[0] < batch_size:
+             if not batch_size % mask.shape[0] == 0:
+                 raise ValueError(
+                     "The passed mask and the required batch size don't match. Masks are supposed to be duplicated to"
+                     f" a total batch size of {batch_size}, but {mask.shape[0]} masks were passed. Make sure the number"
+                     " of masks that you pass is divisible by the total requested batch size."
+                 )
+             mask = mask.repeat(batch_size // mask.shape[0], 1, 1, 1)
+         if masked_image_latents.shape[0] < batch_size:
+             if not batch_size % masked_image_latents.shape[0] == 0:
+                 raise ValueError(
+                     "The passed images and the required batch size don't match. Images are supposed to be duplicated"
+                     f" to a total batch size of {batch_size}, but {masked_image_latents.shape[0]} images were passed."
+                     " Make sure the number of images that you pass is divisible by the total requested batch size."
+                 )
+             masked_image_latents = masked_image_latents.repeat(batch_size // masked_image_latents.shape[0], 1, 1, 1)
+
+         mask = torch.cat([mask] * 2) if do_classifier_free_guidance else mask
+         masked_image_latents = (
+             torch.cat([masked_image_latents] * 2) if do_classifier_free_guidance else masked_image_latents
+         )
+
+         # aligning device to prevent device errors when concatenating it with the latent model input
+         masked_image_latents = masked_image_latents.to(device=device, dtype=dtype)
+         return mask, masked_image_latents
+
+     @torch.no_grad()
+     @replace_example_docstring(EXAMPLE_DOC_STRING)
+     def __call__(
+         self,
+         prompt: Union[str, List[str]] = None,
+         image: Union[torch.FloatTensor, PIL.Image.Image] = None,
+         control_image: Union[torch.FloatTensor, PIL.Image.Image, List[torch.FloatTensor], List[PIL.Image.Image]] = None,
+         mask_image: Union[torch.FloatTensor, PIL.Image.Image] = None,
+         height: Optional[int] = None,
+         width: Optional[int] = None,
+         num_inference_steps: int = 50,
+         guidance_scale: float = 7.5,
+         negative_prompt: Optional[Union[str, List[str]]] = None,
+         num_images_per_prompt: Optional[int] = 1,
+         eta: float = 0.0,
+         generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
+         latents: Optional[torch.FloatTensor] = None,
+         prompt_embeds: Optional[torch.FloatTensor] = None,
+         negative_prompt_embeds: Optional[torch.FloatTensor] = None,
+         output_type: Optional[str] = "pil",
+         return_dict: bool = True,
+         callback: Optional[Callable[[int, int, torch.FloatTensor], None]] = None,
+         callback_steps: int = 1,
+         cross_attention_kwargs: Optional[Dict[str, Any]] = None,
+         controlnet_conditioning_scale: float = 1.0,
+     ):
+         r"""
+         Function invoked when calling the pipeline for generation.
+         Args:
+             prompt (`str` or `List[str]`, *optional*):
+                 The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`
+                 instead.
+             image (`PIL.Image.Image`):
+                 `Image`, or tensor representing an image batch which will be inpainted, *i.e.* parts of the image will
+                 be masked out with `mask_image` and repainted according to `prompt`.
+             control_image (`torch.FloatTensor`, `PIL.Image.Image`, `List[torch.FloatTensor]` or `List[PIL.Image.Image]`):
+                 The ControlNet input condition. ControlNet uses this input condition to generate guidance to Unet. If
+                 the type is specified as `torch.FloatTensor`, it is passed to ControlNet as is. `PIL.Image.Image` can
+                 also be accepted as an image. The control image is automatically resized to fit the output image.
+             mask_image (`PIL.Image.Image`):
+                 `Image`, or tensor representing an image batch, to mask `image`. White pixels in the mask will be
+                 repainted, while black pixels will be preserved. If `mask_image` is a PIL image, it will be converted
200
+ """
201
+
202
+ def prepare_mask_latents(
203
+ self, mask, masked_image, batch_size, height, width, dtype, device, generator, do_classifier_free_guidance
204
+ ):
205
+ # resize the mask to latents shape as we concatenate the mask to the latents
206
+ # we do that before converting to dtype to avoid breaking in case we're using cpu_offload
207
+ # and half precision
208
+ mask = torch.nn.functional.interpolate(
209
+ mask, size=(height // self.vae_scale_factor, width // self.vae_scale_factor)
210
+ )
211
+ mask = mask.to(device=device, dtype=dtype)
212
+
213
+ masked_image = masked_image.to(device=device, dtype=dtype)
214
+
215
+ # encode the mask image into latents space so we can concatenate it to the latents
216
+ if isinstance(generator, list):
217
+ masked_image_latents = [
218
+ self.vae.encode(masked_image[i : i + 1]).latent_dist.sample(generator=generator[i])
219
+ for i in range(batch_size)
220
+ ]
221
+ masked_image_latents = torch.cat(masked_image_latents, dim=0)
222
+ else:
223
+ masked_image_latents = self.vae.encode(masked_image).latent_dist.sample(generator=generator)
224
+ masked_image_latents = self.vae.config.scaling_factor * masked_image_latents
225
+
226
+ # duplicate mask and masked_image_latents for each generation per prompt, using mps friendly method
227
+ if mask.shape[0] < batch_size:
228
+ if not batch_size % mask.shape[0] == 0:
229
+ raise ValueError(
230
+ "The passed mask and the required batch size don't match. Masks are supposed to be duplicated to"
231
+ f" a total batch size of {batch_size}, but {mask.shape[0]} masks were passed. Make sure the number"
232
+ " of masks that you pass is divisible by the total requested batch size."
233
+ )
234
+ mask = mask.repeat(batch_size // mask.shape[0], 1, 1, 1)
235
+ if masked_image_latents.shape[0] < batch_size:
236
+ if not batch_size % masked_image_latents.shape[0] == 0:
237
+ raise ValueError(
238
+ "The passed images and the required batch size don't match. Images are supposed to be duplicated"
239
+ f" to a total batch size of {batch_size}, but {masked_image_latents.shape[0]} images were passed."
240
+ " Make sure the number of images that you pass is divisible by the total requested batch size."
241
+ )
242
+ masked_image_latents = masked_image_latents.repeat(batch_size // masked_image_latents.shape[0], 1, 1, 1)
243
+
244
+ mask = torch.cat([mask] * 2) if do_classifier_free_guidance else mask
245
+ masked_image_latents = (
246
+ torch.cat([masked_image_latents] * 2) if do_classifier_free_guidance else masked_image_latents
247
+ )
248
+
249
+ # aligning device to prevent device errors when concating it with the latent model input
250
+ masked_image_latents = masked_image_latents.to(device=device, dtype=dtype)
251
+ return mask, masked_image_latents
+
+     @torch.no_grad()
+     @replace_example_docstring(EXAMPLE_DOC_STRING)
+     def __call__(
+         self,
+         prompt: Union[str, List[str]] = None,
+         image: Union[torch.FloatTensor, PIL.Image.Image] = None,
+         control_image: Union[torch.FloatTensor, PIL.Image.Image, List[torch.FloatTensor], List[PIL.Image.Image]] = None,
+         mask_image: Union[torch.FloatTensor, PIL.Image.Image] = None,
+         height: Optional[int] = None,
+         width: Optional[int] = None,
+         num_inference_steps: int = 50,
+         guidance_scale: float = 7.5,
+         negative_prompt: Optional[Union[str, List[str]]] = None,
+         num_images_per_prompt: Optional[int] = 1,
+         eta: float = 0.0,
+         generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
+         latents: Optional[torch.FloatTensor] = None,
+         prompt_embeds: Optional[torch.FloatTensor] = None,
+         negative_prompt_embeds: Optional[torch.FloatTensor] = None,
+         output_type: Optional[str] = "pil",
+         return_dict: bool = True,
+         callback: Optional[Callable[[int, int, torch.FloatTensor], None]] = None,
+         callback_steps: int = 1,
+         cross_attention_kwargs: Optional[Dict[str, Any]] = None,
+         controlnet_conditioning_scale: float = 1.0,
+     ):
+         r"""
+         Function invoked when calling the pipeline for generation.
+
+         Args:
+             prompt (`str` or `List[str]`, *optional*):
+                 The prompt or prompts to guide the image generation. If not defined, one has to pass `prompt_embeds`
+                 instead.
+             image (`PIL.Image.Image`):
+                 `Image`, or tensor representing an image batch which will be inpainted, *i.e.* parts of the image will
+                 be masked out with `mask_image` and repainted according to `prompt`.
+             control_image (`torch.FloatTensor`, `PIL.Image.Image`, `List[torch.FloatTensor]` or `List[PIL.Image.Image]`):
+                 The ControlNet input condition. ControlNet uses this input condition to generate guidance for the
+                 UNet. If the type is specified as `torch.FloatTensor`, it is passed to ControlNet as is.
+                 `PIL.Image.Image` can also be accepted as an image. The control image is automatically resized to fit
+                 the output image.
+             mask_image (`PIL.Image.Image`):
+                 `Image`, or tensor representing an image batch, to mask `image`. White pixels in the mask will be
+                 repainted, while black pixels will be preserved. If `mask_image` is a PIL image, it will be converted
+                 to a single channel (luminance) before use. If it's a tensor, it should contain one color channel (L)
+                 instead of 3, so the expected shape would be `(B, H, W, 1)`.
+             height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
+                 The height in pixels of the generated image.
+             width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`):
+                 The width in pixels of the generated image.
+             num_inference_steps (`int`, *optional*, defaults to 50):
+                 The number of denoising steps. More denoising steps usually lead to a higher quality image at the
+                 expense of slower inference.
+             guidance_scale (`float`, *optional*, defaults to 7.5):
+                 Guidance scale as defined in [Classifier-Free Diffusion Guidance](https://arxiv.org/abs/2207.12598).
+                 `guidance_scale` is defined as `w` of equation 2. of [Imagen
+                 Paper](https://arxiv.org/pdf/2205.11487.pdf). Guidance scale is enabled by setting `guidance_scale >
+                 1`. A higher guidance scale encourages generating images that are closely linked to the text `prompt`,
+                 usually at the expense of lower image quality.
+             negative_prompt (`str` or `List[str]`, *optional*):
+                 The prompt or prompts not to guide the image generation. If not defined, one has to pass
+                 `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale`
+                 is less than `1`).
+             num_images_per_prompt (`int`, *optional*, defaults to 1):
+                 The number of images to generate per prompt.
+             eta (`float`, *optional*, defaults to 0.0):
+                 Corresponds to parameter eta (η) in the DDIM paper: https://arxiv.org/abs/2010.02502. Only applies to
+                 [`schedulers.DDIMScheduler`], will be ignored for others.
+             generator (`torch.Generator` or `List[torch.Generator]`, *optional*):
+                 One or a list of [torch generator(s)](https://pytorch.org/docs/stable/generated/torch.Generator.html)
+                 to make generation deterministic.
+             latents (`torch.FloatTensor`, *optional*):
+                 Pre-generated noisy latents, sampled from a Gaussian distribution, to be used as inputs for image
+                 generation. Can be used to tweak the same generation with different prompts. If not provided, a latents
+                 tensor will be generated by sampling using the supplied random `generator`.
+             prompt_embeds (`torch.FloatTensor`, *optional*):
+                 Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
+                 provided, text embeddings will be generated from the `prompt` input argument.
+             negative_prompt_embeds (`torch.FloatTensor`, *optional*):
+                 Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
+                 weighting. If not provided, `negative_prompt_embeds` will be generated from the `negative_prompt` input
+                 argument.
+             output_type (`str`, *optional*, defaults to `"pil"`):
+                 The output format of the generated image. Choose between
+                 [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`.
+             return_dict (`bool`, *optional*, defaults to `True`):
+                 Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a
+                 plain tuple.
+             callback (`Callable`, *optional*):
+                 A function that will be called every `callback_steps` steps during inference. The function will be
+                 called with the following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`.
+             callback_steps (`int`, *optional*, defaults to 1):
+                 The frequency at which the `callback` function will be called. If not specified, the callback will be
+                 called at every step.
+             cross_attention_kwargs (`dict`, *optional*):
+                 A kwargs dictionary that, if specified, is passed along to the `AttnProcessor` as defined under
+                 `self.processor` in
+                 [diffusers.cross_attention](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/cross_attention.py).
+             controlnet_conditioning_scale (`float`, *optional*, defaults to 1.0):
+                 The outputs of the ControlNet are multiplied by `controlnet_conditioning_scale` before they are added
+                 to the residual in the original UNet.
+
+         Examples:
+
+         Returns:
+             [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`:
+             [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] if `return_dict` is True, otherwise a
+             `tuple`. When returning a tuple, the first element is a list with the generated images, and the second
+             element is a list of `bool`s denoting whether the corresponding generated image likely represents
+             "not-safe-for-work" (nsfw) content, according to the `safety_checker`.
+         """
+         # 0. Default height and width to unet
+         height, width = self._default_height_width(height, width, control_image)
+
+         # 1. Check inputs. Raise error if not correct
+         self.check_inputs(
+             prompt, control_image, height, width, callback_steps, negative_prompt, prompt_embeds, negative_prompt_embeds
+         )
+
+         # 2. Define call parameters
+         if prompt is not None and isinstance(prompt, str):
+             batch_size = 1
+         elif prompt is not None and isinstance(prompt, list):
+             batch_size = len(prompt)
+         else:
+             batch_size = prompt_embeds.shape[0]
+
+         device = self._execution_device
+         # here `guidance_scale` is defined analogously to the guidance weight `w` of equation (2)
+         # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
+         # corresponds to doing no classifier free guidance.
+         do_classifier_free_guidance = guidance_scale > 1.0
+
+         # 3. Encode input prompt
+         prompt_embeds = self._encode_prompt(
+             prompt,
+             device,
+             num_images_per_prompt,
+             do_classifier_free_guidance,
+             negative_prompt,
+             prompt_embeds=prompt_embeds,
+             negative_prompt_embeds=negative_prompt_embeds,
+         )
+
+         # 4. Prepare image
+         control_image = self.prepare_image(
+             control_image,
+             width,
+             height,
+             batch_size * num_images_per_prompt,
+             num_images_per_prompt,
+             device,
+             self.controlnet.dtype,
+         )
+
+         if do_classifier_free_guidance:
+             control_image = torch.cat([control_image] * 2)
+
+         # 5. Prepare timesteps
+         self.scheduler.set_timesteps(num_inference_steps, device=device)
+         timesteps = self.scheduler.timesteps
+
+         # 6. Prepare latent variables
+         num_channels_latents = self.controlnet.config.in_channels
+         latents = self.prepare_latents(
+             batch_size * num_images_per_prompt,
+             num_channels_latents,
+             height,
+             width,
+             prompt_embeds.dtype,
+             device,
+             generator,
+             latents,
+         )
+
+         # EXTRA: prepare mask latents
+         mask, masked_image = prepare_mask_and_masked_image(image, mask_image)
+         mask, masked_image_latents = self.prepare_mask_latents(
+             mask,
+             masked_image,
+             batch_size * num_images_per_prompt,
+             height,
+             width,
+             prompt_embeds.dtype,
+             device,
+             generator,
+             do_classifier_free_guidance,
+         )
+
+         # 7. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
+         extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)
+
+         # 8. Denoising loop
+         num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order
+         with self.progress_bar(total=num_inference_steps) as progress_bar:
+             for i, t in enumerate(timesteps):
+                 # expand the latents if we are doing classifier free guidance
+                 latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
+                 latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)
+
+                 down_block_res_samples, mid_block_res_sample = self.controlnet(
+                     latent_model_input,
+                     t,
+                     encoder_hidden_states=prompt_embeds,
+                     controlnet_cond=control_image,
+                     return_dict=False,
+                 )
+
+                 down_block_res_samples = [
+                     down_block_res_sample * controlnet_conditioning_scale
+                     for down_block_res_sample in down_block_res_samples
+                 ]
+                 mid_block_res_sample *= controlnet_conditioning_scale
+
+                 # predict the noise residual
+                 latent_model_input = torch.cat([latent_model_input, mask, masked_image_latents], dim=1)
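The `torch.cat` above stacks three tensors along the channel dimension: 4 latent channels, 1 downsampled mask channel, and 4 masked-image latent channels, producing the 9-channel input that an inpainting UNet expects. A torch-free sketch of that channel arithmetic (`cat_channels` is illustrative only, mimicking `torch.cat(..., dim=1)` on nested lists):

```python
def cat_channels(*tensors):
    # Mimic torch.cat(tensors, dim=1) on lists shaped [batch][channels]:
    # concatenate the channel lists of each sample in order.
    batch = len(tensors[0])
    assert all(len(t) == batch for t in tensors), "batch sizes must match"
    return [[c for t in tensors for c in t[b]] for b in range(batch)]

latents = [["z0", "z1", "z2", "z3"]] * 2   # 4 VAE latent channels, batch of 2
mask = [["m0"]] * 2                        # 1 downsampled mask channel
masked = [["w0", "w1", "w2", "w3"]] * 2    # 4 masked-image latent channels
combined = cat_channels(latents, mask, masked)
print(len(combined[0]))  # 9
```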
+                 noise_pred = self.unet(
+                     latent_model_input,
+                     t,
+                     encoder_hidden_states=prompt_embeds,
+                     cross_attention_kwargs=cross_attention_kwargs,
+                     down_block_additional_residuals=down_block_res_samples,
+                     mid_block_additional_residual=mid_block_res_sample,
+                 ).sample
+
+                 # perform guidance
+                 if do_classifier_free_guidance:
+                     noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
+                     noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
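The combination above is the standard classifier-free guidance formula, `uncond + w * (text - uncond)`. Per element it behaves like this small sketch (`cfg` is illustrative, not a pipeline method):

```python
def cfg(uncond, text, scale):
    # Classifier-free guidance: move from the unconditional prediction
    # toward (and, for scale > 1, past) the text-conditioned one.
    return uncond + scale * (text - uncond)

assert abs(cfg(0.2, 0.6, 1.0) - 0.6) < 1e-9   # scale 1 reduces to the conditional prediction
assert abs(cfg(0.2, 0.6, 7.5) - 3.2) < 1e-9   # 0.2 + 7.5 * 0.4
```

Since the pipeline only takes this branch when `guidance_scale > 1.0` (`do_classifier_free_guidance`), the `scale == 1` case never actually runs here.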
+
+                 # compute the previous noisy sample x_t -> x_t-1
+                 latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs).prev_sample
+
+                 # call the callback, if provided
+                 if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
+                     progress_bar.update()
+                     if callback is not None and i % callback_steps == 0:
+                         callback(i, t, latents)
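The update condition above fires once per effective inference step. A small self-check of that arithmetic, under the assumption that `len(timesteps) == num_inference_steps * scheduler.order` (true for first-order schedulers such as DDIM, where `order == 1`; the `order == 2` case is illustrative only):

```python
def progress_updates(num_inference_steps, order):
    # Replays the progress-bar condition from the denoising loop, assuming
    # len(timesteps) == num_inference_steps * order (an assumption here).
    timesteps = num_inference_steps * order
    num_warmup_steps = timesteps - num_inference_steps * order  # 0 under this assumption
    updates = 0
    for i in range(timesteps):
        if i == timesteps - 1 or ((i + 1) > num_warmup_steps and (i + 1) % order == 0):
            updates += 1
    return updates

print(progress_updates(50, 1), progress_updates(50, 2))  # 50 50
```

Either way the bar advances `num_inference_steps` times, matching `progress_bar(total=num_inference_steps)`.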
+
+         # If we do sequential model offloading, let's offload unet and controlnet
+         # manually for max memory savings
+         if hasattr(self, "final_offload_hook") and self.final_offload_hook is not None:
+             self.unet.to("cpu")
+             self.controlnet.to("cpu")
+             torch.cuda.empty_cache()
+
+         if output_type == "latent":
+             image = latents
+             has_nsfw_concept = None
+         elif output_type == "pil":
+             # 8. Post-processing
+             image = self.decode_latents(latents)
+
+             # 9. Run safety checker
+             image, has_nsfw_concept = self.run_safety_checker(image, device, prompt_embeds.dtype)
+
+             # 10. Convert to PIL
+             image = self.numpy_to_pil(image)
+         else:
+             # 8. Post-processing
+             image = self.decode_latents(latents)
+
+             # 9. Run safety checker
+             image, has_nsfw_concept = self.run_safety_checker(image, device, prompt_embeds.dtype)
+
+         # Offload last model to CPU
+         if hasattr(self, "final_offload_hook") and self.final_offload_hook is not None:
+             self.final_offload_hook.offload()
+
+         if not return_dict:
+             return (image, has_nsfw_concept)
+
+         return StableDiffusionPipelineOutput(images=image, nsfw_content_detected=has_nsfw_concept)