Commit 2bc3344 · 1 Parent(s): afa0f0b
zhaoyue-zephyrus committed

first commit
README.md CHANGED
@@ -1,3 +1,53 @@
  ---
  license: cc-by-nc-4.0
  ---
+
+ # QLIP
+
+ [\[📂 GitHub\]](https://github.com/NVlabs/QLIP)
+ [\[📃 QLIP Tech Report\]](http://arxiv.org/abs/2502.yyyyy)
+ [\[🔗 Project Page\]](http://nvlabs.github.io/QLIP/)
+ [\[🤗 HF Model\]](https://huggingface.co/NVIDIA/QLIP-L-14-392)
+
+ ## Introduction
+ We introduce Quantized Language-Image Pretraining (**QLIP**), a visual tokenization method that combines state-of-the-art reconstruction quality with state-of-the-art zero-shot image understanding.
+ QLIP trains a binary-spherical-quantization-based autoencoder with both reconstruction and language-image alignment objectives.
+ We are the first to show that the two objectives do not need to be at odds.
+ We balance the two loss terms dynamically during training and show that a two-stage training pipeline effectively reconciles the large-batch requirement of image-language pre-training with the memory bottleneck imposed by the reconstruction objective.
+ We validate the effectiveness of QLIP for multimodal understanding and text-conditioned image generation with a single model.
+ Specifically, QLIP serves as a drop-in replacement for the visual encoder of LLaVA and the image tokenizer of LlamaGen, with comparable or even better performance.
+ Finally, we demonstrate that QLIP enables a unified mixed-modality auto-regressive model for understanding and generation.
+
+ ## Model Zoo
+ We provide the following models:
+ | model name | #bits | CR<sub>&uarr;</sub> | 0-shot<sub>&uarr;</sub> | rFID<sub>&darr;</sub> | HF Link |
+ | ------------- | ------ | ----- | ------ | ---- | ------- |
+ | QLIP-B-16-256 | 28 | 219.4 | 74.3 | 3.21 | [🤗 link](https://huggingface.co/NVIDIA/QLIP-B-16-256) |
+ | QLIP-B-8-256 | 28 | 54.8 | 75.6 | 0.70 | [🤗 link](https://huggingface.co/NVIDIA/QLIP-B-8-256) |
+ | QLIP-L-14-392 | 28 | 168 | 79.1 | 1.46 | [🤗 link](https://huggingface.co/NVIDIA/QLIP-L-14-392) |
+
+ Note:
+ - **CR**: compression ratio = 24 × patch_size^2 / #bits (e.g., 24 × 16^2 / 28 ≈ 219.4 for QLIP-B-16-256);
+ - **0-shot**: zero-shot classification accuracy on IN-1k-val;
+ - **rFID**: reconstruction FID on IN-1k-val.
+
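The checkpoints are standard Hugging Face repositories with remote code (see `config.json` and `modeling_qlip.py` below), so they should load through `AutoModel` with `trust_remote_code=True`. A minimal loading sketch, assuming an environment with `torch` and a `transformers` version compatible with the one pinned in `config.json`:

```python
# Minimal sketch: load a QLIP checkpoint and inspect its configuration.
# Assumes the repo's remote code (configuration/modeling files) resolves via trust_remote_code.
from transformers import AutoConfig, AutoModel

model_id = "NVIDIA/QLIP-L-14-392"  # any entry from the table above
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).eval()

print(type(model).__name__)                # QLIPModel
print(config.vision_config.image_size)     # 392
print(config.vision_config.quantizer_cfg)  # BSQ settings, e.g. {'embed_dim': 28, ...}
```

For end-to-end zero-shot classification, reconstruction, or generation pipelines, refer to the linked GitHub repository.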
+ ## Citing QLIP
+
+ ```bibtex
+ @article{zhao2025qlip,
+   title={QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation},
+   author={Zhao, Yue and Xue, Fuzhao and Reed, Scott and Fan, Linxi and Zhu, Yuke and Kautz, Jan and Yu, Zhiding and Krähenbühl, Philipp and Huang, De-An},
+   journal={arXiv preprint arXiv:2502.yyyyy},
+   year={2025}
+ }
+ ```
+
+ ## Acknowledgement
+ The project builds upon the following open-source efforts:
+ - [EVA-CLIP](https://github.com/baaivision/EVA/tree/master/EVA-CLIP/rei): We use EVA-CLIP as the initialization, which significantly speeds up training convergence.
+
+ - [LLaVA](https://github.com/haotian-liu/LLaVA): We use LLaVA to evaluate multimodal understanding performance.
+
+ - [LlamaGen](https://github.com/FoundationVision/LlamaGen): We build the text-to-image generation evaluation on top of LlamaGen.
+
+ - [Lingua](https://github.com/facebookresearch/lingua): We build the unified multimodal model on top of Lingua.
bsq.py ADDED
@@ -0,0 +1,227 @@
+ # Copyright (c) 2024, NVIDIA Corporation & Affiliates. All rights reserved.
+ #
+ # This work is made available under the Nvidia Source Code License-NC.
+ # To view a copy of this license, visit
+ # https://github.com/NVlabs/QLIP/blob/main/LICENSE
+
+ # MIT License
+ # Based on https://github.com/zhaoyue-zephyrus/bsq-vit/blob/main/transcoder/models/quantizer/bsq.py
+
+ import torch
+ import torch.nn as nn
+ from einops import rearrange, reduce
+
+ _EPS = 1e-8
+
+
+ class DifferentiableEntropyFunction(torch.autograd.Function):
+     """Codebook entropy with a custom gradient w.r.t. the binary codes."""
+
+     @staticmethod
+     def forward(ctx, zq, basis, K, eps):
+         # Map codes in {-1, 1} to integer indices, histogram them over the 2**K codebook,
+         # and return the entropy of the (smoothed) empirical code distribution.
+         zb = (zq + 1) / 2
+         zi = ((zb * basis).sum(-1)).to(torch.int64)
+         cnt = torch.scatter_reduce(
+             torch.zeros(2**K, device=zq.device, dtype=zq.dtype),
+             0,
+             zi.flatten(),
+             torch.ones_like(zi.flatten()).to(zq.dtype),
+             "sum",
+         )
+         prob = (cnt + eps) / (cnt + eps).sum()
+         H = torch.special.entr(prob).sum()
+         ctx.save_for_backward(zq, zi, prob)
+         ctx.K = K
+         return H
+
+     @staticmethod
+     def backward(ctx, grad_output):
+         zq, zi, prob = ctx.saved_tensors
+         grad_array = -grad_output * (torch.log(prob) + 1) / zi.numel() / ctx.K
+         reord_grad = grad_array[zi.flatten()].reshape(zi.shape)
+         grad_input = reord_grad.unsqueeze(-1) * zq
+         return grad_input, None, None, None, None
+
+
+ def codebook_entropy(zq, basis, K, eps=1e-8):
+     return DifferentiableEntropyFunction.apply(zq, basis, K, eps)
+
+
+ class BinarySphericalQuantizer(nn.Module):
+     def __init__(
+         self,
+         embed_dim: int = 18,
+         group_size: int = 9,
+         soft_entropy: bool = True,
+         beta: float = 0.0,  # commit loss
+         gamma_0: float = 1.0,  # entropy loss (E[H(q)])
+         gamma_1: float = 1.0,  # entropy loss (H[E[q]])
+         input_format: str = "bchw",
+         persample_entropy_compute: str = "group",
+         l2_norm: bool = True,
+         inv_temperature: float = 100.0,
+     ):
+         super().__init__()
+         self.embed_dim = embed_dim
+         self.group_size = group_size
+         assert embed_dim % group_size == 0, "embed_dim must be divisible by group_size"
+         self.soft_entropy = soft_entropy
+         self.beta = beta
+         self.gamma_0 = gamma_0
+         self.gamma_1 = gamma_1
+         assert input_format in ["bchw", "blc"]
+         self.input_format = input_format
+         assert persample_entropy_compute in [
+             "group",
+             "analytical",
+         ], "persample_entropy_compute must be either 'group' or 'analytical'"
+         self.persample_entropy_compute = persample_entropy_compute
+         self.l2_norm = l2_norm
+         self.inv_temperature = inv_temperature
+
+         self.register_buffer("basis", 2 ** torch.arange(embed_dim - 1, -1, -1), persistent=False)
+         self.register_buffer(
+             "group_basis", 2 ** torch.arange(group_size - 1, -1, -1), persistent=False
+         )
+
+         group_codes = torch.arange(2**self.group_size)
+         group_codebook = self.indexes_to_codes(group_codes).float()[:, -group_size:]
+         self.register_buffer("group_codebook", group_codebook, persistent=False)
+
+     def quantize(self, z):
+         # Sign quantization with a straight-through estimator: the forward pass emits codes in
+         # {-1, 1}, the backward pass passes gradients through unchanged.
+         assert (
+             z.shape[-1] == self.embed_dim
+         ), f"Expected {self.embed_dim} dimensions, got {z.shape[-1]}"
+         zhat = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))
+         return z + (zhat - z).detach()
+
+     def forward(self, z):
+         if self.input_format == "bchw":
+             z = rearrange(z, "b c h w -> b h w c")
+         zq = self.quantize(z)
+
+         indices = self.codes_to_indexes(zq.detach())
+         group_indices = self.codes_to_group_indexes(zq.detach())
+
+         if not self.training:
+             used_codes = torch.unique(indices, return_counts=False)
+         else:
+             used_codes = None
+
+         if self.soft_entropy:
+             persample_entropy, cb_entropy = self.soft_entropy_loss(z)
+         else:
+             persample_entropy, cb_entropy = self.hard_entropy_loss(z)
+         entropy_penalty = self.gamma_0 * persample_entropy - self.gamma_1 * cb_entropy
+
+         q_scale = 1.0 / (self.embed_dim**0.5) if self.l2_norm else 1.0
+         zq = zq * q_scale
+         commit_loss = self.beta * torch.mean(((zq.detach() - z) ** 2).sum(dim=-1))
+
+         if self.input_format == "bchw":
+             zq = rearrange(zq, "b h w c -> b c h w")
+
+         return (
+             zq,
+             commit_loss + entropy_penalty / self.inv_temperature,
+             {
+                 "H": cb_entropy,
+                 "used_codes": used_codes,
+                 "indices": indices,
+                 "group_indices": group_indices,
+             },
+         )
+
+     def soft_entropy_loss(self, z):
+         group_codebook = self.group_codebook / (self.embed_dim**0.5 if self.l2_norm else 1)
+         divided_z = rearrange(z, "... (g c) -> ... g c", c=self.group_size)
+
+         if self.persample_entropy_compute == "group":
+             distance = -2 * torch.einsum("... g c, d c -> ... g d", divided_z, group_codebook)
+             prob = (-distance * self.inv_temperature).softmax(dim=-1)
+             persample_entropy = torch.special.entr(prob + _EPS).sum((-1, -2)).mean()
+         else:
+             p = torch.sigmoid(
+                 -4 * z / (self.embed_dim**0.5 if self.l2_norm else 1) * self.inv_temperature
+             )
+             prob = torch.stack([p, 1 - p], dim=-1)
+             persample_entropy = torch.special.entr(prob + _EPS).sum((-1, -2)).mean()
+
+         # macro average of the probability of each subgroup
+         avg_prob = reduce(prob, "... g d -> g d", "mean")
+         cb_entropy = torch.special.entr(avg_prob + _EPS).sum()
+
+         return persample_entropy, cb_entropy
+
+     def hard_entropy_loss(self, z):
+         zb = ((z + 1) / 2).reshape(z.shape[0], -1, z.shape[-1]).to(torch.float32)
+         prob_per_dim = zb.sum(1) / zb.shape[1]
+         prob = torch.stack([prob_per_dim, 1 - prob_per_dim], dim=-1)
+         persample_entropy = torch.special.entr(prob + _EPS).sum((-1, -2)).mean()
+         cb_entropy = codebook_entropy(z, self.basis, self.embed_dim)
+
+         return persample_entropy, cb_entropy
+
+     def codes_to_indexes(self, zhat):
+         """Converts a `code` to an index in the codebook.
+         Args:
+             zhat: A tensor of shape (B, ..., C) containing the codes. Must be in {-1, 1}.
+         """
+         assert (
+             zhat.shape[-1] == self.embed_dim
+         ), f"Expected {self.embed_dim} dimensions, got {zhat.shape[-1]}"
+         # Integer arithmetic keeps the index exact even when embed_dim exceeds float32 precision.
+         return ((zhat.int() + 1) // 2 * self.basis).sum(axis=-1).to(torch.int64)
+
+     def codes_to_group_indexes(self, zhat):
+         """Converts a `code` to a list of indexes (in groups) in the codebook.
+         Args:
+             zhat: A tensor of shape (B, ..., C) containing the codes. Must be in {-1, 1}.
+         """
+         zhat_in_group = rearrange(zhat, "b ... (g c) -> b ... g c", c=self.group_size)
+         return ((zhat_in_group.int() + 1) // 2 * self.group_basis).sum(axis=-1).to(torch.int64)
+
+     def indexes_to_codes(self, indices):
+         """Inverse of `codes_to_indexes`."""
+         indices = indices.unsqueeze(-1)
+         codes_non_centered = torch.remainder(torch.floor_divide(indices, self.basis), 2)
+         return codes_non_centered * 2 - 1
+
+     def group_indexes_to_codes(self, group_indices):
+         """Inverse of `codes_to_group_indexes`."""
+         group_indices = group_indices.unsqueeze(-1)
+         codes_non_centered = torch.remainder(torch.floor_divide(group_indices, self.group_basis), 2)
+         codes_non_centered = rearrange(codes_non_centered, "b ... g c -> b ... (g c)")
+         return codes_non_centered * 2 - 1
+
+     def get_group_codebook_entry(self, group_indices, one_hot=False):
+         """
+         Args:
+             group_indices: A tensor of shape (B, L, G, C) containing the group indices.
+         """
+         if one_hot:
+             z_q = group_indices @ self.group_codebook
+         else:
+             z_q = self.group_indexes_to_codes(group_indices)
+         q_scale = 1.0 / (self.embed_dim**0.5) if self.l2_norm else 1.0
+         z_q = z_q * q_scale
+         if self.input_format == "bchw":
+             h = w = int(z_q.shape[1] ** 0.5)
+             assert h * w == z_q.shape[1], "Invalid sequence length"
+             z_q = rearrange(z_q, "b (h w) c -> b c h w", h=h)
+         return z_q
+
+     def get_codebook_entry(self, indices, one_hot=False):
+         """
+         Args:
+             indices: A tensor of shape (B, L, C) containing the indices.
+         """
+         if one_hot:
+             assert self.embed_dim == self.group_size, "one_hot is only supported for group_size == embed_dim"
+             z_q = indices @ self.group_codebook
+         else:
+             z_q = self.indexes_to_codes(indices)
+         q_scale = 1.0 / (self.embed_dim**0.5) if self.l2_norm else 1.0
+         z_q = z_q * q_scale
+         if self.input_format == "bchw":
+             h = w = int(z_q.shape[1] ** 0.5)
+             assert h * w == z_q.shape[1], "Invalid sequence length"
+             z_q = rearrange(z_q, "b (h w) c -> b c h w", h=h)
+         return z_q
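`bsq.py` only depends on PyTorch and einops, so the quantizer can be exercised on its own. A minimal sketch, assuming the file is importable as `bsq` and using the `quantizer_cfg` values that appear in `config.json` below (`embed_dim=28`, `group_size=1`, `input_format="blc"`):

```python
import torch
from bsq import BinarySphericalQuantizer

quantizer = BinarySphericalQuantizer(
    embed_dim=28, group_size=1, input_format="blc", inv_temperature=1.0, l2_norm=True
).eval()

z = torch.randn(2, 784, 28)        # (batch, tokens, channels), e.g. a 28x28 grid of tokens
zq, aux_loss, info = quantizer(z)  # quantized features, entropy/commit penalty, statistics

print(zq.shape)                    # torch.Size([2, 784, 28]); entries are +-1/sqrt(28)
print(info["indices"].shape)       # torch.Size([2, 784]); integer codes in [0, 2**28)

# Decode the integer codes back to (scaled) binary features of the same shape.
z_dec = quantizer.get_codebook_entry(info["indices"])
print(z_dec.shape)                 # torch.Size([2, 784, 28])
```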
config.json ADDED
@@ -0,0 +1,79 @@
+ {
+   "_name_or_path": "EVA-BSQCLIP",
+   "architectures": [
+     "QLIPModel"
+   ],
+   "auto_map": {
+     "AutoConfig": "configuration_qlip.QLIPConfig",
+     "AutoModel": "modeling_qlip.QLIPModel"
+   },
+   "decoder_config": {
+     "dropout": 0.0,
+     "hidden_size": 1024,
+     "hidden_size_post_q": 1024,
+     "image_size": 392,
+     "intermediate_size": 2730,
+     "k_bias": false,
+     "layer_norm_eps": 1e-06,
+     "model_type": "clip_decoder_model",
+     "num_attention_heads": 16,
+     "num_hidden_layers": 24,
+     "patch_size": 14,
+     "rope": true,
+     "rope_shift": 0,
+     "subln": true,
+     "swiglu": true,
+     "use_bfloat16": true,
+     "use_rms_norm": true
+   },
+   "initializer_factor": 1.0,
+   "logit_scale_init_value": 2.6592,
+   "model_type": "clip",
+   "projection_dim": 768,
+   "text_config": {
+     "bos_token_id": 0,
+     "dropout": 0.0,
+     "eos_token_id": 2,
+     "hidden_size": 768,
+     "intermediate_size": 3072,
+     "num_attention_heads": 12,
+     "model_type": "clip_text_model",
+     "use_bfloat16": true,
+     "use_rms_norm": false
+   },
+   "text_projection_bias": false,
+   "torch_dtype": "float32",
+   "transformers_version": "4.37.2",
+   "vision_config": {
+     "dropout": 0.0,
+     "hidden_size": 1024,
+     "hidden_size_post_q": 1024,
+     "image_size": 392,
+     "intermediate_size": 2730,
+     "k_bias": false,
+     "layer_norm_eps": 1e-06,
+     "model_type": "clip_vision_model",
+     "num_attention_heads": 16,
+     "num_hidden_layers": 24,
+     "patch_size": 14,
+     "projection_dim": 1024,
+     "quantizer": "bsq",
+     "quantizer_cfg": {
+       "embed_dim": 28,
+       "group_size": 1,
+       "input_format": "blc",
+       "inv_temperature": 1.0,
+       "l2_norm": true
+     },
+     "quantizer_embed_type": "mlp",
+     "quantizer_l2_norm": true,
+     "rope": true,
+     "rope_shift": 1,
+     "subln": true,
+     "swiglu": true,
+     "use_bfloat16": true,
+     "use_rms_norm": true
+   },
+   "vision_projection_bias": true
+ }
+
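For reference, the token-budget arithmetic implied by the vision settings above (image_size 392, patch_size 14, a 28-bit code per patch) reproduces the compression ratio reported in the README table; a quick check:

```python
# Pure arithmetic based on the config values above; no QLIP code required.
image_size, patch_size, bits_per_token = 392, 14, 28

tokens_per_image = (image_size // patch_size) ** 2  # 28 x 28 = 784 tokens
input_bits = image_size * image_size * 24           # 24-bit RGB input
code_bits = tokens_per_image * bits_per_token

print(tokens_per_image)        # 784
print(input_bits / code_bits)  # 168.0, the CR listed for QLIP-L-14-392
```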
configuration_qlip.py ADDED
@@ -0,0 +1,566 @@
1
+ # Copyright (c) 2024, NVIDIA Corporation & Affiliates. All rights reserved.
2
+ #
3
+ # This work is made available under the Nvidia Source Code License-NC.
4
+ # To view a copy of this license, visit
5
+ # https://github.com/NVlabs/QLIP/blob/main/LICENSE
6
+
7
+ # Copyright 2021 The HuggingFace Inc. team. All rights reserved.
8
+ #
9
+ # Licensed under the Apache License, Version 2.0 (the "License");
10
+ # you may not use this file except in compliance with the License.
11
+ # You may obtain a copy of the License at
12
+ #
13
+ # http://www.apache.org/licenses/LICENSE-2.0
14
+ #
15
+ # Unless required by applicable law or agreed to in writing, software
16
+ # distributed under the License is distributed on an "AS IS" BASIS,
17
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
18
+ # See the License for the specific language governing permissions and
19
+ # limitations under the License.
20
+ """ CLIP model configuration"""
21
+
22
+ import os
23
+ from collections import OrderedDict
24
+ from typing import TYPE_CHECKING, Any, Mapping, Optional, Union
25
+
26
+
27
+ if TYPE_CHECKING:
28
+ from transformers.processing_utils import ProcessorMixin
29
+ from transformers.utils import TensorType
30
+
31
+ from transformers.configuration_utils import PretrainedConfig
32
+ from transformers.onnx import OnnxConfig
33
+ from transformers.utils import logging
34
+
35
+
36
+ logger = logging.get_logger(__name__)
37
+
38
+
39
+ class QLIPTextConfig(PretrainedConfig):
40
+ r"""
41
+ This is the configuration class to store the configuration of a [`CLIPTextModel`]. It is used to instantiate a CLIP
42
+ text encoder according to the specified arguments, defining the model architecture. Instantiating a configuration
43
+ with the defaults will yield a similar configuration to that of the text encoder of the CLIP
44
+ [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) architecture.
45
+
46
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
47
+ documentation from [`PretrainedConfig`] for more information.
48
+
49
+ Args:
50
+ vocab_size (`int`, *optional*, defaults to 49408):
51
+ Vocabulary size of the CLIP text model. Defines the number of different tokens that can be represented by
52
+ the `inputs_ids` passed when calling [`CLIPModel`].
53
+ hidden_size (`int`, *optional*, defaults to 512):
54
+ Dimensionality of the encoder layers and the pooler layer.
55
+ intermediate_size (`int`, *optional*, defaults to 2048):
56
+ Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
57
+ projection_dim (`int`, *optional*, defaults to 512):
58
+ Dimentionality of text and vision projection layers.
59
+ num_hidden_layers (`int`, *optional*, defaults to 12):
60
+ Number of hidden layers in the Transformer encoder.
61
+ num_attention_heads (`int`, *optional*, defaults to 8):
62
+ Number of attention heads for each attention layer in the Transformer encoder.
63
+ max_position_embeddings (`int`, *optional*, defaults to 77):
64
+ The maximum sequence length that this model might ever be used with. Typically set this to something large
65
+ just in case (e.g., 512 or 1024 or 2048).
66
+ hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`):
67
+ The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
68
+ `"relu"`, `"selu"` and `"gelu_new"` `"quick_gelu"` are supported.
69
+ layer_norm_eps (`float`, *optional*, defaults to 1e-05):
70
+ The epsilon used by the layer normalization layers.
71
+ attention_dropout (`float`, *optional*, defaults to 0.0):
72
+ The dropout ratio for the attention probabilities.
73
+ initializer_range (`float`, *optional*, defaults to 0.02):
74
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
75
+ initializer_factor (`float`, *optional*, defaults to 1.0):
76
+ A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
77
+ testing).
78
+ pad_token_id (`int`, *optional*, defaults to 1):
79
+ Padding token id.
80
+ bos_token_id (`int`, *optional*, defaults to 49406):
81
+ Beginning of stream token id.
82
+ eos_token_id (`int`, *optional*, defaults to 49407):
83
+ End of stream token id.
84
+
85
+ Example:
86
+
87
+ ```python
88
+ >>> from transformers import CLIPTextConfig, CLIPTextModel
89
+
90
+ >>> # Initializing a CLIPTextConfig with openai/clip-vit-base-patch32 style configuration
91
+ >>> configuration = CLIPTextConfig()
92
+
93
+ >>> # Initializing a CLIPTextModel (with random weights) from the openai/clip-vit-base-patch32 style configuration
94
+ >>> model = CLIPTextModel(configuration)
95
+
96
+ >>> # Accessing the model configuration
97
+ >>> configuration = model.config
98
+ ```"""
99
+
100
+ model_type = "clip_text_model"
101
+
102
+ def __init__(
103
+ self,
104
+ vocab_size=49408,
105
+ hidden_size=512,
106
+ intermediate_size=2048,
107
+ projection_dim=512,
108
+ num_hidden_layers=12,
109
+ num_attention_heads=8,
110
+ max_position_embeddings=77,
111
+ hidden_act="gelu",
112
+ layer_norm_eps=1e-5,
113
+ attention_dropout=0.0,
114
+ initializer_range=0.02,
115
+ initializer_factor=1.0,
116
+ # This differs from `CLIPTokenizer`'s default and from openai/clip
117
+ # See https://github.com/huggingface/transformers/pull/24773#issuecomment-1632287538
118
+ q_bias=True,
119
+ k_bias=True,
120
+ v_bias=True,
121
+ subln=False,
122
+ swiglu=False,
123
+ rope=False,
124
+ post_layernorm=False,
125
+ pad_token_id=1,
126
+ bos_token_id=49406,
127
+ eos_token_id=49407,
128
+ **kwargs,
129
+ ):
130
+ super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)
131
+
132
+ self.vocab_size = vocab_size
133
+ self.hidden_size = hidden_size
134
+ self.intermediate_size = intermediate_size
135
+ self.projection_dim = projection_dim
136
+ self.num_hidden_layers = num_hidden_layers
137
+ self.num_attention_heads = num_attention_heads
138
+ self.max_position_embeddings = max_position_embeddings
139
+ self.layer_norm_eps = layer_norm_eps
140
+ self.hidden_act = hidden_act
141
+ self.initializer_range = initializer_range
142
+ self.initializer_factor = initializer_factor
143
+ self.q_bias=q_bias
144
+ self.k_bias=k_bias
145
+ self.v_bias=v_bias
146
+ self.subln = subln
147
+ self.swiglu = swiglu
148
+ self.rope = rope
149
+ self.post_layernorm = post_layernorm
150
+ self.attention_dropout = attention_dropout
151
+
152
+ @classmethod
153
+ def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
154
+ cls._set_token_in_kwargs(kwargs)
155
+
156
+ config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
157
+
158
+ # get the text config dict if we are loading from CLIPConfig
159
+ if config_dict.get("model_type") == "clip":
160
+ config_dict = config_dict["text_config"]
161
+
162
+ if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
163
+ logger.warning(
164
+ f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
165
+ f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
166
+ )
167
+
168
+ return cls.from_dict(config_dict, **kwargs)
169
+
170
+
171
+ class QLIPVisionConfig(PretrainedConfig):
172
+ r"""
173
+ This is the configuration class to store the configuration of a [`CLIPVisionModel`]. It is used to instantiate a
174
+ CLIP vision encoder according to the specified arguments, defining the model architecture. Instantiating a
175
+ configuration with the defaults will yield a similar configuration to that of the vision encoder of the CLIP
176
+ [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) architecture.
177
+
178
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
179
+ documentation from [`PretrainedConfig`] for more information.
180
+
181
+ Args:
182
+ hidden_size (`int`, *optional*, defaults to 768):
183
+ Dimensionality of the encoder layers and the pooler layer.
184
+ intermediate_size (`int`, *optional*, defaults to 3072):
185
+ Dimensionality of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
186
+ projection_dim (`int`, *optional*, defaults to 512):
187
+ Dimentionality of text and vision projection layers.
188
+ num_hidden_layers (`int`, *optional*, defaults to 12):
189
+ Number of hidden layers in the Transformer encoder.
190
+ num_attention_heads (`int`, *optional*, defaults to 12):
191
+ Number of attention heads for each attention layer in the Transformer encoder.
192
+ num_channels (`int`, *optional*, defaults to 3):
193
+ The number of input channels.
194
+ image_size (`int`, *optional*, defaults to 224):
195
+ The size (resolution) of each image.
196
+ patch_size (`int`, *optional*, defaults to 32):
197
+ The size (resolution) of each patch.
198
+ hidden_act (`str` or `function`, *optional*, defaults to `"quick_gelu"`):
199
+ The non-linear activation function (function or string) in the encoder and pooler. If string, `"gelu"`,
200
+ `"relu"`, `"selu"` and `"gelu_new"` ``"quick_gelu"` are supported.
201
+ layer_norm_eps (`float`, *optional*, defaults to 1e-05):
202
+ The epsilon used by the layer normalization layers.
203
+ attention_dropout (`float`, *optional*, defaults to 0.0):
204
+ The dropout ratio for the attention probabilities.
205
+ initializer_range (`float`, *optional*, defaults to 0.02):
206
+ The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
207
+ initializer_factor (`float`, *optional*, defaults to 1.0):
208
+ A factor for initializing all weight matrices (should be kept to 1, used internally for initialization
209
+ testing).
210
+
211
+ Example:
212
+
213
+ ```python
214
+ >>> from transformers import CLIPVisionConfig, CLIPVisionModel
215
+
216
+ >>> # Initializing a CLIPVisionConfig with openai/clip-vit-base-patch32 style configuration
217
+ >>> configuration = CLIPVisionConfig()
218
+
219
+ >>> # Initializing a CLIPVisionModel (with random weights) from the openai/clip-vit-base-patch32 style configuration
220
+ >>> model = CLIPVisionModel(configuration)
221
+
222
+ >>> # Accessing the model configuration
223
+ >>> configuration = model.config
224
+ ```"""
225
+
226
+ model_type = "clip_vision_model"
227
+
228
+ def __init__(
229
+ self,
230
+ hidden_size=768,
231
+ intermediate_size=3072,
232
+ projection_dim=512,
233
+ num_hidden_layers=12,
234
+ num_attention_heads=12,
235
+ num_channels=3,
236
+ image_size=224,
237
+ patch_size=32,
238
+ hidden_act="gelu",
239
+ layer_norm_eps=1e-5,
240
+ attention_dropout=0.0,
241
+ initializer_range=0.02,
242
+ initializer_factor=1.0,
243
+ q_bias=True,
244
+ k_bias=True,
245
+ v_bias=True,
246
+ subln=False,
247
+ swiglu=False,
248
+ rope=False,
249
+ post_layernorm=False,
250
+ # quantizer specs
251
+ quantizer="none",
252
+ quantizer_l2_norm=False,
253
+ quantizer_embed_type="identity",
254
+ hidden_size_post_q=None,
255
+ quantizer_cfg=dict(),
256
+ **kwargs,
257
+ ):
258
+ super().__init__(**kwargs)
259
+
260
+ self.hidden_size = hidden_size
261
+ self.intermediate_size = intermediate_size
262
+ self.projection_dim = projection_dim
263
+ self.num_hidden_layers = num_hidden_layers
264
+ self.num_attention_heads = num_attention_heads
265
+ self.num_channels = num_channels
266
+ self.patch_size = patch_size
267
+ self.image_size = image_size
268
+ self.initializer_range = initializer_range
269
+ self.initializer_factor = initializer_factor
270
+ self.q_bias=q_bias
271
+ self.k_bias=k_bias
272
+ self.v_bias=v_bias
273
+ self.subln = subln
274
+ self.swiglu = swiglu
275
+ self.rope = rope
276
+ self.post_layernorm = post_layernorm
277
+ self.attention_dropout = attention_dropout
278
+ self.layer_norm_eps = layer_norm_eps
279
+ self.hidden_act = hidden_act
280
+
281
+ self.quantizer = quantizer
282
+ self.quantizer_l2_norm = quantizer_l2_norm
283
+ self.quantizer_embed_type = quantizer_embed_type
284
+ self.hidden_size_post_q = self.hidden_size if hidden_size_post_q is None else hidden_size_post_q
285
+ self.quantizer_cfg = quantizer_cfg
286
+
287
+ @classmethod
288
+ def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
289
+ cls._set_token_in_kwargs(kwargs)
290
+
291
+ config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
292
+
293
+ # get the vision config dict if we are loading from CLIPConfig
294
+ if config_dict.get("model_type") == "clip":
295
+ config_dict = config_dict["vision_config"]
296
+
297
+ if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
298
+ logger.warning(
299
+ f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
300
+ f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
301
+ )
302
+
303
+ return cls.from_dict(config_dict, **kwargs)
304
+
305
+
306
+ class QLIPDecoderConfig(PretrainedConfig):
307
+ model_type = "clip_decoder_model"
308
+
309
+ def __init__(
310
+ self,
311
+ hidden_size=768,
312
+ intermediate_size=3072,
313
+ projection_dim=512,
314
+ num_hidden_layers=12,
315
+ num_attention_heads=12,
316
+ num_channels=3,
317
+ image_size=224,
318
+ patch_size=32,
319
+ hidden_act="gelu",
320
+ layer_norm_eps=1e-5,
321
+ attention_dropout=0.0,
322
+ initializer_range=0.02,
323
+ initializer_factor=1.0,
324
+ q_bias=True,
325
+ k_bias=True,
326
+ v_bias=True,
327
+ subln=False,
328
+ swiglu=False,
329
+ rope=False,
330
+ post_layernorm=False,
331
+ # quantizer specs
332
+ quantizer="none",
333
+ quantizer_l2_norm=False,
334
+ quantizer_embed_type="identity",
335
+ hidden_size_post_q=None,
336
+ quantizer_cfg=dict(),
337
+ **kwargs,
338
+ ):
339
+ super().__init__(**kwargs)
340
+
341
+ self.hidden_size = hidden_size
342
+ self.intermediate_size = intermediate_size
343
+ self.projection_dim = projection_dim
344
+ self.num_hidden_layers = num_hidden_layers
345
+ self.num_attention_heads = num_attention_heads
346
+ self.num_channels = num_channels
347
+ self.patch_size = patch_size
348
+ self.image_size = image_size
349
+ self.initializer_range = initializer_range
350
+ self.initializer_factor = initializer_factor
351
+ self.q_bias=q_bias
352
+ self.k_bias=k_bias
353
+ self.v_bias=v_bias
354
+ self.subln = subln
355
+ self.swiglu = swiglu
356
+ self.rope = rope
357
+ self.post_layernorm = post_layernorm
358
+ self.attention_dropout = attention_dropout
359
+ self.layer_norm_eps = layer_norm_eps
360
+ self.hidden_act = hidden_act
361
+
362
+ self.quantizer = quantizer
363
+ self.quantizer_l2_norm = quantizer_l2_norm
364
+ self.quantizer_embed_type = quantizer_embed_type
365
+ self.hidden_size_post_q = self.hidden_size if hidden_size_post_q is None else hidden_size_post_q
366
+ self.quantizer_cfg = quantizer_cfg
367
+
368
+ @classmethod
369
+ def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
370
+ cls._set_token_in_kwargs(kwargs)
371
+
372
+ config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
373
+
374
+ # get the vision config dict if we are loading from CLIPConfig
375
+ if config_dict.get("model_type") == "clip":
376
+ config_dict = config_dict["vision_config"]
377
+
378
+ if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
379
+ logger.warning(
380
+ f"You are using a model of type {config_dict['model_type']} to instantiate a model of type "
381
+ f"{cls.model_type}. This is not supported for all configurations of models and can yield errors."
382
+ )
383
+
384
+ return cls.from_dict(config_dict, **kwargs)
385
+
386
+
387
+ class QLIPConfig(PretrainedConfig):
388
+ r"""
389
+ [`CLIPConfig`] is the configuration class to store the configuration of a [`CLIPModel`]. It is used to instantiate
390
+ a CLIP model according to the specified arguments, defining the text model and vision model configs. Instantiating
391
+ a configuration with the defaults will yield a similar configuration to that of the CLIP
392
+ [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) architecture.
393
+
394
+ Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
395
+ documentation from [`PretrainedConfig`] for more information.
396
+
397
+ Args:
398
+ text_config (`dict`, *optional*):
399
+ Dictionary of configuration options used to initialize [`CLIPTextConfig`].
400
+ vision_config (`dict`, *optional*):
401
+ Dictionary of configuration options used to initialize [`CLIPVisionConfig`].
402
+ projection_dim (`int`, *optional*, defaults to 512):
403
+ Dimentionality of text and vision projection layers.
404
+ logit_scale_init_value (`float`, *optional*, defaults to 2.6592):
405
+ The inital value of the *logit_scale* paramter. Default is used as per the original CLIP implementation.
406
+ kwargs (*optional*):
407
+ Dictionary of keyword arguments.
408
+
409
+ Example:
410
+
411
+ ```python
412
+ >>> from transformers import CLIPConfig, CLIPModel
413
+
414
+ >>> # Initializing a CLIPConfig with openai/clip-vit-base-patch32 style configuration
415
+ >>> configuration = CLIPConfig()
416
+
417
+ >>> # Initializing a CLIPModel (with random weights) from the openai/clip-vit-base-patch32 style configuration
418
+ >>> model = CLIPModel(configuration)
419
+
420
+ >>> # Accessing the model configuration
421
+ >>> configuration = model.config
422
+
423
+ >>> # We can also initialize a CLIPConfig from a CLIPTextConfig and a CLIPVisionConfig
424
+ >>> from transformers import CLIPTextConfig, CLIPVisionConfig
425
+
426
+ >>> # Initializing a CLIPText and CLIPVision configuration
427
+ >>> config_text = CLIPTextConfig()
428
+ >>> config_vision = CLIPVisionConfig()
429
+
430
+ >>> config = CLIPConfig.from_text_vision_configs(config_text, config_vision)
431
+ ```"""
432
+
433
+ model_type = "clip"
434
+
435
+ def __init__(
436
+ self, text_config=None, vision_config=None, decoder_config=None, projection_dim=512, logit_scale_init_value=2.6592, **kwargs
437
+ ):
438
+ # If `_config_dict` exist, we use them for the backward compatibility.
439
+ # We pop out these 2 attributes before calling `super().__init__` to avoid them being saved (which causes a lot
440
+ # of confusion!).
441
+ text_config_dict = kwargs.pop("text_config_dict", None)
442
+ vision_config_dict = kwargs.pop("vision_config_dict", None)
443
+ decoder_config_dict = kwargs.pop("decoder_config_dict", None)
444
+
445
+ super().__init__(**kwargs)
446
+
447
+ # Instead of simply assigning `[text|vision]_config_dict` to `[text|vision]_config`, we use the values in
448
+ # `[text|vision]_config_dict` to update the values in `[text|vision]_config`. The values should be same in most
449
+ # cases, but we don't want to break anything regarding `_config_dict` that existed before commit `8827e1b2`.
450
+ if text_config_dict is not None:
451
+ if text_config is None:
452
+ text_config = {}
453
+
454
+ # This is the complete result when using `text_config_dict`.
455
+ _text_config_dict = QLIPTextConfig(**text_config_dict).to_dict()
456
+
457
+ # Give a warning if the values exist in both `_text_config_dict` and `text_config` but being different.
458
+ for key, value in _text_config_dict.items():
459
+ if key in text_config and value != text_config[key] and key not in ["transformers_version"]:
460
+ # If specified in `text_config_dict`
461
+ if key in text_config_dict:
462
+ message = (
463
+ f"`{key}` is found in both `text_config_dict` and `text_config` but with different values. "
464
+ f'The value `text_config_dict["{key}"]` will be used instead.'
465
+ )
466
+ # If inferred from default argument values (just to be super careful)
467
+ else:
468
+ message = (
469
+ f"`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The "
470
+ f'value `text_config["{key}"]` will be overriden.'
471
+ )
472
+ logger.info(message)
473
+
474
+ # Update all values in `text_config` with the ones in `_text_config_dict`.
475
+ text_config.update(_text_config_dict)
476
+
477
+ if vision_config_dict is not None:
478
+ if vision_config is None:
479
+ vision_config = {}
480
+
481
+ # This is the complete result when using `vision_config_dict`.
482
+ _vision_config_dict = QLIPVisionConfig(**vision_config_dict).to_dict()
483
+ # convert keys to string instead of integer
484
+ if "id2label" in _vision_config_dict:
485
+ _vision_config_dict["id2label"] = {
486
+ str(key): value for key, value in _vision_config_dict["id2label"].items()
487
+ }
488
+
489
+ # Give a warning if the values exist in both `_vision_config_dict` and `vision_config` but being different.
490
+ for key, value in _vision_config_dict.items():
491
+ if key in vision_config and value != vision_config[key] and key not in ["transformers_version"]:
492
+ # If specified in `vision_config_dict`
493
+ if key in vision_config_dict:
494
+ message = (
495
+ f"`{key}` is found in both `vision_config_dict` and `vision_config` but with different "
496
+ f'values. The value `vision_config_dict["{key}"]` will be used instead.'
497
+ )
498
+ # If inferred from default argument values (just to be super careful)
499
+ else:
500
+ message = (
501
+ f"`vision_config_dict` is provided which will be used to initialize `CLIPVisionConfig`. "
502
+ f'The value `vision_config["{key}"]` will be overriden.'
503
+ )
504
+ logger.info(message)
505
+
506
+ # Update all values in `vision_config` with the ones in `_vision_config_dict`.
507
+ vision_config.update(_vision_config_dict)
508
+
509
+ if decoder_config_dict is not None:
510
+ if decoder_config is None:
511
+ decoder_config = {}
512
+
513
+ # This is the complete result when using `decoder_config_dict`.
514
+ _decoder_config_dict = QLIPDecoderConfig(**decoder_config_dict).to_dict()
515
+
516
+ # Give a warning if the values exist in both `_decoder_config_dict` and `decoder_config` but being different.
517
+ for key, value in _decoder_config_dict.items():
518
+ if key in decoder_config and value != decoder_config[key] and key not in ["transformers_version"]:
519
+ # If specified in `decoder_config_dict`
520
+ if key in decoder_config_dict:
521
+ message = (
522
+ f"`{key}` is found in both `decoder_config_dict` and `decoder_config` but with different values. "
523
+ f'The value `decoder_config_dict["{key}"]` will be used instead.'
524
+ )
525
+ # If inferred from default argument values (just to be super careful)
526
+ else:
527
+ message = (
528
+ f"`decoder_config_dict` is provided which will be used to initialize `QLIPDecoderConfig`. The "
529
+ f'value `decoder_config["{key}"]` will be overriden.'
530
+ )
531
+ logger.info(message)
532
+
533
+ # Update all values in `decoder_config` with the ones in `_decoder_config_dict`.
534
+ decoder_config.update(_decoder_config_dict)
535
+
536
+ if text_config is None:
537
+ text_config = {}
538
+ logger.info("`text_config` is `None`. Initializing the `CLIPTextConfig` with default values.")
539
+
540
+ if vision_config is None:
541
+ vision_config = {}
542
+ logger.info("`vision_config` is `None`. initializing the `CLIPVisionConfig` with default values.")
543
+
544
+ if decoder_config is None:
545
+ decoder_config = {}
546
+ logger.info("`decoder_config` is `None`. initializing the `CLIPDecoderConfig` with default values.")
547
+
548
+ self.text_config = QLIPTextConfig(**text_config)
549
+ self.vision_config = QLIPVisionConfig(**vision_config)
550
+ self.decoder_config = QLIPDecoderConfig(**decoder_config)
551
+
552
+ self.projection_dim = projection_dim
553
+ self.logit_scale_init_value = logit_scale_init_value
554
+ self.initializer_factor = 1.0
555
+
556
+ @classmethod
557
+ def from_text_vision_configs(cls, text_config: QLIPTextConfig, vision_config: QLIPVisionConfig, decoder_config: QLIPDecoderConfig, **kwargs):
558
+ r"""
559
+ Instantiate a [`CLIPConfig`] (or a derived class) from clip text model configuration and clip vision model
560
+ configuration.
561
+
562
+ Returns:
563
+ [`CLIPConfig`]: An instance of a configuration object
564
+ """
565
+
566
+ return cls(text_config=text_config.to_dict(), vision_config=vision_config.to_dict(), decoder_config=decoder_config.to_dict(), **kwargs)
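`QLIPConfig` composes the three sub-configurations in the same way `CLIPConfig` does, with `from_text_vision_configs` additionally taking the decoder config. A minimal sketch, assuming `configuration_qlip.py` is importable from the working directory (the same assumption `modeling_qlip.py` below makes) and that `transformers` is installed; the values are illustrative, while the shipped `config.json` above holds the released ones:

```python
from configuration_qlip import (
    QLIPConfig,
    QLIPDecoderConfig,
    QLIPTextConfig,
    QLIPVisionConfig,
)

# Illustrative values only; see config.json above for the QLIP-L-14-392 settings.
text_config = QLIPTextConfig(hidden_size=768, num_attention_heads=12)
vision_config = QLIPVisionConfig(
    image_size=392, patch_size=14, quantizer="bsq",
    quantizer_cfg={"embed_dim": 28, "group_size": 1},
)
decoder_config = QLIPDecoderConfig(image_size=392, patch_size=14)

config = QLIPConfig.from_text_vision_configs(text_config, vision_config, decoder_config)
print(config.vision_config.quantizer_cfg)  # {'embed_dim': 28, 'group_size': 1}
print(config.model_type)                   # "clip"
```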
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:edb2e10b0fdab7aab3a4019a019fec980b963685cf78d0c1e21179b58346cf9e
+ size 2953399476
modeling_qlip.py ADDED
@@ -0,0 +1,1482 @@
1
+ # Copyright (c) 2024, NVIDIA Corporation & Affiliates. All rights reserved.
2
+ #
3
+ # This work is made available under the Nvidia Source Code License-NC.
4
+ # To view a copy of this license, visit
5
+ # https://github.com/NVlabs/QLIP/blob/main/LICENSE
6
+
7
+ # coding=utf-8
8
+ # Copyright 2021 The OpenAI Team Authors and The HuggingFace Team. All rights reserved.
9
+ #
10
+ # Licensed under the Apache License, Version 2.0 (the "License");
11
+ # you may not use this file except in compliance with the License.
12
+ # You may obtain a copy of the License at
13
+ #
14
+ # http://www.apache.org/licenses/LICENSE-2.0
15
+ #
16
+ # Unless required by applicable law or agreed to in writing, software
17
+ # distributed under the License is distributed on an "AS IS" BASIS,
18
+ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
19
+ # See the License for the specific language governing permissions and
20
+ # limitations under the License.
21
+ """ PyTorch CLIP model."""
22
+
23
+
24
+ from collections import OrderedDict
25
+ from dataclasses import dataclass
26
+ from typing import Any, Optional, Tuple, Union
27
+
28
+ from einops import rearrange
29
+ import torch
30
+ import torch.utils.checkpoint
31
+ from torch import nn
32
+ import torch.nn.functional as F
33
+
34
+ from transformers.activations import ACT2FN
35
+ from transformers.modeling_attn_mask_utils import _create_4d_causal_attention_mask, _prepare_4d_attention_mask
36
+ from transformers.modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling
37
+ from transformers.modeling_utils import PreTrainedModel
38
+ from transformers.utils import (
39
+ ModelOutput,
40
+ add_start_docstrings,
41
+ add_start_docstrings_to_model_forward,
42
+ logging,
43
+ replace_return_docstrings,
44
+ )
45
+
46
+ from configuration_qlip import QLIPConfig, QLIPTextConfig, QLIPVisionConfig, QLIPDecoderConfig
47
+ from bsq import BinarySphericalQuantizer
48
+ from rope import VisionRotaryEmbeddingFast
49
+
50
+
51
+ logger = logging.get_logger(__name__)
52
+
53
+ _CHECKPOINT_FOR_DOC = "openai/clip-vit-base-patch32"
54
+
55
+ CLIP_PRETRAINED_MODEL_ARCHIVE_LIST = [
56
+ "openai/clip-vit-base-patch32",
57
+ # See all CLIP models at https://huggingface.co/models?filter=clip
58
+ ]
59
+
60
+
61
+ # contrastive loss function, adapted from
62
+ # https://sachinruk.github.io/blog/2021-03-07-clip.html
63
+ def contrastive_loss(logits: torch.Tensor) -> torch.Tensor:
64
+ return nn.functional.cross_entropy(logits, torch.arange(len(logits), device=logits.device))
65
+
66
+
67
+ def clip_loss(similarity: torch.Tensor) -> torch.Tensor:
68
+ caption_loss = contrastive_loss(similarity)
69
+ image_loss = contrastive_loss(similarity.t())
70
+ return (caption_loss + image_loss) / 2.0
71
+
72
+
73
+ @dataclass
74
+ class QLIPVisionModelOutput(ModelOutput):
75
+ """
76
+ Base class for vision model's outputs that also contains image embeddings of the pooling of the last hidden states.
77
+
78
+ Args:
79
+ image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
80
+ The image embeddings obtained by applying the projection layer to the pooler_output.
81
+ last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
82
+ Sequence of hidden-states at the output of the last layer of the model.
83
+ hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
84
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
85
+ one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
86
+
87
+ Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
88
+ attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
89
+ Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
90
+ sequence_length)`.
91
+
92
+ Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
93
+ heads.
94
+ """
95
+
96
+ image_embeds: Optional[torch.FloatTensor] = None
97
+ last_hidden_state: torch.FloatTensor = None
98
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
99
+ attentions: Optional[Tuple[torch.FloatTensor]] = None
100
+
101
+
102
+ @dataclass
103
+ class QLIPTextModelOutput(ModelOutput):
104
+ """
105
+ Base class for text model's outputs that also contains a pooling of the last hidden states.
106
+
107
+ Args:
108
+ text_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
109
+ The text embeddings obtained by applying the projection layer to the pooler_output.
110
+ last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
111
+ Sequence of hidden-states at the output of the last layer of the model.
112
+ hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
113
+ Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
114
+ one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
115
+
116
+ Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
117
+ attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
118
+ Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
119
+ sequence_length)`.
120
+
121
+ Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
122
+ heads.
123
+ """
124
+
125
+ text_embeds: Optional[torch.FloatTensor] = None
126
+ last_hidden_state: torch.FloatTensor = None
127
+ hidden_states: Optional[Tuple[torch.FloatTensor]] = None
128
+ attentions: Optional[Tuple[torch.FloatTensor]] = None
129
+
130
+
131
+ @dataclass
132
+ class QLIPOutput(ModelOutput):
133
+ """
134
+ Args:
135
+ loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `return_loss` is `True`):
136
+ Contrastive loss for image-text similarity.
137
+ logits_per_image:(`torch.FloatTensor` of shape `(image_batch_size, text_batch_size)`):
138
+ The scaled dot product scores between `image_embeds` and `text_embeds`. This represents the image-text
139
+ similarity scores.
140
+ logits_per_text:(`torch.FloatTensor` of shape `(text_batch_size, image_batch_size)`):
141
+ The scaled dot product scores between `text_embeds` and `image_embeds`. This represents the text-image
142
+ similarity scores.
143
+ text_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim`):
144
+ The text embeddings obtained by applying the projection layer to the pooled output of [`CLIPTextModel`].
145
+ image_embeds(`torch.FloatTensor` of shape `(batch_size, output_dim`):
146
+ The image embeddings obtained by applying the projection layer to the pooled output of [`CLIPVisionModel`].
147
+ text_model_output(`BaseModelOutputWithPooling`):
148
+ The output of the [`CLIPTextModel`].
149
+ vision_model_output(`BaseModelOutputWithPooling`):
150
+ The output of the [`CLIPVisionModel`].
151
+ """
152
+
153
+ loss: Optional[torch.FloatTensor] = None
154
+ logits_per_image: torch.FloatTensor = None
155
+ logits_per_text: torch.FloatTensor = None
156
+ text_embeds: torch.FloatTensor = None
157
+ image_embeds: torch.FloatTensor = None
158
+ text_model_output: BaseModelOutputWithPooling = None
159
+ vision_model_output: BaseModelOutputWithPooling = None
160
+ reconstructions: torch.FloatTensor = None
161
+
162
+ def to_tuple(self) -> Tuple[Any]:
163
+ return tuple(
164
+ self[k] if k not in ["text_model_output", "vision_model_output"] else getattr(self, k).to_tuple()
165
+ for k in self.keys()
166
+ )
167
+
168
+
169
+ class QLIPVisionEmbeddings(nn.Module):
170
+ def __init__(self, config: QLIPVisionConfig):
171
+ super().__init__()
172
+ self.config = config
173
+ self.embed_dim = config.hidden_size
174
+ self.image_size = config.image_size
175
+ self.patch_size = config.patch_size
176
+
177
+ self.class_embedding = nn.Parameter(torch.randn(self.embed_dim))
178
+
179
+ self.patch_embedding = nn.Conv2d(
180
+ in_channels=config.num_channels,
181
+ out_channels=self.embed_dim,
182
+ kernel_size=self.patch_size,
183
+ stride=self.patch_size,
184
+ bias=True,
185
+ )
186
+
187
+ self.num_patches = (self.image_size // self.patch_size) ** 2
188
+ self.num_positions = self.num_patches + 1
189
+ self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
190
+ self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1)), persistent=False)
191
+
192
+ def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor:
193
+ batch_size = pixel_values.shape[0]
194
+ target_dtype = self.patch_embedding.weight.dtype
195
+ patch_embeds = self.patch_embedding(pixel_values.to(dtype=target_dtype)) # shape = [*, width, grid, grid]
196
+ patch_embeds = patch_embeds.flatten(2).transpose(1, 2)
197
+
198
+ class_embeds = self.class_embedding.expand(batch_size, 1, -1)
199
+ embeddings = torch.cat([class_embeds, patch_embeds], dim=1)
200
+ embeddings = embeddings + self.position_embedding(self.position_ids)
201
+ return embeddings
202
+
203
+
204
+ class QLIPTextEmbeddings(nn.Module):
205
+ def __init__(self, config: QLIPTextConfig):
206
+ super().__init__()
207
+ embed_dim = config.hidden_size
208
+
209
+ self.token_embedding = nn.Embedding(config.vocab_size, embed_dim)
210
+ self.position_embedding = nn.Embedding(config.max_position_embeddings, embed_dim)
211
+
212
+ # position_ids (1, len position emb) is contiguous in memory and exported when serialized
213
+ self.register_buffer(
214
+ "position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)), persistent=False
215
+ )
216
+
217
+ def forward(
218
+ self,
219
+ input_ids: Optional[torch.LongTensor] = None,
220
+ position_ids: Optional[torch.LongTensor] = None,
221
+ inputs_embeds: Optional[torch.FloatTensor] = None,
222
+ ) -> torch.Tensor:
223
+ seq_length = input_ids.shape[-1] if input_ids is not None else inputs_embeds.shape[-2]
224
+
225
+ if position_ids is None:
226
+ position_ids = self.position_ids[:, :seq_length]
227
+
228
+ if inputs_embeds is None:
229
+ inputs_embeds = self.token_embedding(input_ids)
230
+
231
+ position_embeddings = self.position_embedding(position_ids)
232
+ embeddings = inputs_embeds + position_embeddings
233
+
234
+ return embeddings
235
+
236
+
237
+ class QLIPAttention(nn.Module):
238
+ """Multi-headed attention from 'Attention Is All You Need' paper"""
239
+
240
+ def __init__(self, config, rope=None, rope_shift=1):
241
+ super().__init__()
242
+ self.config = config
243
+ self.embed_dim = config.hidden_size
244
+ self.num_heads = config.num_attention_heads
245
+ self.head_dim = self.embed_dim // self.num_heads
246
+ if self.head_dim * self.num_heads != self.embed_dim:
247
+ raise ValueError(
248
+ f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:"
249
+ f" {self.num_heads})."
250
+ )
251
+ self.scale = self.head_dim**-0.5
252
+ self.dropout = config.attention_dropout
253
+
254
+ self.subln = config.subln
255
+ self.k_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.k_bias)
256
+ self.v_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.v_bias)
257
+ self.q_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=config.q_bias)
258
+ self.inner_attn_ln = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps) if config.subln else nn.Identity()
259
+ self.out_proj = nn.Linear(self.embed_dim, self.embed_dim, bias=True)
260
+
261
+ self.rope = rope
262
+ self.rope_shift = rope_shift
263
+
264
+ def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
265
+ return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
266
+
267
+ def forward(
268
+ self,
269
+ hidden_states: torch.Tensor,
270
+ attention_mask: Optional[torch.Tensor] = None,
271
+ causal_attention_mask: Optional[torch.Tensor] = None,
272
+ output_attentions: Optional[bool] = False,
273
+ ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
274
+ """Input shape: Batch x Time x Channel"""
275
+
276
+ bsz, tgt_len, embed_dim = hidden_states.size()
277
+
278
+ # get query proj
279
+ query_states = self.q_proj(hidden_states) * self.scale
280
+ key_states = self._shape(self.k_proj(hidden_states), -1, bsz)
281
+ value_states = self._shape(self.v_proj(hidden_states), -1, bsz)
282
+
283
+ proj_shape = (bsz * self.num_heads, -1, self.head_dim)
284
+ query_states = self._shape(query_states, tgt_len, bsz).view(*proj_shape)
285
+ key_states = key_states.view(*proj_shape)
286
+ value_states = value_states.view(*proj_shape)
287
+
288
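+ # rotary position encoding is applied to all but the first `rope_shift` tokens
+ # (the vision tower passes rope_shift=1 to skip the class token; the decoder passes 0)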
+ if self.rope:
289
+ q_t = query_states[:, self.rope_shift:, :]
290
+ ro_q_t = self.rope(q_t)
291
+ query_states = torch.cat([query_states[:, :self.rope_shift, :], ro_q_t], dim=-2).type_as(value_states)
292
+
293
+ k_t = key_states[:, self.rope_shift:, :]
294
+ ro_k_t = self.rope(k_t)
295
+ key_states = torch.cat([key_states[:, :self.rope_shift, :], ro_k_t], dim=-2).type_as(value_states)
296
+
297
+ src_len = key_states.size(1)
298
+ attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
299
+
300
+ if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
301
+ raise ValueError(
302
+ f"Attention weights should be of size {(bsz * self.num_heads, tgt_len, src_len)}, but is"
303
+ f" {attn_weights.size()}"
304
+ )
305
+
306
+ # apply the causal_attention_mask first
307
+ if causal_attention_mask is not None:
308
+ if causal_attention_mask.size() != (bsz, 1, tgt_len, src_len):
309
+ raise ValueError(
310
+ f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is"
311
+ f" {causal_attention_mask.size()}"
312
+ )
313
+ attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + causal_attention_mask
314
+ attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)
315
+
316
+ if attention_mask is not None:
317
+ if attention_mask.size() != (bsz, 1, tgt_len, src_len):
318
+ raise ValueError(
319
+ f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is {attention_mask.size()}"
320
+ )
321
+ attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + attention_mask
322
+ attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)
323
+
324
+ attn_weights = nn.functional.softmax(attn_weights, dim=-1)
325
+
326
+ if output_attentions:
327
+ # this operation is a bit awkward, but it's required to
328
+ # make sure that attn_weights keeps its gradient.
329
+ # In order to do so, attn_weights have to be reshaped
330
+ # twice and have to be reused in the following
331
+ attn_weights_reshaped = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
332
+ attn_weights = attn_weights_reshaped.view(bsz * self.num_heads, tgt_len, src_len)
333
+ else:
334
+ attn_weights_reshaped = None
335
+
336
+ attn_probs = nn.functional.dropout(attn_weights, p=self.dropout, training=self.training)
337
+
338
+ attn_output = torch.bmm(attn_probs, value_states)
339
+
340
+ if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
341
+ raise ValueError(
342
+ f"`attn_output` should be of size {(bsz * self.num_heads, tgt_len, self.head_dim)}, but is"
343
+ f" {attn_output.size()}"
344
+ )
345
+
346
+ attn_output = attn_output.view(bsz, self.num_heads, tgt_len, self.head_dim)
347
+ attn_output = attn_output.transpose(1, 2)
348
+ attn_output = attn_output.reshape(bsz, tgt_len, embed_dim)
349
+
350
+ attn_output = self.inner_attn_ln(attn_output)
351
+ attn_output = self.out_proj(attn_output)
352
+
353
+ return attn_output, attn_weights_reshaped
354
+
355
+
356
+ class QLIPSwiGLU(nn.Module):
357
+ def __init__(self, config):
358
+ super().__init__()
359
+ self.config = config
360
+ self.hidden_size = config.hidden_size
361
+ self.intermediate_size = config.intermediate_size
362
+ self.w1 = nn.Linear(self.hidden_size, self.intermediate_size)
363
+ self.w2 = nn.Linear(self.hidden_size, self.intermediate_size)
364
+ self.w3 = nn.Linear(self.intermediate_size, self.hidden_size)
365
+ self.act_fn = nn.SiLU()
366
+ self.ffn_ln = nn.LayerNorm(self.intermediate_size, eps=config.layer_norm_eps) if config.subln else nn.Identity()
367
+
368
+ def forward(self, x):
369
+ x1 = self.w1(x)
370
+ x2 = self.w2(x)
371
+ hidden = self.act_fn(x1) * x2
372
+ x = self.ffn_ln(hidden)
373
+ x = self.w3(x)
374
+ return x
375
+
376
+
377
+ class QLIPMLP(nn.Module):
378
+ def __init__(self, config):
379
+ super().__init__()
380
+ self.config = config
381
+ self.activation_fn = ACT2FN[config.hidden_act]
382
+ self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size)
383
+ self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size)
384
+ self.ffn_ln = nn.LayerNorm(config.intermediate_size, eps=config.layer_norm_eps) if config.subln else nn.Identity()
385
+
386
+ def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
387
+ hidden_states = self.fc1(hidden_states)
388
+ hidden_states = self.activation_fn(hidden_states)
389
+ hidden_states = self.ffn_ln(hidden_states)
390
+ hidden_states = self.fc2(hidden_states)
391
+ return hidden_states
392
+
393
+
394
+ class QLIPEncoderLayer(nn.Module):
395
+ def __init__(self, config: QLIPConfig, rope=None, rope_shift=1):
396
+ super().__init__()
397
+ self.embed_dim = config.hidden_size
398
+ self.self_attn = QLIPAttention(config, rope=rope, rope_shift=rope_shift)
399
+ self.layer_norm1 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
400
+ self.mlp = QLIPSwiGLU(config) if config.swiglu else QLIPMLP(config)
401
+ self.layer_norm2 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
402
+
403
+ def forward(
404
+ self,
405
+ hidden_states: torch.Tensor,
406
+ attention_mask: torch.Tensor,
407
+ causal_attention_mask: torch.Tensor,
408
+ output_attentions: Optional[bool] = False,
409
+ ) -> Tuple[torch.FloatTensor]:
410
+ """
411
+ Args:
412
+ hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
413
+ attention_mask (`torch.FloatTensor`): attention mask of size
414
+ `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
415
416
+ output_attentions (`bool`, *optional*):
417
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
418
+ returned tensors for more detail.
419
+ """
420
+ residual = hidden_states
421
+
422
+ hidden_states = self.layer_norm1(hidden_states)
423
+ hidden_states, attn_weights = self.self_attn(
424
+ hidden_states=hidden_states,
425
+ attention_mask=attention_mask,
426
+ causal_attention_mask=causal_attention_mask,
427
+ output_attentions=output_attentions,
428
+ )
429
+ hidden_states = residual + hidden_states
430
+
431
+ residual = hidden_states
432
+ hidden_states = self.layer_norm2(hidden_states)
433
+ hidden_states = self.mlp(hidden_states)
434
+ hidden_states = residual + hidden_states
435
+
436
+ outputs = (hidden_states,)
437
+
438
+ if output_attentions:
439
+ outputs += (attn_weights,)
440
+
441
+ return outputs
442
+
443
+
444
+ class QLIPPreTrainedModel(PreTrainedModel):
445
+ """
446
+ An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
447
+ models.
448
+ """
449
+
450
+ config_class = QLIPConfig
451
+ base_model_prefix = "clip"
452
+ supports_gradient_checkpointing = True
453
+
454
+ def _init_weights(self, module):
455
+ """Initialize the weights"""
456
+ factor = self.config.initializer_factor
457
+ if isinstance(module, QLIPTextEmbeddings):
458
+ module.token_embedding.weight.data.normal_(mean=0.0, std=factor * 0.02)
459
+ module.position_embedding.weight.data.normal_(mean=0.0, std=factor * 0.02)
460
+ elif isinstance(module, QLIPVisionEmbeddings):
461
+ factor = self.config.initializer_factor
462
+ nn.init.normal_(module.class_embedding, mean=0.0, std=module.embed_dim**-0.5 * factor)
463
+ nn.init.normal_(module.patch_embedding.weight, std=module.config.initializer_range * factor)
464
+ nn.init.normal_(module.position_embedding.weight, std=module.config.initializer_range * factor)
465
+ elif isinstance(module, QLIPAttention):
466
+ factor = self.config.initializer_factor
467
+ in_proj_std = (module.embed_dim**-0.5) * ((2 * module.config.num_hidden_layers) ** -0.5) * factor
468
+ out_proj_std = (module.embed_dim**-0.5) * factor
469
+ nn.init.normal_(module.q_proj.weight, std=in_proj_std)
470
+ nn.init.normal_(module.k_proj.weight, std=in_proj_std)
471
+ nn.init.normal_(module.v_proj.weight, std=in_proj_std)
472
+ nn.init.normal_(module.out_proj.weight, std=out_proj_std)
473
+ elif isinstance(module, QLIPMLP):
474
+ factor = self.config.initializer_factor
475
+ in_proj_std = (module.config.hidden_size**-0.5) * ((2 * module.config.num_hidden_layers) ** -0.5) * factor
476
+ fc_std = (2 * module.config.hidden_size) ** -0.5 * factor
477
+ nn.init.normal_(module.fc1.weight, std=fc_std)
478
+ nn.init.normal_(module.fc2.weight, std=in_proj_std)
479
+ elif isinstance(module, QLIPModel):
480
+ nn.init.normal_(
481
+ module.text_projection.weight,
482
+ std=module.text_embed_dim**-0.5 * self.config.initializer_factor,
483
+ )
484
+ nn.init.normal_(
485
+ module.visual_projection.weight,
486
+ std=module.vision_embed_dim**-0.5 * self.config.initializer_factor,
487
+ )
488
+ elif isinstance(module, QLIPVisionModelWithProjection):
489
+ nn.init.normal_(
490
+ module.visual_projection.weight,
491
+ std=self.config.hidden_size**-0.5 * self.config.initializer_factor,
492
+ )
493
+ elif isinstance(module, QLIPTextModelWithProjection):
494
+ nn.init.normal_(
495
+ module.text_projection.weight,
496
+ std=self.config.hidden_size**-0.5 * self.config.initializer_factor,
497
+ )
498
+
499
+ if isinstance(module, nn.LayerNorm):
500
+ module.bias.data.zero_()
501
+ module.weight.data.fill_(1.0)
502
+ if isinstance(module, nn.Linear) and module.bias is not None:
503
+ module.bias.data.zero_()
504
+
505
+
506
+ CLIP_START_DOCSTRING = r"""
507
+ This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
508
+ library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
509
+ etc.)
510
+
511
+ This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
512
+ Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage
513
+ and behavior.
514
+
515
+ Parameters:
516
+ config ([`CLIPConfig`]): Model configuration class with all the parameters of the model.
517
+ Initializing with a config file does not load the weights associated with the model, only the
518
+ configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
519
+ """
520
+
521
+ CLIP_TEXT_INPUTS_DOCSTRING = r"""
522
+ Args:
523
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
524
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
525
+ it.
526
+
527
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
528
+ [`PreTrainedTokenizer.__call__`] for details.
529
+
530
+ [What are input IDs?](../glossary#input-ids)
531
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
532
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
533
+
534
+ - 1 for tokens that are **not masked**,
535
+ - 0 for tokens that are **masked**.
536
+
537
+ [What are attention masks?](../glossary#attention-mask)
538
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
539
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
540
+ config.max_position_embeddings - 1]`.
541
+
542
+ [What are position IDs?](../glossary#position-ids)
543
+ output_attentions (`bool`, *optional*):
544
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
545
+ tensors for more detail.
546
+ output_hidden_states (`bool`, *optional*):
547
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
548
+ more detail.
549
+ return_dict (`bool`, *optional*):
550
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
551
+ """
552
+
553
+ CLIP_VISION_INPUTS_DOCSTRING = r"""
554
+ Args:
555
+ pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
556
+ Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using
557
+ [`AutoImageProcessor`]. See [`CLIPImageProcessor.__call__`] for details.
558
+ output_attentions (`bool`, *optional*):
559
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
560
+ tensors for more detail.
561
+ output_hidden_states (`bool`, *optional*):
562
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
563
+ more detail.
564
+ return_dict (`bool`, *optional*):
565
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
566
+ """
567
+
568
+ CLIP_INPUTS_DOCSTRING = r"""
569
+ Args:
570
+ input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
571
+ Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
572
+ it.
573
+
574
+ Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
575
+ [`PreTrainedTokenizer.__call__`] for details.
576
+
577
+ [What are input IDs?](../glossary#input-ids)
578
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
579
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
580
+
581
+ - 1 for tokens that are **not masked**,
582
+ - 0 for tokens that are **masked**.
583
+
584
+ [What are attention masks?](../glossary#attention-mask)
585
+ position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
586
+ Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
587
+ config.max_position_embeddings - 1]`.
588
+
589
+ [What are position IDs?](../glossary#position-ids)
590
+ pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)`):
591
+ Pixel values. Padding will be ignored by default should you provide it. Pixel values can be obtained using
592
+ [`AutoImageProcessor`]. See [`CLIPImageProcessor.__call__`] for details.
593
+ return_loss (`bool`, *optional*):
594
+ Whether or not to return the contrastive loss.
595
+ output_attentions (`bool`, *optional*):
596
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
597
+ tensors for more detail.
598
+ output_hidden_states (`bool`, *optional*):
599
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
600
+ more detail.
601
+ return_dict (`bool`, *optional*):
602
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
603
+ """
604
+
605
+
606
+ class QLIPEncoder(nn.Module):
607
+ """
608
+ Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a
609
+ [`CLIPEncoderLayer`].
610
+
611
+ Args:
612
+ config: CLIPConfig
613
+ """
614
+
615
+ def __init__(self, config: QLIPConfig, rope=None, rope_shift=1):
616
+ super().__init__()
617
+ self.config = config
618
+ self.layers = nn.ModuleList([
619
+ QLIPEncoderLayer(config, rope=rope, rope_shift=rope_shift)
620
+ for _ in range(config.num_hidden_layers)
621
+ ])
622
+ self.gradient_checkpointing = False
623
+
624
+ def forward(
625
+ self,
626
+ inputs_embeds,
627
+ attention_mask: Optional[torch.Tensor] = None,
628
+ causal_attention_mask: Optional[torch.Tensor] = None,
629
+ output_attentions: Optional[bool] = None,
630
+ output_hidden_states: Optional[bool] = None,
631
+ return_dict: Optional[bool] = None,
632
+ ) -> Union[Tuple, BaseModelOutput]:
633
+ r"""
634
+ Args:
635
+ inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
636
+ Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
637
+ This is useful if you want more control over how to convert `input_ids` indices into associated vectors
638
+ than the model's internal embedding lookup matrix.
639
+ attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
640
+ Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
641
+
642
+ - 1 for tokens that are **not masked**,
643
+ - 0 for tokens that are **masked**.
644
+
645
+ [What are attention masks?](../glossary#attention-mask)
646
+ causal_attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
647
+ Causal mask for the text model. Mask values selected in `[0, 1]`:
648
+
649
+ - 1 for tokens that are **not masked**,
650
+ - 0 for tokens that are **masked**.
651
+
652
+ [What are attention masks?](../glossary#attention-mask)
653
+ output_attentions (`bool`, *optional*):
654
+ Whether or not to return the attentions tensors of all attention layers. See `attentions` under
655
+ returned tensors for more detail.
656
+ output_hidden_states (`bool`, *optional*):
657
+ Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
658
+ for more detail.
659
+ return_dict (`bool`, *optional*):
660
+ Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
661
+ """
662
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
663
+ output_hidden_states = (
664
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
665
+ )
666
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
667
+
668
+ encoder_states = () if output_hidden_states else None
669
+ all_attentions = () if output_attentions else None
670
+
671
+ hidden_states = inputs_embeds
672
+ for idx, encoder_layer in enumerate(self.layers):
673
+ if output_hidden_states:
674
+ encoder_states = encoder_states + (hidden_states,)
675
+ if self.gradient_checkpointing and self.training:
676
+ layer_outputs = self._gradient_checkpointing_func(
677
+ encoder_layer.__call__,
678
+ hidden_states,
679
+ attention_mask,
680
+ causal_attention_mask,
681
+ output_attentions,
682
+ )
683
+ else:
684
+ layer_outputs = encoder_layer(
685
+ hidden_states,
686
+ attention_mask,
687
+ causal_attention_mask,
688
+ output_attentions=output_attentions,
689
+ )
690
+
691
+ hidden_states = layer_outputs[0]
692
+
693
+ if output_attentions:
694
+ all_attentions = all_attentions + (layer_outputs[1],)
695
+
696
+ if output_hidden_states:
697
+ encoder_states = encoder_states + (hidden_states,)
698
+
699
+ if not return_dict:
700
+ return tuple(v for v in [hidden_states, encoder_states, all_attentions] if v is not None)
701
+ return BaseModelOutput(
702
+ last_hidden_state=hidden_states, hidden_states=encoder_states, attentions=all_attentions
703
+ )
704
+
705
+
706
+ class QLIPTextTransformer(nn.Module):
707
+ def __init__(self, config: QLIPTextConfig):
708
+ super().__init__()
709
+ self.config = config
710
+ embed_dim = config.hidden_size
711
+ self.embeddings = QLIPTextEmbeddings(config)
712
+ self.encoder = QLIPEncoder(config)
713
+ self.final_layer_norm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)
714
+
715
+ # For `pooled_output` computation
716
+ self.eos_token_id = config.eos_token_id
717
+
718
+ @add_start_docstrings_to_model_forward(CLIP_TEXT_INPUTS_DOCSTRING)
719
+ @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=QLIPTextConfig)
720
+ def forward(
721
+ self,
722
+ input_ids: Optional[torch.Tensor] = None,
723
+ attention_mask: Optional[torch.Tensor] = None,
724
+ position_ids: Optional[torch.Tensor] = None,
725
+ output_attentions: Optional[bool] = None,
726
+ output_hidden_states: Optional[bool] = None,
727
+ return_dict: Optional[bool] = None,
728
+ ) -> Union[Tuple, BaseModelOutputWithPooling]:
729
+ r"""
730
+ Returns:
731
+
732
+ """
733
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
734
+ output_hidden_states = (
735
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
736
+ )
737
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
738
+
739
+ if input_ids is None:
740
+ raise ValueError("You have to specify input_ids")
741
+
742
+ input_shape = input_ids.size()
743
+ input_ids = input_ids.view(-1, input_shape[-1])
744
+
745
+ hidden_states = self.embeddings(input_ids=input_ids, position_ids=position_ids)
746
+
747
+ # CLIP's text model uses causal mask, prepare it here.
748
+ # https://github.com/openai/CLIP/blob/cfcffb90e69f37bf2ff1e988237a0fbe41f33c04/clip/model.py#L324
749
+ causal_attention_mask = _create_4d_causal_attention_mask(
750
+ input_shape, hidden_states.dtype, device=hidden_states.device
751
+ )
752
+ # expand attention_mask
753
+ if attention_mask is not None:
754
+ # [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
755
+ attention_mask = _prepare_4d_attention_mask(attention_mask, hidden_states.dtype)
756
+
757
+ encoder_outputs = self.encoder(
758
+ inputs_embeds=hidden_states,
759
+ attention_mask=attention_mask,
760
+ causal_attention_mask=causal_attention_mask,
761
+ output_attentions=output_attentions,
762
+ output_hidden_states=output_hidden_states,
763
+ return_dict=return_dict,
764
+ )
765
+
766
+ last_hidden_state = encoder_outputs[0]
767
+ last_hidden_state = self.final_layer_norm(last_hidden_state)
768
+
769
+ if self.eos_token_id == 2:
770
+ # The `eos_token_id` was incorrect before PR #24773: let's keep what has been done here.
771
+ # A CLIP model with such `eos_token_id` in the config can't work correctly with extra new tokens added
772
+ # ------------------------------------------------------------
773
+ # text_embeds.shape = [batch_size, sequence_length, transformer.width]
774
+ # take features from the eot embedding (eot_token is the highest number in each sequence)
775
+ # casting to torch.int for onnx compatibility: argmax doesn't support int64 inputs with opset 14
776
+ pooled_output = last_hidden_state[
777
+ torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device),
778
+ input_ids.to(dtype=torch.int, device=last_hidden_state.device).argmax(dim=-1),
779
+ ]
780
+ else:
781
+ # The config gets updated `eos_token_id` from PR #24773 (so the use of extra new tokens is possible)
782
+ pooled_output = last_hidden_state[
783
+ torch.arange(last_hidden_state.shape[0], device=last_hidden_state.device),
784
+ # We need to get the first position of the `eos_token_id` value (`pad_token_id` might be equal to `eos_token_id`)
785
+ (input_ids.to(dtype=torch.int, device=last_hidden_state.device) == self.eos_token_id)
786
+ .int()
787
+ .argmax(dim=-1),
788
+ ]
789
+
790
+ if not return_dict:
791
+ return (last_hidden_state, pooled_output) + encoder_outputs[1:]
792
+
793
+ return BaseModelOutputWithPooling(
794
+ last_hidden_state=last_hidden_state,
795
+ pooler_output=pooled_output,
796
+ hidden_states=encoder_outputs.hidden_states,
797
+ attentions=encoder_outputs.attentions,
798
+ )
799
+
800
+
801
+ @add_start_docstrings(
802
+ """The text model from CLIP without any head or projection on top.""",
803
+ CLIP_START_DOCSTRING,
804
+ )
805
+ class QLIPTextModel(QLIPPreTrainedModel):
806
+ config_class = QLIPTextConfig
807
+
808
+ _no_split_modules = ["QLIPTextEmbeddings", "QLIPEncoderLayer"]
809
+
810
+ def __init__(self, config: QLIPTextConfig):
811
+ super().__init__(config)
812
+ self.text_model = QLIPTextTransformer(config)
813
+ # Initialize weights and apply final processing
814
+ self.post_init()
815
+
816
+ def get_input_embeddings(self) -> nn.Module:
817
+ return self.text_model.embeddings.token_embedding
818
+
819
+ def set_input_embeddings(self, value):
820
+ self.text_model.embeddings.token_embedding = value
821
+
822
+ @add_start_docstrings_to_model_forward(CLIP_TEXT_INPUTS_DOCSTRING)
823
+ @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=QLIPTextConfig)
824
+ def forward(
825
+ self,
826
+ input_ids: Optional[torch.Tensor] = None,
827
+ attention_mask: Optional[torch.Tensor] = None,
828
+ position_ids: Optional[torch.Tensor] = None,
829
+ output_attentions: Optional[bool] = None,
830
+ output_hidden_states: Optional[bool] = None,
831
+ return_dict: Optional[bool] = None,
832
+ ) -> Union[Tuple, BaseModelOutputWithPooling]:
833
+ r"""
834
+ Returns:
835
+
836
+ Examples:
837
+
838
+ ```python
839
+ >>> from transformers import AutoTokenizer, CLIPTextModel
840
+
841
+ >>> model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
842
+ >>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
843
+
844
+ >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
845
+
846
+ >>> outputs = model(**inputs)
847
+ >>> last_hidden_state = outputs.last_hidden_state
848
+ >>> pooled_output = outputs.pooler_output # pooled (EOS token) states
849
+ ```"""
850
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
851
+
852
+ return self.text_model(
853
+ input_ids=input_ids,
854
+ attention_mask=attention_mask,
855
+ position_ids=position_ids,
856
+ output_attentions=output_attentions,
857
+ output_hidden_states=output_hidden_states,
858
+ return_dict=return_dict,
859
+ )
860
+
861
+
862
+ class QLIPVisionTransformer(nn.Module):
863
+ def __init__(self, config: QLIPVisionConfig):
864
+ super().__init__()
865
+ self.config = config
866
+ embed_dim = config.hidden_size
867
+
868
+ self.embeddings = QLIPVisionEmbeddings(config)
869
+ if config.rope:
870
+ half_head_dim = config.hidden_size // config.num_attention_heads // 2
871
+ hw_seq_len = config.image_size // config.patch_size
872
+ self.rope = VisionRotaryEmbeddingFast(
873
+ dim=half_head_dim,
874
+ pt_seq_len=16,
875
+ ft_seq_len=hw_seq_len,
876
+ )
877
+ else:
878
+ self.rope = None
879
+ self.encoder = QLIPEncoder(config, rope=self.rope, rope_shift=1)
880
+ self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)
881
+
882
+ if config.quantizer == "bsq":
883
+ self.quantizer = BinarySphericalQuantizer(**config.quantizer_cfg)
884
+ self.quantizer_l2_norm = config.quantizer_l2_norm
885
+ if config.quantizer_embed_type == "mlp":
886
+ self.quant_embed = nn.Sequential(
887
+ OrderedDict(
888
+ [
889
+ ("c_fc", nn.Linear(config.hidden_size, config.hidden_size)),
890
+ ("gelu", nn.GELU()),
891
+ ("c_proj", nn.Linear(config.hidden_size, config.quantizer_cfg["embed_dim"])),
892
+ ]
893
+ )
894
+ )
895
+ self.quant_embed_post = nn.Sequential(
896
+ OrderedDict(
897
+ [
898
+ ("c_fc", nn.Linear(config.quantizer_cfg["embed_dim"], config.hidden_size_post_q)),
899
+ ("gelu", nn.GELU()),
900
+ ("c_proj", nn.Linear(config.hidden_size_post_q, config.hidden_size_post_q)),
901
+ ]
902
+ )
903
+ )
904
+ else:
905
+ self.quant_embed = nn.Identity()
906
+ self.quant_embed_post = nn.Identity()
907
+
908
+ @add_start_docstrings_to_model_forward(CLIP_VISION_INPUTS_DOCSTRING)
909
+ @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=QLIPVisionConfig)
910
+ def forward(
911
+ self,
912
+ pixel_values: Optional[torch.FloatTensor] = None,
913
+ output_attentions: Optional[bool] = None,
914
+ output_hidden_states: Optional[bool] = None,
915
+ return_dict: Optional[bool] = None,
916
+ ) -> Union[Tuple, BaseModelOutputWithPooling]:
917
+ r"""
918
+ Returns:
919
+
920
+ """
921
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
922
+ output_hidden_states = (
923
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
924
+ )
925
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
926
+
927
+ if pixel_values is None:
928
+ raise ValueError("You have to specify pixel_values")
929
+
930
+ hidden_states = self.embeddings(pixel_values)
931
+
932
+ encoder_outputs = self.encoder(
933
+ inputs_embeds=hidden_states,
934
+ output_attentions=output_attentions,
935
+ output_hidden_states=output_hidden_states,
936
+ return_dict=return_dict,
937
+ )
938
+
939
+ last_hidden_state = encoder_outputs[0]
940
+ pooled_output = last_hidden_state[:, 0, :]
941
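+ # patch tokens (class token excluded) are projected into the quantizer space, optionally
+ # L2-normalized onto the unit sphere, binarized by BSQ, and projected back for downstream use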
+ z = last_hidden_state[:, 1:, :]
942
+ h = self.quant_embed(z)
943
+ if self.quantizer_l2_norm:
944
+ h = F.normalize(h, dim=-1)
945
+ if self.quantizer is not None:
946
+ quant, _, _ = self.quantizer(h)
947
+ else:
948
+ quant = h
949
+ zhat = self.quant_embed_post(quant)
950
+ last_hidden_state = zhat
951
+ pooled_output = self.post_layernorm(pooled_output)
952
+
953
+ if not return_dict:
954
+ return (last_hidden_state, pooled_output) + encoder_outputs[1:]
955
+
956
+ return BaseModelOutputWithPooling(
957
+ last_hidden_state=last_hidden_state,
958
+ pooler_output=pooled_output,
959
+ hidden_states=encoder_outputs.hidden_states,
960
+ attentions=encoder_outputs.attentions,
961
+ )
962
+
963
+
964
+ class QLIPVisionTransformerDecoder(nn.Module):
965
+ def __init__(self, config: QLIPDecoderConfig):
966
+ super().__init__()
967
+ self.config = config
968
+ embed_dim = config.hidden_size
969
+
970
+ num_patches = (config.image_size // config.patch_size) ** 2
971
+ self.patch_shape = (config.image_size // config.patch_size, config.image_size // config.patch_size)
972
+ self.position_embeddings = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
973
+ if config.rope:
974
+ half_head_dim = config.hidden_size // config.num_attention_heads // 2
975
+ hw_seq_len = config.image_size // config.patch_size
976
+ self.rope = VisionRotaryEmbeddingFast(
977
+ dim=half_head_dim,
978
+ pt_seq_len=16,
979
+ ft_seq_len=hw_seq_len,
980
+ )
981
+ else:
982
+ self.rope = None
983
+ self.norm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
984
+ self.encoder = QLIPEncoder(config, rope=self.rope, rope_shift=0)
985
+ self.ffn = nn.Sequential(
986
+ nn.Linear(config.hidden_size, config.intermediate_size),
987
+ nn.Tanh(),
988
+ )
989
+ self.conv_out = nn.Linear(
990
+ in_features=config.intermediate_size,
991
+ out_features=3 * config.patch_size * config.patch_size,
992
+ )
993
+
994
+ @add_start_docstrings_to_model_forward(CLIP_VISION_INPUTS_DOCSTRING)
995
+ @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=QLIPVisionConfig)
996
+ def forward(
997
+ self,
998
+ latents: Optional[torch.FloatTensor] = None,
999
+ output_attentions: Optional[bool] = None,
1000
+ output_hidden_states: Optional[bool] = None,
1001
+ return_dict: Optional[bool] = None,
1002
+ ) -> Union[Tuple, BaseModelOutputWithPooling]:
1003
+ r"""
1004
+ Returns:
1005
+
1006
+ """
1007
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1008
+ output_hidden_states = (
1009
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1010
+ )
1011
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1012
+
1013
+ if latents is None:
1014
+ raise ValueError("You have to specify latents")
1015
+
1016
+ hidden_states = self.position_embeddings + latents
1017
+
1018
+ decoder_outputs = self.encoder(
1019
+ inputs_embeds=hidden_states,
1020
+ output_attentions=output_attentions,
1021
+ output_hidden_states=output_hidden_states,
1022
+ return_dict=return_dict,
1023
+ )
1024
+
1025
+ last_hidden_state = decoder_outputs[0]
1026
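+ # each decoded token predicts a (3, patch_size, patch_size) pixel patch; the rearrange
+ # below stitches the patch grid back into a full-resolution image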
+ recon = self.conv_out(self.ffn(self.norm(last_hidden_state)))
1027
+ recon_reshaped = rearrange(
1028
+ recon, "b (hh ww) (c sh sw) -> b c (hh sh) (ww sw)",
1029
+ hh=self.patch_shape[0], ww=self.patch_shape[1],
1030
+ sh=self.config.patch_size, sw=self.config.patch_size,
1031
+ )
1032
+ return recon_reshaped
1033
+
1034
+
1035
+ @add_start_docstrings(
1036
+ """The vision model from CLIP without any head or projection on top.""",
1037
+ CLIP_START_DOCSTRING,
1038
+ )
1039
+ class QLIPVisionModel(QLIPPreTrainedModel):
1040
+ config_class = QLIPVisionConfig
1041
+ main_input_name = "pixel_values"
1042
+ _no_split_modules = ["QLIPEncoderLayer"]
1043
+
1044
+ def __init__(self, config: QLIPVisionConfig):
1045
+ super().__init__(config)
1046
+ self.vision_model = QLIPVisionTransformer(config)
1047
+ # Initialize weights and apply final processing
1048
+ self.post_init()
1049
+
1050
+ def get_input_embeddings(self) -> nn.Module:
1051
+ return self.vision_model.embeddings.patch_embedding
1052
+
1053
+ @add_start_docstrings_to_model_forward(CLIP_VISION_INPUTS_DOCSTRING)
1054
+ @replace_return_docstrings(output_type=BaseModelOutputWithPooling, config_class=QLIPVisionConfig)
1055
+ def forward(
1056
+ self,
1057
+ pixel_values: Optional[torch.FloatTensor] = None,
1058
+ output_attentions: Optional[bool] = None,
1059
+ output_hidden_states: Optional[bool] = None,
1060
+ return_dict: Optional[bool] = None,
1061
+ ) -> Union[Tuple, BaseModelOutputWithPooling]:
1062
+ r"""
1063
+ Returns:
1064
+
1065
+ Examples:
1066
+
1067
+ ```python
1068
+ >>> from PIL import Image
1069
+ >>> import requests
1070
+ >>> from transformers import AutoProcessor, CLIPVisionModel
1071
+
1072
+ >>> model = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
1073
+ >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
1074
+
1075
+ >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
1076
+ >>> image = Image.open(requests.get(url, stream=True).raw)
1077
+
1078
+ >>> inputs = processor(images=image, return_tensors="pt")
1079
+
1080
+ >>> outputs = model(**inputs)
1081
+ >>> last_hidden_state = outputs.last_hidden_state
1082
+ >>> pooled_output = outputs.pooler_output # pooled CLS states
1083
+ ```"""
1084
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1085
+
1086
+ return self.vision_model(
1087
+ pixel_values=pixel_values,
1088
+ output_attentions=output_attentions,
1089
+ output_hidden_states=output_hidden_states,
1090
+ return_dict=return_dict,
1091
+ )
1092
+
1093
+
1094
+ @add_start_docstrings(CLIP_START_DOCSTRING)
1095
+ class QLIPModel(QLIPPreTrainedModel):
1096
+ config_class = QLIPConfig
1097
+
1098
+ def __init__(self, config: QLIPConfig):
1099
+ super().__init__(config)
1100
+
1101
+ if not isinstance(config.text_config, QLIPTextConfig):
1102
+ raise ValueError(
1103
+ "config.text_config is expected to be of type QLIPTextConfig but is of type"
1104
+ f" {type(config.text_config)}."
1105
+ )
1106
+
1107
+ if not isinstance(config.vision_config, QLIPVisionConfig):
1108
+ raise ValueError(
1109
+ "config.vision_config is expected to be of type QLIPVisionConfig but is of type"
1110
+ f" {type(config.vision_config)}."
1111
+ )
1112
+
1113
+ text_config = config.text_config
1114
+ vision_config = config.vision_config
1115
+ decoder_config = config.decoder_config
1116
+
1117
+ self.projection_dim = config.projection_dim
1118
+ self.text_embed_dim = text_config.hidden_size
1119
+ self.vision_embed_dim = vision_config.hidden_size
1120
+
1121
+ self.text_model = QLIPTextTransformer(text_config)
1122
+ self.vision_model = QLIPVisionTransformer(vision_config)
1123
+ self.vision_decoder = QLIPVisionTransformerDecoder(decoder_config)
1124
+
1125
+ self.visual_projection = nn.Linear(self.vision_embed_dim, self.projection_dim, bias=config.vision_projection_bias)
1126
+ self.text_projection = nn.Linear(self.text_embed_dim, self.projection_dim, bias=config.text_projection_bias)
1127
+ self.logit_scale = nn.Parameter(torch.tensor(self.config.logit_scale_init_value))
1128
+
1129
+ # Initialize weights and apply final processing
1130
+ self.post_init()
1131
+
1132
+ @add_start_docstrings_to_model_forward(CLIP_TEXT_INPUTS_DOCSTRING)
1133
+ def get_text_features(
1134
+ self,
1135
+ input_ids: Optional[torch.Tensor] = None,
1136
+ attention_mask: Optional[torch.Tensor] = None,
1137
+ position_ids: Optional[torch.Tensor] = None,
1138
+ output_attentions: Optional[bool] = None,
1139
+ output_hidden_states: Optional[bool] = None,
1140
+ return_dict: Optional[bool] = None,
1141
+ ) -> torch.FloatTensor:
1142
+ r"""
1143
+ Returns:
1144
+ text_features (`torch.FloatTensor` of shape `(batch_size, output_dim)`): The text embeddings obtained by
1145
+ applying the projection layer to the pooled output of [`CLIPTextModel`].
1146
+
1147
+ Examples:
1148
+
1149
+ ```python
1150
+ >>> from transformers import AutoTokenizer, CLIPModel
1151
+
1152
+ >>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
1153
+ >>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
1154
+
1155
+ >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
1156
+ >>> text_features = model.get_text_features(**inputs)
1157
+ ```"""
1158
+ # Use CLIP model's config for some fields (if specified) instead of those of vision & text components.
1159
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1160
+ output_hidden_states = (
1161
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1162
+ )
1163
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1164
+
1165
+ text_outputs = self.text_model(
1166
+ input_ids=input_ids,
1167
+ attention_mask=attention_mask,
1168
+ position_ids=position_ids,
1169
+ output_attentions=output_attentions,
1170
+ output_hidden_states=output_hidden_states,
1171
+ return_dict=return_dict,
1172
+ )
1173
+
1174
+ pooled_output = text_outputs[1]
1175
+ text_features = self.text_projection(pooled_output)
1176
+
1177
+ return text_features
1178
+
1179
+ @add_start_docstrings_to_model_forward(CLIP_VISION_INPUTS_DOCSTRING)
1180
+ def get_image_features(
1181
+ self,
1182
+ pixel_values: Optional[torch.FloatTensor] = None,
1183
+ output_attentions: Optional[bool] = None,
1184
+ output_hidden_states: Optional[bool] = None,
1185
+ return_dict: Optional[bool] = None,
1186
+ ) -> torch.FloatTensor:
1187
+ r"""
1188
+ Returns:
1189
+ image_features (`torch.FloatTensor` of shape `(batch_size, output_dim)`): The image embeddings obtained by
1190
+ applying the projection layer to the pooled output of [`CLIPVisionModel`].
1191
+
1192
+ Examples:
1193
+
1194
+ ```python
1195
+ >>> from PIL import Image
1196
+ >>> import requests
1197
+ >>> from transformers import AutoProcessor, CLIPModel
1198
+
1199
+ >>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
1200
+ >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
1201
+
1202
+ >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
1203
+ >>> image = Image.open(requests.get(url, stream=True).raw)
1204
+
1205
+ >>> inputs = processor(images=image, return_tensors="pt")
1206
+
1207
+ >>> image_features = model.get_image_features(**inputs)
1208
+ ```"""
1209
+ # Use CLIP model's config for some fields (if specified) instead of those of vision & text components.
1210
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1211
+ output_hidden_states = (
1212
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1213
+ )
1214
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1215
+
1216
+ vision_outputs = self.vision_model(
1217
+ pixel_values=pixel_values,
1218
+ output_attentions=output_attentions,
1219
+ output_hidden_states=output_hidden_states,
1220
+ return_dict=return_dict,
1221
+ )
1222
+
1223
+ pooled_output = vision_outputs[1] # pooled_output
1224
+ image_features = self.visual_projection(pooled_output)
1225
+
1226
+ return image_features
1227
+
1228
+ @add_start_docstrings_to_model_forward(CLIP_INPUTS_DOCSTRING)
1229
+ @replace_return_docstrings(output_type=QLIPOutput, config_class=QLIPConfig)
1230
+ def forward(
1231
+ self,
1232
+ input_ids: Optional[torch.LongTensor] = None,
1233
+ pixel_values: Optional[torch.FloatTensor] = None,
1234
+ attention_mask: Optional[torch.Tensor] = None,
1235
+ position_ids: Optional[torch.LongTensor] = None,
1236
+ return_loss: Optional[bool] = None,
1237
+ output_attentions: Optional[bool] = None,
1238
+ output_hidden_states: Optional[bool] = None,
1239
+ return_dict: Optional[bool] = None,
1240
+ ) -> Union[Tuple, QLIPOutput]:
1241
+ r"""
1242
+ Returns:
1243
+
1244
+ Examples:
1245
+
1246
+ ```python
1247
+ >>> from PIL import Image
1248
+ >>> import requests
1249
+ >>> from transformers import AutoProcessor, CLIPModel
1250
+
1251
+ >>> model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
1252
+ >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
1253
+
1254
+ >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
1255
+ >>> image = Image.open(requests.get(url, stream=True).raw)
1256
+
1257
+ >>> inputs = processor(
1258
+ ... text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True
1259
+ ... )
1260
+
1261
+ >>> outputs = model(**inputs)
1262
+ >>> logits_per_image = outputs.logits_per_image # this is the image-text similarity score
1263
+ >>> probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
1264
+ ```"""
1265
+ # Use CLIP model's config for some fields (if specified) instead of those of vision & text components.
1266
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
1267
+ output_hidden_states = (
1268
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
1269
+ )
1270
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1271
+
1272
+ vision_outputs = self.vision_model(
1273
+ pixel_values=pixel_values,
1274
+ output_attentions=output_attentions,
1275
+ output_hidden_states=output_hidden_states,
1276
+ return_dict=return_dict,
1277
+ )
1278
+
1279
+ text_outputs = self.text_model(
1280
+ input_ids=input_ids,
1281
+ attention_mask=attention_mask,
1282
+ position_ids=position_ids,
1283
+ output_attentions=output_attentions,
1284
+ output_hidden_states=output_hidden_states,
1285
+ return_dict=return_dict,
1286
+ )
1287
+
1288
+ image_embeds = vision_outputs[1]
1289
+ image_embeds = self.visual_projection(image_embeds)
1290
+
1291
+ text_embeds = text_outputs[1]
1292
+ text_embeds = self.text_projection(text_embeds)
1293
+
1294
+ last_hidden_state = vision_outputs[0]
1295
+ recon = self.vision_decoder(last_hidden_state)
1296
+
1297
+ # normalized features
1298
+ image_embeds = image_embeds / image_embeds.norm(p=2, dim=-1, keepdim=True)
1299
+ text_embeds = text_embeds / text_embeds.norm(p=2, dim=-1, keepdim=True)
1300
+
1301
+ # cosine similarity as logits
1302
+ logit_scale = self.logit_scale.exp()
1303
+ logits_per_text = torch.matmul(text_embeds, image_embeds.t()) * logit_scale
1304
+ logits_per_image = logits_per_text.t()
1305
+
1306
+ loss = None
1307
+ if return_loss:
1308
+ loss = clip_loss(logits_per_text)
1309
+
1310
+ if not return_dict:
1311
+ output = (logits_per_image, logits_per_text, text_embeds, image_embeds, text_outputs, vision_outputs)
1312
+ return ((loss,) + output) if loss is not None else output
1313
+
1314
+ return QLIPOutput(
1315
+ loss=loss,
1316
+ logits_per_image=logits_per_image,
1317
+ logits_per_text=logits_per_text,
1318
+ text_embeds=text_embeds,
1319
+ image_embeds=image_embeds,
1320
+ text_model_output=text_outputs,
1321
+ vision_model_output=vision_outputs,
1322
+ reconstructions=recon,
1323
+ )
1324
+
1325
+
1326
+ @add_start_docstrings(
1327
+ """
1328
+ CLIP Text Model with a projection layer on top (a linear layer on top of the pooled output).
1329
+ """,
1330
+ CLIP_START_DOCSTRING,
1331
+ )
1332
+ class QLIPTextModelWithProjection(QLIPPreTrainedModel):
1333
+ config_class = QLIPTextConfig
1334
+
1335
+ _no_split_modules = ["QLIPTextEmbeddings", "QLIPEncoderLayer"]
1336
+
1337
+ def __init__(self, config: QLIPTextConfig):
1338
+ super().__init__(config)
1339
+
1340
+ self.text_model = QLIPTextTransformer(config)
1341
+
1342
+ self.text_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False)
1343
+
1344
+ # Initialize weights and apply final processing
1345
+ self.post_init()
1346
+
1347
+ def get_input_embeddings(self) -> nn.Module:
1348
+ return self.text_model.embeddings.token_embedding
1349
+
1350
+ def set_input_embeddings(self, value):
1351
+ self.text_model.embeddings.token_embedding = value
1352
+
1353
+ @add_start_docstrings_to_model_forward(CLIP_TEXT_INPUTS_DOCSTRING)
1354
+ @replace_return_docstrings(output_type=QLIPTextModelOutput, config_class=QLIPTextConfig)
1355
+ def forward(
1356
+ self,
1357
+ input_ids: Optional[torch.Tensor] = None,
1358
+ attention_mask: Optional[torch.Tensor] = None,
1359
+ position_ids: Optional[torch.Tensor] = None,
1360
+ output_attentions: Optional[bool] = None,
1361
+ output_hidden_states: Optional[bool] = None,
1362
+ return_dict: Optional[bool] = None,
1363
+ ) -> Union[Tuple, QLIPTextModelOutput]:
1364
+ r"""
1365
+ Returns:
1366
+
1367
+ Examples:
1368
+
1369
+ ```python
1370
+ >>> from transformers import AutoTokenizer, CLIPTextModelWithProjection
1371
+
1372
+ >>> model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
1373
+ >>> tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
1374
+
1375
+ >>> inputs = tokenizer(["a photo of a cat", "a photo of a dog"], padding=True, return_tensors="pt")
1376
+
1377
+ >>> outputs = model(**inputs)
1378
+ >>> text_embeds = outputs.text_embeds
1379
+ ```"""
1380
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1381
+
1382
+ text_outputs = self.text_model(
1383
+ input_ids=input_ids,
1384
+ attention_mask=attention_mask,
1385
+ position_ids=position_ids,
1386
+ output_attentions=output_attentions,
1387
+ output_hidden_states=output_hidden_states,
1388
+ return_dict=return_dict,
1389
+ )
1390
+
1391
+ pooled_output = text_outputs[1]
1392
+
1393
+ text_embeds = self.text_projection(pooled_output)
1394
+
1395
+ if not return_dict:
1396
+ outputs = (text_embeds, text_outputs[0]) + text_outputs[2:]
1397
+ return tuple(output for output in outputs if output is not None)
1398
+
1399
+ return QLIPTextModelOutput(
1400
+ text_embeds=text_embeds,
1401
+ last_hidden_state=text_outputs.last_hidden_state,
1402
+ hidden_states=text_outputs.hidden_states,
1403
+ attentions=text_outputs.attentions,
1404
+ )
1405
+
1406
+
1407
+ @add_start_docstrings(
1408
+ """
1409
+ CLIP Vision Model with a projection layer on top (a linear layer on top of the pooled output).
1410
+ """,
1411
+ CLIP_START_DOCSTRING,
1412
+ )
1413
+ class QLIPVisionModelWithProjection(QLIPPreTrainedModel):
1414
+ config_class = QLIPVisionConfig
1415
+ main_input_name = "pixel_values"
1416
+
1417
+ def __init__(self, config: QLIPVisionConfig):
1418
+ super().__init__(config)
1419
+
1420
+ self.vision_model = QLIPVisionTransformer(config)
1421
+
1422
+ self.visual_projection = nn.Linear(config.hidden_size, config.projection_dim, bias=False)
1423
+
1424
+ # Initialize weights and apply final processing
1425
+ self.post_init()
1426
+
1427
+ def get_input_embeddings(self) -> nn.Module:
1428
+ return self.vision_model.embeddings.patch_embedding
1429
+
1430
+ @add_start_docstrings_to_model_forward(CLIP_VISION_INPUTS_DOCSTRING)
1431
+ @replace_return_docstrings(output_type=QLIPVisionModelOutput, config_class=QLIPVisionConfig)
1432
+ def forward(
1433
+ self,
1434
+ pixel_values: Optional[torch.FloatTensor] = None,
1435
+ output_attentions: Optional[bool] = None,
1436
+ output_hidden_states: Optional[bool] = None,
1437
+ return_dict: Optional[bool] = None,
1438
+ ) -> Union[Tuple, QLIPVisionModelOutput]:
1439
+ r"""
1440
+ Returns:
1441
+
1442
+ Examples:
1443
+
1444
+ ```python
1445
+ >>> from PIL import Image
1446
+ >>> import requests
1447
+ >>> from transformers import AutoProcessor, CLIPVisionModelWithProjection
1448
+
1449
+ >>> model = CLIPVisionModelWithProjection.from_pretrained("openai/clip-vit-base-patch32")
1450
+ >>> processor = AutoProcessor.from_pretrained("openai/clip-vit-base-patch32")
1451
+
1452
+ >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
1453
+ >>> image = Image.open(requests.get(url, stream=True).raw)
1454
+
1455
+ >>> inputs = processor(images=image, return_tensors="pt")
1456
+
1457
+ >>> outputs = model(**inputs)
1458
+ >>> image_embeds = outputs.image_embeds
1459
+ ```"""
1460
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1461
+
1462
+ vision_outputs = self.vision_model(
1463
+ pixel_values=pixel_values,
1464
+ output_attentions=output_attentions,
1465
+ output_hidden_states=output_hidden_states,
1466
+ return_dict=return_dict,
1467
+ )
1468
+
1469
+ pooled_output = vision_outputs[1] # pooled_output
1470
+
1471
+ image_embeds = self.visual_projection(pooled_output)
1472
+
1473
+ if not return_dict:
1474
+ outputs = (image_embeds, vision_outputs[0]) + vision_outputs[2:]
1475
+ return tuple(output for output in outputs if output is not None)
1476
+
1477
+ return QLIPVisionModelOutput(
1478
+ image_embeds=image_embeds,
1479
+ last_hidden_state=vision_outputs.last_hidden_state,
1480
+ hidden_states=vision_outputs.hidden_states,
1481
+ attentions=vision_outputs.attentions,
1482
+ )
preprocessor_config.json ADDED
@@ -0,0 +1,19 @@
1
+ {
2
+ "crop_size": 392,
3
+ "do_center_crop": true,
4
+ "do_normalize": true,
5
+ "do_resize": true,
6
+ "feature_extractor_type": "CLIPFeatureExtractor",
7
+ "image_mean": [
8
+ 0.48145466,
9
+ 0.4578275,
10
+ 0.40821073
11
+ ],
12
+ "image_std": [
13
+ 0.26862954,
14
+ 0.26130258,
15
+ 0.27577711
16
+ ],
17
+ "resample": 3,
18
+ "size": 392
19
+ }
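The config above amounts to: resize so the short side is 392 (bicubic, `resample: 3`), center-crop to 392×392, and normalize with the CLIP mean/std. A rough torchvision equivalent is sketched below; the exact `CLIPFeatureExtractor` resize semantics may differ slightly.

```python
# Approximate torchvision equivalent of preprocessor_config.json (sketch, not exact).
from torchvision import transforms

qlip_preprocess = transforms.Compose([
    transforms.Resize(392, interpolation=transforms.InterpolationMode.BICUBIC),  # resample=3 -> bicubic
    transforms.CenterCrop(392),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.48145466, 0.4578275, 0.40821073],
                         std=[0.26862954, 0.26130258, 0.27577711]),
])
```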
rope.py ADDED
@@ -0,0 +1,118 @@
1
+ # Copyright (c) 2024, NVIDIA Corporation & Affiliates. All rights reserved.
2
+ #
3
+ # This work is made available under the Nvidia Source Code License-NC.
4
+ # To view a copy of this license, visit
5
+ # https://github.com/NVlabs/QLIP/blob/main/LICENSE
6
+
7
+ # MIT License
8
+
9
+ # Copyright (c) 2022 BAAI-Vision
10
+
11
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
12
+ # of this software and associated documentation files (the "Software"), to deal
13
+ # in the Software without restriction, including without limitation the rights
14
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
15
+ # copies of the Software, and to permit persons to whom the Software is
16
+ # furnished to do so, subject to the following conditions:
17
+
18
+ # The above copyright notice and this permission notice shall be included in all
19
+ # copies or substantial portions of the Software.
20
+
21
+ # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
22
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
23
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
24
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
25
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
26
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
27
+ # SOFTWARE.
28
+
29
+
30
+ from math import pi
31
+ import torch
32
+ from torch import nn
33
+ from einops import rearrange, repeat
34
+ import logging
35
+
36
+
37
+ def broadcat(tensors, dim = -1):
38
+ num_tensors = len(tensors)
39
+ shape_lens = set(list(map(lambda t: len(t.shape), tensors)))
40
+ assert len(shape_lens) == 1, 'tensors must all have the same number of dimensions'
41
+ shape_len = list(shape_lens)[0]
42
+ dim = (dim + shape_len) if dim < 0 else dim
43
+ dims = list(zip(*map(lambda t: list(t.shape), tensors)))
44
+ expandable_dims = [(i, val) for i, val in enumerate(dims) if i != dim]
45
+ assert all([*map(lambda t: len(set(t[1])) <= 2, expandable_dims)]), 'invalid dimensions for broadcastable concatentation'
46
+ max_dims = list(map(lambda t: (t[0], max(t[1])), expandable_dims))
47
+ expanded_dims = list(map(lambda t: (t[0], (t[1],) * num_tensors), max_dims))
48
+ expanded_dims.insert(dim, (dim, dims[dim]))
49
+ expandable_shapes = list(zip(*map(lambda t: t[1], expanded_dims)))
50
+ tensors = list(map(lambda t: t[0].expand(*t[1]), zip(tensors, expandable_shapes)))
51
+ return torch.cat(tensors, dim = dim)
52
+
53
+ def rotate_half(x):
54
+ x = rearrange(x, '... (d r) -> ... d r', r = 2)
55
+ x1, x2 = x.unbind(dim = -1)
56
+ x = torch.stack((-x2, x1), dim = -1)
57
+ return rearrange(x, '... d r -> ... (d r)')
58
+
59
+
60
+ class VisionRotaryEmbeddingFast(nn.Module):
61
+ def __init__(
62
+ self,
63
+ dim,
64
+ pt_seq_len,
65
+ ft_seq_len=None,
66
+ custom_freqs = None,
67
+ freqs_for = 'lang',
68
+ theta = 10000,
69
+ max_freq = 10,
70
+ num_freqs = 1,
71
+ patch_dropout = 0.
72
+ ):
73
+ super().__init__()
74
+ if custom_freqs:
75
+ freqs = custom_freqs
76
+ elif freqs_for == 'lang':
77
+ freqs = 1. / (theta ** (torch.arange(0, dim, 2)[:(dim // 2)].float() / dim))
78
+ elif freqs_for == 'pixel':
79
+ freqs = torch.linspace(1., max_freq / 2, dim // 2) * pi
80
+ elif freqs_for == 'constant':
81
+ freqs = torch.ones(num_freqs).float()
82
+ else:
83
+ raise ValueError(f'unknown modality {freqs_for}')
84
+
85
+ if ft_seq_len is None: ft_seq_len = pt_seq_len
86
+ t = torch.arange(ft_seq_len) / ft_seq_len * pt_seq_len
87
+
88
+ freqs = torch.einsum('..., f -> ... f', t, freqs)
89
+ freqs = repeat(freqs, '... n -> ... (n r)', r = 2)
90
+ freqs = broadcat((freqs[:, None, :], freqs[None, :, :]), dim = -1)
91
+
92
+ freqs_cos = freqs.cos().view(-1, freqs.shape[-1])
93
+ freqs_sin = freqs.sin().view(-1, freqs.shape[-1])
94
+
95
+ self.patch_dropout = patch_dropout
96
+
97
+ self.register_buffer("freqs_cos", freqs_cos)
98
+ self.register_buffer("freqs_sin", freqs_sin)
99
+
100
+ logging.info(f'Shape of rope freq: {self.freqs_cos.shape}')
101
+
102
+ def forward(self, t, patch_indices_keep=None):
103
+ if patch_indices_keep is not None:
104
+ batch = t.size()[0]
105
+ batch_indices = torch.arange(batch)
106
+ batch_indices = batch_indices[..., None]
107
+
108
+ freqs_cos = repeat(self.freqs_cos, 'i j -> n i m j', n=t.shape[0], m=t.shape[1])
109
+ freqs_sin = repeat(self.freqs_sin, 'i j -> n i m j', n=t.shape[0], m=t.shape[1])
110
+
111
+ freqs_cos = freqs_cos[batch_indices, patch_indices_keep]
112
+ freqs_cos = rearrange(freqs_cos, 'n i m j -> n m i j')
113
+ freqs_sin = freqs_sin[batch_indices, patch_indices_keep]
114
+ freqs_sin = rearrange(freqs_sin, 'n i m j -> n m i j')
115
+
116
+ return t * freqs_cos + rotate_half(t) * freqs_sin
117
+
118
+ return t * self.freqs_cos + rotate_half(t) * self.freqs_sin
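`VisionRotaryEmbeddingFast` precomputes 2D rotary tables of shape `(ft_seq_len**2, head_dim)` and applies them as `t * cos + rotate_half(t) * sin`. A quick shape check follows, assuming a hypothetical `head_dim=64`, a 28×28 patch grid (392/14), and that `rope.py` is importable.

```python
# Shape sanity check for the 2D rotary embedding (hypothetical sizes).
import torch
from rope import VisionRotaryEmbeddingFast  # assumes rope.py is on the path

head_dim, grid = 64, 28                         # e.g. 392 // 14 = 28 patches per side
rope = VisionRotaryEmbeddingFast(dim=head_dim // 2, pt_seq_len=16, ft_seq_len=grid)

q = torch.randn(2 * 16, grid * grid, head_dim)  # (batch*heads, patches, head_dim)
q_rot = rope(q)
assert q_rot.shape == q.shape                   # rotation preserves the shape
```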
special_tokens_map.json ADDED
@@ -0,0 +1 @@
1
+ {"bos_token": {"content": "<|startoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "eos_token": {"content": "<|endoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "unk_token": {"content": "<|endoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true}, "pad_token": "<|endoftext|>"}
tokenizer_config.json ADDED
@@ -0,0 +1 @@
1
+ {"unk_token": {"content": "<|endoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "bos_token": {"content": "<|startoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "eos_token": {"content": "<|endoftext|>", "single_word": false, "lstrip": false, "rstrip": false, "normalized": true, "__type": "AddedToken"}, "pad_token": "<|endoftext|>", "add_prefix_space": false, "errors": "replace", "do_lower_case": true, "name_or_path": "openai/clip-vit-base-patch32", "model_max_length": 77, "special_tokens_map_file": "./special_tokens_map.json", "tokenizer_class": "CLIPTokenizer"}
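The tokenizer files follow the standard CLIP BPE format (77-token context, lower-casing, `<|startoftext|>`/`<|endoftext|>` specials), so the stock `CLIPTokenizer` should load them directly; the repo id below is an assumption.

```python
# Sketch: loading the bundled CLIP-style tokenizer (repo id assumed).
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("NVIDIA/QLIP-L-14-392")
batch = tokenizer(["a photo of a cat"], padding="max_length", max_length=77, return_tensors="pt")
print(batch.input_ids.shape)  # torch.Size([1, 77])
```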
vocab.json ADDED
The diff for this file is too large to render. See raw diff