xiaorui638 committed
Commit 27b1752 · verified · 1 Parent(s): 607ffa0

Update README.md

Files changed (1): README.md (+60 −0)
README.md CHANGED
FLAIR was introduced in the paper [FLAIR: VLM with Fine-grained Language-informed Image Representations](https://arxiv.org/abs/2412.03561). Based on the ViT-B-16 model from [OpenCLIP](https://github.com/mlfoundations/open_clip), FLAIR features text-conditioned attention pooling at the end of its vision transformer. Pre-trained on MLLM-recaptioned datasets from [DreamLIP](https://huggingface.co/datasets/qidouxiong619/dreamlip_long_captions), FLAIR achieves strong performance in tasks such as zero-shot image-text retrieval and zero-shot segmentation.

**Usage**

We offer detailed usage instructions in our [GitHub repo](https://github.com/ExplainableML/flair). Example usage:

```python
# ... (the body of the example is collapsed in this diff; see the GitHub repo for the full snippet) ...
print("logits computed using CLIP's way:", clip_logits)  # [12.4609, 15.6797, -3.8535, -0.2281]
```
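
For reference, these CLIP-style logits are just temperature-scaled cosine similarities between global embeddings. A self-contained toy illustration (random tensors and an assumed scale of 100, not actual FLAIR outputs):

```python
import torch
import torch.nn.functional as F

img = F.normalize(torch.randn(1, 512), dim=-1)  # one global image embedding
txt = F.normalize(torch.randn(4, 512), dim=-1)  # four caption embeddings
logit_scale = torch.tensor(100.0)               # assumed temperature, akin to CLIP's exp(logit_scale)
print(logit_scale * img @ txt.t())              # (1, 4) scores, analogous to clip_logits above
```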

As its primary way of generating logits, FLAIR uses text-conditioned attention pooling to pool the local image tokens into language-informed image representations; the logits are then obtained by multiplying these with the text features:

```python
def get_logits(self, image, text):
    """
    FLAIR's way to get the logits. Only used as a minimal example of computing
    the logits, not used in training or inference at this stage.
    """
    global_image_token, local_image_tokens = self.encode_image(image)
    global_text_token, _ = self.encode_text(text)
    global_text_token = self.text_post(global_text_token)  # (B*K, D)
    global_image_token, local_image_tokens = self.image_post(global_image_token), self.image_post(
        local_image_tokens)  # (B, D), (B, L, D)
    batch_size = global_image_token.shape[0]

    # Broadcast the global text token to (B, B*K, D); this is too costly in
    # large-scale training, so we downsample it to (B, B+K-1, D) in training
    global_text_token = global_text_token.unsqueeze(0).expand(batch_size, -1, -1)

    local_image_features = self.visual_proj(global_text_token, local_image_tokens, local_image_tokens)  # (B, B*K, D)

    text_features, image_features = F.normalize(global_text_token, dim=-1), F.normalize(local_image_features, dim=-1)

    image_logits = self.logit_scale.exp() * torch.einsum('bij,bij->bi', image_features, text_features)  # (B, B*K)
    image_logits += self.logit_bias

    text_logits = image_logits.T

    return image_logits, text_logits
```
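
Here `self.visual_proj` performs the text-conditioned attention pooling: each caption's global text token acts as a query over the local image tokens. As a rough mental model (a hypothetical stand-in, not the repo's actual module, which may differ in projections and head count), it behaves like a single cross-attention layer:

```python
import torch
import torch.nn as nn

class TextConditionedAttnPool(nn.Module):
    """Hypothetical minimal stand-in for FLAIR's visual_proj."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_queries, image_keys, image_values):
        # text_queries: (B, B*K, D) global text tokens used as queries
        # image_keys, image_values: (B, L, D) local image tokens
        pooled, _ = self.attn(text_queries, image_keys, image_values)
        return pooled  # (B, B*K, D): one pooled image feature per (image, caption) pair

pool = TextConditionedAttnPool()
text_q = torch.randn(4, 4, 512)    # B=4 images, B*K=4 captions (K=1 caption per image)
img_kv = torch.randn(4, 196, 512)  # 196 local tokens from a ViT-B/16 at 224x224
print(pool(text_q, img_kv, img_kv).shape)  # torch.Size([4, 4, 512])
```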

Thanks to the global loss, FLAIR also enforces matching between global-level image and text features. Therefore, just like the original CLIP, FLAIR can also produce logits from the global image and text features alone:

```python
def get_logits_as_clip(self, image, text):
    """
    FLAIR can also generate the global-to-global logits as the original CLIP does.
    """
    global_image_token, _ = self.encode_image(image)
    global_text_token, _ = self.encode_text(text)

    global_image_token = self.image_post(global_image_token)  # (B, D)
    global_text_token = self.text_post(global_text_token)  # (B*K, D)

    image_features, text_features = F.normalize(global_image_token, dim=-1), F.normalize(global_text_token, dim=-1)

    image_logits = self.logit_scale.exp() * image_features @ text_features.t()
    text_logits = image_logits.T

    return image_logits, text_logits
```
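
Both paths return (B, B*K) image-to-text score matrices, so they can be swapped freely at evaluation time. A minimal usage sketch, assuming `model` is a loaded FLAIR model and `images`/`texts` are already preprocessed and tokenized batches (the exact loading code is in the GitHub repo):

```python
import torch

# `model`, `images`, `texts` are assumed to be prepared as in the repo's example
with torch.no_grad():
    fine_logits, _ = model.get_logits(images, texts)            # fine-grained, text-conditioned pooling
    global_logits, _ = model.get_logits_as_clip(images, texts)  # CLIP-style global matching

# For image-to-text retrieval, each image picks its highest-scoring caption
print(fine_logits.argmax(dim=-1))
print(global_logits.argmax(dim=-1))
```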

**Citation**

If you find our work useful, please consider citing:

```bibtex
@article{xiao2024flair,
  title={FLAIR: VLM with Fine-grained Language-informed Image Representations},
  author={Xiao, Rui and Kim, Sanghwan and Georgescu, Mariana-Iuliana and Akata, Zeynep and Alaniz, Stephan},
  journal={arXiv preprint arXiv:2412.03561},
  year={2024}
}
```