---
license: other
license_name: myvlm-snap-license
license_link: https://github.com/snap-research/MyVLM/blob/master/LICENSE
---
# MyVLM

**Paper:** https://arxiv.org/abs/2403.14599
 
**Project Page:** https://snap-research.github.io/MyVLM/

**Code:** https://github.com/snap-research/MyVLM


# MyVLM Concept Heads & Concept Embeddings
As part of our [MyVLM code](https://github.com/snap-research/MyVLM) release, we have also released pretrained concept heads and concept embeddings for all 29 objects used in the paper. 

These can be loaded using the `CLIPConceptHead` class and used with our inference scripts to reproduce the results reported in the paper.

This repository contains five concept heads for each object, corresponding to five different training seeds and the image sets used for training each of them.
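
To fetch the checkpoints locally, one option is the `huggingface_hub` client. The sketch below is illustrative only; the `repo_id` is a placeholder that should be replaced with this repository's actual identifier.

```python
# Minimal sketch: download the concept heads and embeddings from this repository.
# NOTE: the repo_id below is a placeholder -- substitute this repository's actual id.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="<this-repo-id>")
print(f"Files downloaded to: {local_dir}")
```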

## Concept Heads

<p align="center">
<img src="docs/concept_head.jpg" width="200px"/>  
For each user-specific concept, we introduce an external concept head designed to identify the presence of the concept within an image.
</p>


As mentioned in the paper, we have two types of concept heads: 
1. A facial recognition model for recognizing individuals
2. A CLIP-based concept head for recognizing user-specific objects

For faces, we use the `buffalo_l` face detection and face recognition model from [insightface](https://github.com/deepinsight/insightface/tree/master).
See `concept_heads/face_recognition/head.py` for usage.
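
For reference, a minimal sketch of running the `buffalo_l` pack with `insightface` directly is shown below. The MyVLM wrapper in `concept_heads/face_recognition/head.py` handles this internally, and the image path here is illustrative.

```python
# Minimal sketch of the underlying insightface usage (buffalo_l detection + recognition).
# The image path is illustrative; see concept_heads/face_recognition/head.py for the actual wrapper.
import cv2
from insightface.app import FaceAnalysis

app = FaceAnalysis(name="buffalo_l")        # detection + ArcFace recognition models
app.prepare(ctx_id=0, det_size=(640, 640))  # ctx_id=0 -> first GPU, ctx_id=-1 -> CPU

image = cv2.imread("example.jpg")           # BGR image, as insightface expects
faces = app.get(image)                      # list of detected faces
for face in faces:
    print(face.bbox, face.normed_embedding.shape)  # bounding box + 512-d identity embedding
```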

For objects, we train a single linear layer over features extracted from a CLIP ViT-H/14 model (`DFN5B-CLIP-ViT-H-14-384`).  
See `concept_heads/clip/head.py` for usage. 
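
A conceptual sketch of such a head is shown below: a single linear layer applied to frozen CLIP image features. The feature dimension (1024), the binary output, the Hugging Face hub id, and the image path are assumptions made for illustration; the actual implementation is in `concept_heads/clip/head.py`.

```python
# Conceptual sketch of a CLIP-based concept head: a single linear probe over frozen
# CLIP ViT-H/14 image features. The hub id, feature dim, and image path are assumptions.
import torch
import torch.nn as nn
import open_clip
from PIL import Image

# Load the frozen CLIP backbone (DFN5B-CLIP-ViT-H-14-384).
clip_model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:apple/DFN5B-CLIP-ViT-H-14-384"
)
clip_model.eval()

# Single linear layer mapping CLIP image features to a "concept present" score.
concept_head = nn.Linear(1024, 1)

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # illustrative image path
with torch.no_grad():
    features = clip_model.encode_image(image)               # (1, 1024) image features
score = torch.sigmoid(concept_head(features))               # probability the concept is present
```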


## Concept Embeddings
<p align="center">
<img src="docs/method.jpg" width="800px"/>  
Once the presence of a user-specific concept has been identified within an image, a learned concept embedding representing the object or individual guides the LLM in incorporating the concept into its personalized textual response.
</p>


The concept embeddings are saved as `.pt` files in the following format: 

```
{
    10: {
        "keys": torch.Tensor(),    # the keys used for optimizing the concept embedding
        "values": torch.Tensor(),  # the concept embedding itself
    },
    ...
    20: {
        "keys": torch.Tensor(),
        "values": torch.Tensor(),
    },
    ...
}
```
Each entry in the dictionary corresponds to a different checkpoint saved during the optimization process.
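
For example, the checkpoint saved at one step could be read as follows; the file name and step number below are illustrative.

```python
# Sketch: load a concept-embedding checkpoint and pick one optimization step.
# The file name and step number are illustrative.
import torch

checkpoints = torch.load("concept_embedding.pt", map_location="cpu")
print(sorted(checkpoints.keys()))     # available optimization steps, e.g. [10, 20, ...]

step = 20
keys = checkpoints[step]["keys"]      # keys used for optimizing the concept embedding
values = checkpoints[step]["values"]  # the concept embedding itself
```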

We provide the concept embeddings for personalized captioning using both BLIP-2 and LLaVA.


## License
This sample code is made available by Snap Inc. for non-commercial, academic purposes only.  
Please see the full license [here](https://github.com/snap-research/MyVLM/blob/master/LICENSE).