---
license: other
license_name: myvlm-snap-license
license_link: https://github.com/snap-research/MyVLM/blob/master/LICENSE
---
# MyVLM
**Paper:** https://arxiv.org/abs/2403.14599
**Project Page:** https://snap-research.github.io/MyVLM/
**Code:** https://github.com/snap-research/MyVLM
# MyVLM Concept Heads & Concept Embeddings
As part of our [MyVLM code](https://github.com/snap-research/MyVLM) release, we have also released pretrained concept heads and concept embeddings for all 29 objects used in the paper.
These can be loaded via the `CLIPConceptHead` class and used with our inference scripts to reproduce the results reported in the paper.
This repository contains five concept heads per object, corresponding to five different training seeds, each trained on a different set of images.
## Concept Heads
<p align="center">
<img src="docs/concept_head.jpg" width="200px"/>
For each user-specific concept, we introduce an external concept head designed to identify the presence of the concept within an image.
</p>
As mentioned in the paper, we have two types of concept heads:
1. A facial recognition model for recognizing individuals
2. A CLIP-based concept head for recognizing user-specific objects
For faces, we use the `buffalo_l` face detection and recognition models from [insightface](https://github.com/deepinsight/insightface/tree/master).
See `concept_heads/face_recognition/head.py` for usage.
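The snippet below is a minimal sketch of the underlying insightface calls, not the MyVLM wrapper itself (see `concept_heads/face_recognition/head.py` for that); the reference-embedding file, detection size, and similarity threshold are illustrative assumptions.

```python
import cv2
import numpy as np
from insightface.app import FaceAnalysis

# Load the buffalo_l pack (face detection + recognition models).
app = FaceAnalysis(name="buffalo_l")
app.prepare(ctx_id=0, det_size=(640, 640))  # ctx_id=0 -> first GPU, -1 -> CPU

image = cv2.imread("query.jpg")  # insightface expects a BGR image
faces = app.get(image)           # detected faces, each with a 512-d identity embedding

# Compare each detected face against a precomputed embedding of the target person
# ("my_person_embedding.npy" is a hypothetical file, not part of this release).
reference = np.load("my_person_embedding.npy")
reference = reference / np.linalg.norm(reference)
for face in faces:
    similarity = float(face.normed_embedding @ reference)
    if similarity > 0.5:  # illustrative threshold
        print(f"Target person detected (cosine similarity: {similarity:.3f})")
```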
For objects, we train a single linear layer over features extracted from a CLIP ViT-H/14 model (`DFN5B-CLIP-ViT-H-14-384`).
See `concept_heads/clip/head.py` for usage.
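As a rough illustration of what this head looks like at inference time, the sketch below loads the backbone through `open_clip` (assuming the `hf-hub:apple/DFN5B-CLIP-ViT-H-14-384` checkpoint) and applies a single linear layer on top; the exact feature choice, file layout, and loading logic live in `CLIPConceptHead` (`concept_heads/clip/head.py`) and may differ from this sketch.

```python
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen DFN5B-CLIP-ViT-H-14-384 backbone used as the feature extractor.
model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:apple/DFN5B-CLIP-ViT-H-14-384"
)
model = model.to(device).eval()

# Single linear layer over the (1024-d) CLIP image features -> concept logit.
head = torch.nn.Linear(1024, 1).to(device)
state = torch.load("concept_heads/my_object/seed_0.pt", map_location=device)  # hypothetical path
head.load_state_dict(state)
head.eval()

image = preprocess(Image.open("query.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    features = model.encode_image(image)
    features = features / features.norm(dim=-1, keepdim=True)
    prob = torch.sigmoid(head(features)).item()
print("Concept present:", prob > 0.5)  # illustrative decision threshold
```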
## Concept Embeddings
<p align="center">
<img src="docs/method.jpg" width="800px"/>
Once the presence of a user-specific concept has been identified within an image, a learned concept embedding representing the object or individual is used to guide the LLM in incorporating the concept into its personalized textual response.
</p>
The concept embeddings are saved as `.pt` files in the following format:
```
{
    10: {
        "keys": torch.Tensor(),   # the keys used for optimizing the concept embedding
        "values": torch.Tensor(), # the concept embedding itself
    },
    ...
    20: {
        "keys": torch.Tensor(),
        "values": torch.Tensor(),
    },
    ...
}
```
where each entry in the dictionary represents a different checkpoint during the optimization process.
We provide the concept embeddings for personalized captioning using both BLIP-2 and LLaVA.
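For reference, here is a minimal sketch of reading one of these files and picking a checkpoint; the file name is hypothetical, and the released inference scripts handle checkpoint selection for you.

```python
import torch

# Load a concept embedding file (hypothetical path).
embeddings = torch.load("concept_embeddings/my_object_blip2.pt", map_location="cpu")

print("Available checkpoints:", sorted(embeddings.keys()))  # e.g. [10, 20, ...]

step = sorted(embeddings.keys())[-1]   # e.g. take the last saved checkpoint
keys = embeddings[step]["keys"]        # keys used when optimizing the concept embedding
values = embeddings[step]["values"]    # the concept embedding itself
print(keys.shape, values.shape)
```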
## License
This sample code is made available by Snap Inc. for non-commercial, academic purposes only.
Please see the full license [here](https://github.com/snap-research/MyVLM/blob/master/LICENSE).