---
license: other
license_name: myvlm-snap-license
license_link: https://github.com/snap-research/MyVLM/blob/master/LICENSE
---

# MyVLM

**Paper:** https://arxiv.org/abs/2403.14599

**Project Page:** https://snap-research.github.io/MyVLM/

**Code:** https://github.com/snap-research/MyVLM

# MyVLM Concept Heads & Concept Embeddings

As part of our [MyVLM code](https://github.com/snap-research/MyVLM) release, we also provide pretrained concept heads and concept embeddings for all 29 objects used in the paper.

These can be loaded using the `CLIPConceptHead` class and our inference scripts to reproduce the paper results.

This repository contains five concept heads for each object, corresponding to five different training seeds and sets of training images.

## Concept Heads

<p align="center">
<img src="docs/concept_head.jpg" width="200px"/>
For each user-specific concept, we introduce an external concept head designed to identify the presence of the concept within an image.
</p>

As mentioned in the paper, we have two types of concept heads:
1. A facial recognition model for recognizing individuals
2. A CLIP-based concept head for recognizing user-specific objects

For faces, we use the `buffalo_l` face detection and recognition models from [insightface](https://github.com/deepinsight/insightface/tree/master).
See `concept_heads/face_recognition/head.py` for usage.

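As a reference, here is a minimal sketch of the underlying insightface API rather than the repository's wrapper class; the image path and the 0.5 similarity threshold are illustrative placeholders:

```python
# Minimal sketch using insightface directly (see concept_heads/face_recognition/head.py
# for the actual wrapper). The image path and threshold below are placeholders.
import cv2
import numpy as np
from insightface.app import FaceAnalysis

app = FaceAnalysis(name="buffalo_l")  # bundles the face detection and recognition models
app.prepare(ctx_id=0, det_size=(640, 640))

image = cv2.imread("my_photo.jpg")    # placeholder path; BGR uint8 image
faces = app.get(image)                # detected faces, each with a 512-d identity embedding

def matches_concept(face, reference_embedding, threshold=0.5):
    """Cosine similarity between a detected face and a stored reference identity."""
    ref = reference_embedding / np.linalg.norm(reference_embedding)
    return float(np.dot(face.normed_embedding, ref)) >= threshold
```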
For objects, we train a single linear layer over features extracted from a CLIP ViT-H/14 model (`DFN5B-CLIP-ViT-H-14-384`).
See `concept_heads/clip/head.py` for usage.

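As a rough sketch of this setup (not the released implementation), the example below assumes the model can be loaded through `open_clip` from the `apple/DFN5B-CLIP-ViT-H-14-384` Hugging Face repository; the hub path, image path, and 0.5 decision threshold are assumptions:

```python
# Minimal sketch of a linear concept head over frozen CLIP image features
# (see concept_heads/clip/head.py for the actual implementation).
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:apple/DFN5B-CLIP-ViT-H-14-384"  # assumed hub path for DFN5B-CLIP-ViT-H-14-384
)
model.eval()

image = preprocess(Image.open("my_object.jpg")).unsqueeze(0)  # placeholder image path
with torch.no_grad():
    features = model.encode_image(image)  # frozen CLIP features, shape (1, feature_dim)

# A single linear layer maps the CLIP features to a concept logit.
head = torch.nn.Linear(features.shape[-1], 1)
prob = torch.sigmoid(head(features))      # probability that the concept is present
is_present = bool(prob.item() > 0.5)      # assumed decision threshold
```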
## Concept Embeddings

<p align="center">
<img src="docs/method.jpg" width="800px"/>
Having identified the presence of a user-specific concept within an image, a learned concept embedding representing the object or individual is used to guide the LLM in incorporating the concept into its personalized textual response.
</p>

The concept embeddings are saved as `.pt` files in the following format:

```
{
    10: {
        "keys": torch.Tensor(),    # the keys used for optimizing the concept embedding
        "values": torch.Tensor(),  # the concept embedding itself
    },
    ...
    20: {
        "keys": torch.Tensor(),
        "values": torch.Tensor(),
    },
    ...
}
```

where each entry in the dictionary represents a different checkpoint during the optimization process.

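For example, a concept embedding checkpoint can be loaded and inspected as follows (the file name and the choice of step are placeholders):

```python
# Load a saved concept embedding file and pick one optimization checkpoint;
# the file name and the choice of the latest step are placeholders.
import torch

checkpoints = torch.load("concept_embedding.pt", map_location="cpu")

step = max(checkpoints.keys())        # e.g. take the last saved optimization step
keys = checkpoints[step]["keys"]      # keys used for optimizing the concept embedding
values = checkpoints[step]["values"]  # the concept embedding itself
```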
We provide the concept embeddings for personalized captioning using both BLIP-2 and LLaVA.

## License

This sample code is made available by Snap Inc. for non-commercial, academic purposes only.
Please see the full license [here](https://github.com/snap-research/MyVLM/blob/master/LICENSE).