---
license: other
license_name: myvlm-snap-license
license_link: https://github.com/snap-research/MyVLM/blob/master/LICENSE
---

# MyVLM

**Paper:** https://arxiv.org/abs/2403.14599

**Project Page:** https://snap-research.github.io/MyVLM/

**Code:** https://github.com/snap-research/MyVLM

# MyVLM Concept Heads & Concept Embeddings

As part of our [MyVLM code](https://github.com/snap-research/MyVLM) release, we also provide pretrained concept heads and concept embeddings for all 29 objects used in the paper.

These can be loaded using the `CLIPConceptHead` class and our inference scripts to reproduce the paper results.

This repository contains five concept heads for each object, corresponding to five different training seeds and sets of training images.

## Concept Heads

<p align="center">
<img src="docs/concept_head.jpg" width="200px"/>
For each user-specific concept, we introduce an external concept head designed to identify the presence of the concept within an image.
</p>

As mentioned in the paper, we have two types of concept heads:
1. A facial recognition model for recognizing individuals
2. A CLIP-based concept head for recognizing user-specific objects

For faces, we use the `buffalo_l` face detection and recognition models from [insightface](https://github.com/deepinsight/insightface/tree/master).
See `concept_heads/face_recognition/head.py` for usage.

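As a reference, here is a minimal sketch of the underlying insightface API rather than the repository's wrapper class; the image path and the 0.5 similarity threshold are illustrative placeholders:

```python
# Minimal sketch using insightface directly (see concept_heads/face_recognition/head.py
# for the actual wrapper). The image path and threshold below are placeholders.
import cv2
import numpy as np
from insightface.app import FaceAnalysis

app = FaceAnalysis(name="buffalo_l")  # bundles the face detection and recognition models
app.prepare(ctx_id=0, det_size=(640, 640))

image = cv2.imread("my_photo.jpg")    # placeholder path; BGR uint8 image
faces = app.get(image)                # detected faces, each with a 512-d identity embedding

def matches_concept(face, reference_embedding, threshold=0.5):
    """Cosine similarity between a detected face and a stored reference identity."""
    ref = reference_embedding / np.linalg.norm(reference_embedding)
    return float(np.dot(face.normed_embedding, ref)) >= threshold
```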
For objects, we train a single linear layer over features extracted from a CLIP ViT-H/14 model (`DFN5B-CLIP-ViT-H-14-384`).
See `concept_heads/clip/head.py` for usage.

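As a rough sketch of this setup (not the released implementation), the example below assumes the model can be loaded through `open_clip` from the `apple/DFN5B-CLIP-ViT-H-14-384` Hugging Face repository; the hub path, image path, and 0.5 decision threshold are assumptions:

```python
# Minimal sketch of a linear concept head over frozen CLIP image features
# (see concept_heads/clip/head.py for the actual implementation).
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:apple/DFN5B-CLIP-ViT-H-14-384"  # assumed hub path for DFN5B-CLIP-ViT-H-14-384
)
model.eval()

image = preprocess(Image.open("my_object.jpg")).unsqueeze(0)  # placeholder image path
with torch.no_grad():
    features = model.encode_image(image)  # frozen CLIP features, shape (1, feature_dim)

# A single linear layer maps the CLIP features to a concept logit.
head = torch.nn.Linear(features.shape[-1], 1)
prob = torch.sigmoid(head(features))      # probability that the concept is present
is_present = bool(prob.item() > 0.5)      # assumed decision threshold
```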
## Concept Embeddings

<p align="center">
<img src="docs/method.jpg" width="800px"/>
Having identified the presence of a user-specific concept within an image, a learned concept embedding representing the object or individual is used to guide the LLM in incorporating the concept into its personalized textual response.
</p>

The concept embeddings are saved as `.pt` files in the following format:

```
{
    10: {
        "keys": torch.Tensor(),    # the keys used for optimizing the concept embedding
        "values": torch.Tensor(),  # the concept embedding itself
    },
    ...
    20: {
        "keys": torch.Tensor(),
        "values": torch.Tensor(),
    },
    ...
}
```

where each entry in the dictionary represents a different checkpoint during the optimization process.

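For example, a concept embedding checkpoint can be loaded and inspected as follows (the file name and the choice of step are placeholders):

```python
# Load a saved concept embedding file and pick one optimization checkpoint;
# the file name and the choice of the latest step are placeholders.
import torch

checkpoints = torch.load("concept_embedding.pt", map_location="cpu")

step = max(checkpoints.keys())        # e.g. take the last saved optimization step
keys = checkpoints[step]["keys"]      # keys used for optimizing the concept embedding
values = checkpoints[step]["values"]  # the concept embedding itself
```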
We provide the concept embeddings for personalized captioning using both BLIP-2 and LLaVA.

## License

This sample code is made available by Snap Inc. for non-commercial, academic purposes only.
Please see the full license [here](https://github.com/snap-research/MyVLM/blob/master/LICENSE).