---
license: apache-2.0
datasets:
- TIGER-Lab/MMEB-train
language:
- en
base_model:
- Qwen/Qwen2-VL-7B-Instruct
library_name: transformers
---

A new checkpoint trained using [Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) with an enhanced training setup (LoRA tuning, a batch size of 2048, and a maximum sub-dataset size of 100k). It shows significantly improved performance on MMEB and Flickr30K compared to our previous models, which used Phi-3.5 and llava-v1.6-mistral as backbones.

This repo contains the code and data for [VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks](https://arxiv.org/abs/2410.05160). In this paper, we focus on building a unified multimodal embedding model suitable for a wide range of tasks. Our approach is based on transforming an existing, well-trained Vision-Language Model (VLM) into an embedding model.
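
As a rough, illustrative sketch of what the LoRA tuning mentioned above can look like, the snippet below wraps the Qwen2-VL backbone with LoRA adapters via `peft`. It is not the repo's actual training code; the rank, alpha, dropout, and target modules are assumptions, and the real configuration lives in the GitHub repo linked below.

```python
# Illustrative only: LoRA-wrapping the Qwen2-VL backbone with peft.
# r, lora_alpha, lora_dropout, and target_modules are hypothetical values,
# not the ones used to train this checkpoint.
import torch
from transformers import Qwen2VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16
)
lora_cfg = LoraConfig(
    r=8,                          # hypothetical LoRA rank
    lora_alpha=16,                # hypothetical scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the small LoRA adapters are updated during training
```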

## GitHub
- [GitHub](https://github.com/TIGER-AI-Lab/VLM2Vec)


## Data

Our model is trained on MMEB-train and evaluated on MMEB-eval with contrastive learning, using only in-batch negatives (a minimal sketch of this objective follows the data links below).

- Train data: https://huggingface.co/datasets/TIGER-Lab/MMEB-train
- Eval data: https://huggingface.co/datasets/TIGER-Lab/MMEB-eval

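For readers who want to see what in-batch negatives mean concretely, here is a minimal, self-contained sketch of an InfoNCE-style contrastive loss. It illustrates the general technique only and is not the repo's actual training code; the temperature value is arbitrary.

```python
# Illustrative InfoNCE with in-batch negatives: for each query embedding, the
# matching target in the batch is the positive, and every other target in the
# same batch acts as a negative.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(qry_reps: torch.Tensor,
                              tgt_reps: torch.Tensor,
                              temperature: float = 0.02) -> torch.Tensor:
    # qry_reps, tgt_reps: (batch_size, dim), assumed L2-normalized
    logits = qry_reps @ tgt_reps.T / temperature       # (B, B) similarity matrix
    labels = torch.arange(qry_reps.size(0), device=qry_reps.device)
    return F.cross_entropy(logits, labels)             # positives sit on the diagonal

# Toy usage with random embeddings
q = F.normalize(torch.randn(4, 16), dim=-1)
t = F.normalize(torch.randn(4, 16), dim=-1)
print(in_batch_contrastive_loss(q, t))
```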

## Performance
This model outperforms the baselines and the previous versions of VLM2Vec by a large margin.

![image/png](https://github.com/TIGER-AI-Lab/VLM2Vec/blob/main/figures/vlm2vec_results.png?raw=true)


## How to use VLM2Vec
(For more details, please refer to our GitHub repo; below is a simple demo.)

First, clone our GitHub repository and install the dependencies:
```bash
git clone https://github.com/TIGER-AI-Lab/VLM2Vec.git
cd VLM2Vec
pip install -r requirements.txt
```

```python
from src.model import MMEBModel
from src.arguments import ModelArguments
from src.utils import load_processor

import torch
from PIL import Image

# Configure the checkpoint: last-token pooling with L2-normalized embeddings
# on the Qwen2-VL backbone.
model_args = ModelArguments(
    model_name='TIGER-Lab/VLM2Vec-Qwen2VL',
    pooling='last',
    normalize=True,
    model_backbone='qwen2_vl')

processor = load_processor(model_args)

model = MMEBModel.load(model_args)
model.eval()
model = model.to('cuda', dtype=torch.bfloat16)

# Image + Text -> Text: encode an image plus an instruction as the query.
inputs = processor(text='<image> Represent the given image with the following question: What is in the image',
                   images=Image.open('figures/example.jpg'),
                   return_tensors="pt")
inputs = {key: value.to('cuda') for key, value in inputs.items()}
qry_output = model(qry=inputs)["qry_reps"]

# Score the query against a matching text target.
string = 'A cat and a dog'
inputs = processor(text=string,
                   images=None,
                   return_tensors="pt")
inputs = {key: value.to('cuda') for key, value in inputs.items()}
tgt_output = model(tgt=inputs)["tgt_reps"]
print(string, '=', model.compute_similarity(qry_output, tgt_output))
## A cat and a dog = tensor([[0.4414]], device='cuda:0', dtype=torch.bfloat16)

# A less relevant target gets a lower similarity score.
string = 'A cat and a tiger'
inputs = processor(text=string,
                   images=None,
                   return_tensors="pt")
inputs = {key: value.to('cuda') for key, value in inputs.items()}
tgt_output = model(tgt=inputs)["tgt_reps"]
print(string, '=', model.compute_similarity(qry_output, tgt_output))
## A cat and a tiger = tensor([[0.3555]], device='cuda:0', dtype=torch.bfloat16)
```
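
Continuing from the demo above, the same API can also be used in the reverse direction (text query, image target). The instruction strings below are assumptions for illustration; the exact prompts used for each MMEB task are defined in the GitHub repo.

```python
# Text -> Image (hypothetical prompts; reuses `processor` and `model` from the demo above)
inputs = processor(text='Find me an everyday image that matches the given caption: A cat and a dog.',
                   images=None,
                   return_tensors="pt")
inputs = {key: value.to('cuda') for key, value in inputs.items()}
qry_output = model(qry=inputs)["qry_reps"]

inputs = processor(text='<image> Represent the given image.',
                   images=Image.open('figures/example.jpg'),
                   return_tensors="pt")
inputs = {key: value.to('cuda') for key, value in inputs.items()}
tgt_output = model(tgt=inputs)["tgt_reps"]
print(model.compute_similarity(qry_output, tgt_output))
```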


## Citation
```
@article{jiang2024vlm2vec,
  title={VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks},
  author={Jiang, Ziyan and Meng, Rui and Yang, Xinyi and Yavuz, Semih and Zhou, Yingbo and Chen, Wenhu},
  journal={arXiv preprint arXiv:2410.05160},
  year={2024}
}
```