---
base_model:
- vidore/colqwen2.5omni-base
license: mit
library_name: colpali
language:
- en
tags:
- colpali
- vidore
- vidore-experimental
pipeline_tag: visual-document-retrieval
---

# ColQwen2.5-Omni: Visual+Audio Retriever based on Qwen2.5-Omni-3B-Instruct with ColBERT strategy

Check out the release [blogpost](https://huggingface.co/blog/manu/colqwen-omni-omnimodal-retrieval) for in-depth explanations and tutorials!

ColQwen-Omni is built on a novel architecture and training strategy that leverages Omnimodal Language Models to efficiently index documents from their visual features.
It is a Qwen2.5-Omni-3B extension that generates [ColBERT](https://arxiv.org/abs/2004.12832)-style multi-vector representations of text and images.
It was introduced in the paper [ColPali: Efficient Document Retrieval with Vision Language Models](https://arxiv.org/abs/2407.01449) and first released in [this repository](https://github.com/ManuelFay/colpali).

<p align="center"><img width=800 src="https://github.com/illuin-tech/colpali/blob/main/assets/colpali_architecture.webp?raw=true"/></p>
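
Concretely, "ColBERT-style" means that a query and a document are each embedded as a *bag* of token-level vectors rather than a single vector, and relevance is scored by late interaction: each query token is matched to its most similar document token, and these maxima are summed. A rough sketch of that scoring for a single query/document pair (the batched, padded implementation is `colpali-engine`'s `score_multi_vector`):

```python
import torch

def maxsim_score(query_embeds: torch.Tensor, doc_embeds: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) relevance of one document for one query.

    query_embeds: (num_query_tokens, dim) multi-vector query embedding
    doc_embeds:   (num_doc_tokens, dim)   multi-vector document embedding
    """
    sim = query_embeds @ doc_embeds.T   # (num_query_tokens, num_doc_tokens) similarities
    return sim.max(dim=1).values.sum()  # best-matching doc token per query token, summed
```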

## Version specificity

This model takes images at their native, dynamic resolution as input and does not resize them, so their aspect ratio is preserved (unlike ColPali, which resizes images and distorts the aspect ratio).
The maximal resolution is capped so that at most 1024 image patches are created. Experiments show clear improvements with larger numbers of image patches, at the cost of higher memory requirements.
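
Since the maximal resolution is capped on the model side, you normally do not need to resize images yourself. If you want to reason about the memory cost of a page, the back-of-the-envelope sketch below estimates a patch count and pre-downscales an image to a chosen budget. The 28-pixel effective patch size is an assumption (Qwen2.5-VL-style 14-pixel patches merged 2x2), not something this card specifies.

```python
import math
from PIL import Image

# Assumption (not read from this model's config): each visual token covers an
# effective 28x28-pixel area, i.e. 14-pixel patches merged 2x2.
EFFECTIVE_PATCH_SIZE = 28

def estimated_num_patches(image: Image.Image) -> int:
    """Rough estimate of how many visual tokens an image will produce."""
    width, height = image.size
    return math.ceil(width / EFFECTIVE_PATCH_SIZE) * math.ceil(height / EFFECTIVE_PATCH_SIZE)

def downscale_to_patch_budget(image: Image.Image, max_patches: int = 1024) -> Image.Image:
    """Downscale an image so its estimated patch count stays within the budget."""
    n = estimated_num_patches(image)
    if n <= max_patches:
        return image  # already within budget, keep the native resolution
    scale = math.sqrt(max_patches / n)
    width, height = image.size
    return image.resize((max(1, int(width * scale)), max(1, int(height * scale))))
```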

This version is trained with `colpali-engine==0.3.11`.

Data is the same as the ColPali data described in the paper.

## Model Training

### Dataset

The audio retrieval capabilities are acquired zero-shot, as the entire training data is purely image-text matching; the audio and vision towers are frozen during training.

Our training dataset of 127,460 query-page pairs comprises the train sets of openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%).
Our training set is fully English by design, enabling us to study zero-shot generalization to non-English languages. We explicitly verify that no multi-page PDF document is used both in [*ViDoRe*](https://huggingface.co/collections/vidore/vidore-benchmark-667173f98e70a1c0fa4db00d) and in the train set to prevent evaluation contamination.
A validation set is created with 2% of the samples to tune hyperparameters.

*Note: Multilingual data is present in the pretraining corpus of the language model and most probably in the multimodal training.*

## Usage

Make sure `colpali-engine` is installed from source or with a version greater than 0.3.11.

```bash
pip install git+https://github.com/illuin-tech/colpali
```

```python
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers.utils.import_utils import is_flash_attn_2_available

from colpali_engine.models import ColQwen2_5Omni, ColQwen2_5OmniProcessor

model = ColQwen2_5Omni.from_pretrained(
    "vidore/colqwen-omni-v0.1",
    torch_dtype=torch.bfloat16,
    device_map="cuda",  # or "mps" if on Apple Silicon
    attn_implementation="flash_attention_2" if is_flash_attn_2_available() else None,
).eval()
processor = ColQwen2_5OmniProcessor.from_pretrained("vidore/colqwen-omni-v0.1")

# Load a small audio corpus to index
dataset = load_dataset("eustlb/dailytalk-conversations-grouped", split="train[:500]")
audios = [x["array"] for x in dataset["audio"]]

dataloader = DataLoader(
    dataset=audios,
    batch_size=2,
    shuffle=False,
    collate_fn=lambda x: processor.process_audios(x),
)

# Embed the audio corpus into multi-vector representations
ds = []
for batch_doc in tqdm(dataloader):
    with torch.no_grad():
        batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()}
        embeddings_doc = model(**batch_doc)
    ds.extend(list(torch.unbind(embeddings_doc.to("cpu"))))

def get_results(query: str, k=10):
    batch_queries = processor.process_queries([query]).to(model.device)

    # Forward pass
    with torch.no_grad():
        query_embeddings = model(**batch_queries)

    scores = processor.score_multi_vector(query_embeddings, ds)
    # Return the indices of the top-k scoring audio samples
    return scores[0].topk(k).indices.tolist()

res = get_results("A person looking for a taxi")

# In a notebook (e.g. Google Colab), listen to the best match
from IPython.display import Audio, display
display(Audio(dataset[res[0]]["audio"]["array"], autoplay=True, rate=dataset[res[0]]["audio"]["sampling_rate"]))
```
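
The same model and scoring also work for visual documents. A minimal sketch, reusing `model`, `processor`, `DataLoader`, and `tqdm` from the block above; the page file names are placeholders, and `process_images` is assumed to be the image-side counterpart of `process_audios` in `colpali-engine`:

```python
from PIL import Image

# Placeholder paths: any list of PIL images of document pages works here.
pages = [Image.open(path) for path in ["page_1.png", "page_2.png"]]

page_dataloader = DataLoader(
    dataset=pages,
    batch_size=2,
    shuffle=False,
    collate_fn=lambda x: processor.process_images(x),
)

# Embed the pages into the same multi-vector space as the audio corpus above
page_embeddings = []
for batch_doc in tqdm(page_dataloader):
    with torch.no_grad():
        batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()}
        embeddings_doc = model(**batch_doc)
    page_embeddings.extend(list(torch.unbind(embeddings_doc.to("cpu"))))

# Queries are embedded exactly as before
query_batch = processor.process_queries(["A person looking for a taxi"]).to(model.device)
with torch.no_grad():
    query_embeddings = model(**query_batch)

scores = processor.score_multi_vector(query_embeddings, page_embeddings)
best_page = scores[0].argmax().item()
```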

## Contact

- Manuel Faysse: [email protected]
- Antonio Loison: [email protected]

## Citation

If you use any datasets or models from this organization in your research, please cite the original work as follows:

```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
```