|
--- |
|
license: cc-by-4.0 |
|
language: |
|
- multilingual |
|
- af |
|
- am |
|
- ar |
|
- as |
|
- az |
|
- be |
|
- bg |
|
- bn |
|
- br |
|
- bs |
|
- ca |
|
- cs |
|
- cy |
|
- da |
|
- de |
|
- el |
|
- en |
|
- eo |
|
- es |
|
- et |
|
- eu |
|
- fa |
|
- fi |
|
- fr |
|
- fy |
|
- ga |
|
- gd |
|
- gl |
|
- gu |
|
- ha |
|
- he |
|
- hi |
|
- hr |
|
- hu |
|
- hy |
|
- id |
|
- is |
|
- it |
|
- ja |
|
- jv |
|
- ka |
|
- kk |
|
- km |
|
- kn |
|
- ko |
|
- ku |
|
- ky |
|
- la |
|
- lo |
|
- lt |
|
- lv |
|
- mg |
|
- mk |
|
- ml |
|
- mn |
|
- mr |
|
- ms |
|
- my |
|
- ne |
|
- nl |
|
- 'no' |
|
- om |
|
- or |
|
- pa |
|
- pl |
|
- ps |
|
- pt |
|
- ro |
|
- ru |
|
- sa |
|
- sd |
|
- si |
|
- sk |
|
- sl |
|
- so |
|
- sq |
|
- sr |
|
- su |
|
- sv |
|
- sw |
|
- ta |
|
- te |
|
- th |
|
- tl |
|
- tr |
|
- ug |
|
- uk |
|
- ur |
|
- uz |
|
- vi |
|
- xh |
|
- yi |
|
- zh |
|
tags: |
|
- ColBERT |
|
- passage-retrieval |
|
--- |
|
|
|
<br><br> |
|
|
|
<p align="center"> |
|
<img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Jina AI logo" width="150px">
|
</p> |
|
|
|
|
|
<p align="center"> |
|
<b>Trained by <a href="https://jina.ai/">Jina AI</a>.</b>
|
</p> |
|
|
|
# Jina-ColBERT-v2 |
|
Jina ColBERT v2 (`jina-colbert-v2`) is a new model based on the [Jina-ColBERT architecture](https://jina.ai/news/what-is-colbert-and-late-interaction-and-why-they-matter-in-search/) that expands on the capabilities and performance of the `jina-colbert-v1-en` model. Like the previous release, it has Jina AI's 8192-token input context and the [improved efficiency, performance](https://jina.ai/news/what-is-colbert-and-late-interaction-and-why-they-matter-in-search/), and [explainability](https://jina.ai/news/ai-explainability-made-easy-how-late-interaction-makes-jina-colbert-transparent/) of token-level embeddings and late interaction.
|
|
|
This release adds new functionality and performance improvements:
|
|
|
- Multilingual support for dozens of languages, with strong performance on major global languages. |
|
- [Matryoshka embeddings](https://huggingface.co/blog/matryoshka), which let users flexibly trade efficiency against precision; see the sketch after this list.
|
- Superior retrieval performance when compared to the English-only `jina-colbert-v1-en`. |
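
As an illustration of the Matryoshka trade-off: the leading components of each token embedding carry the most information, so vectors can be truncated and re-normalized before indexing. A minimal sketch, assuming 128-dimensional token embeddings; `truncate_token_embeddings` is an illustrative helper, not a library function:

```python
import torch

def truncate_token_embeddings(vectors: torch.Tensor, dim: int = 64) -> torch.Tensor:
    # Keep the first `dim` components of each token embedding, then
    # re-normalize so late-interaction dot products remain cosine similarities.
    truncated = vectors[..., :dim]
    return torch.nn.functional.normalize(truncated, p=2, dim=-1)
```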
|
|
|
## Usage |
|
|
|
### Installation |
|
|
|
`jina-colbert-v2` was trained with flash attention and therefore requires `einops` and `flash_attn` to be installed.
|
|
|
To use the model, you can use either the Stanford ColBERT library or the `ragatouille` package.
|
|
|
```bash
pip install -U einops flash_attn
pip install -U ragatouille
pip install -U colbert-ai
```
|
|
|
### RAGatouille |
|
|
|
```python
from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("jinaai/jina-colbert-v2")

docs = [
    "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
    "Jina-ColBERT is a ColBERT-style model based on JinaBERT, supporting an 8k context length with fast and accurate retrieval.",
]

RAG.index(docs, index_name="demo")

query = "What does ColBERT do?"

results = RAG.search(query)
```
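
RAGatouille can also rerank candidates from another retriever (e.g., BM25) without building an index; a minimal sketch, assuming the `rerank` method available in recent RAGatouille releases:

```python
# Rerank externally retrieved candidates; no index required.
reranked = RAG.rerank(query="What does ColBERT do?", documents=docs, k=2)
```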
|
|
|
### Stanford ColBERT |
|
To build an index with the Stanford ColBERT library, you would typically run the following code on a GPU machine. See the [Stanford ColBERT](https://github.com/stanford-futuredata/ColBERT?tab=readme-ov-file#installation) documentation for installation details.
|
|
|
#### Indexing |
|
|
|
```python
from colbert import Indexer
from colbert.infra import ColBERTConfig

if __name__ == "__main__":
    config = ColBERTConfig(
        doc_maxlen=512,
        nbits=2,
    )
    indexer = Indexer(
        checkpoint="jinaai/jina-colbert-v2",
        config=config,
    )
    docs = [
        "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
        "Jina-ColBERT is a ColBERT-style model based on JinaBERT, supporting an 8k context length with fast and accurate retrieval.",
    ]
    indexer.index(name="demo", collection=docs)
```
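
By default, index files are written under an `experiments/` directory. To control the experiment name and the number of GPUs used, indexing is commonly wrapped in a `Run` context; a sketch under those assumptions (`nranks` is the GPU count, and `docs` is the collection defined above):

```python
from colbert import Indexer
from colbert.infra import ColBERTConfig, Run, RunConfig

if __name__ == "__main__":
    # nranks controls how many GPUs the indexer uses.
    with Run().context(RunConfig(nranks=1, experiment="jina-colbert-demo")):
        config = ColBERTConfig(doc_maxlen=512, nbits=2)
        indexer = Indexer(checkpoint="jinaai/jina-colbert-v2", config=config)
        indexer.index(name="demo", collection=docs)  # docs as defined above
```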
|
|
|
#### Searching |
|
|
|
```python
from colbert import Searcher
from colbert.infra import ColBERTConfig

k = 10

if __name__ == "__main__":
    config = ColBERTConfig(
        query_maxlen=128,
    )
    searcher = Searcher(
        index="demo",
        config=config,
    )
    query = "What does ColBERT do?"
    results = searcher.search(query, k=k)
```
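
In the Stanford ColBERT library, `search` returns three parallel lists: passage IDs, ranks, and scores. A short sketch of mapping results back to the indexed text via `searcher.collection`:

```python
# results is a (pids, ranks, scores) triple of parallel lists.
pids, ranks, scores = results
for pid, rank, score in zip(pids, ranks, scores):
    print(f"rank={rank} score={score:.2f} text={searcher.collection[pid]}")
```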
|
|
|
#### Creating vectors |
|
|
|
```python
from colbert.infra import ColBERTConfig
from colbert.modeling.checkpoint import Checkpoint

ckpt = Checkpoint("jinaai/jina-colbert-v2", colbert_config=ColBERTConfig())

docs = [
    "ColBERT is a novel ranking model that adapts deep LMs for efficient retrieval.",
    "Jina-ColBERT is a ColBERT-style model based on JinaBERT, supporting an 8k context length with fast and accurate retrieval.",
]

# Encode documents and queries into per-token embedding matrices.
doc_vectors = ckpt.docFromText(docs, bsize=2)
query_vectors = ckpt.queryFromText(["What does ColBERT do?"])
print(doc_vectors)
print(query_vectors)
```
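
Given per-token vectors for a query and a document, late interaction scores the pair with MaxSim: for each query token, take the maximum similarity over all document tokens, then sum over query tokens. A minimal sketch, assuming L2-normalized embeddings (so dot products are cosine similarities) and unpadded matrices:

```python
import torch

def maxsim_score(query_vectors: torch.Tensor, doc_vectors: torch.Tensor) -> torch.Tensor:
    # query_vectors: (num_query_tokens, dim); doc_vectors: (num_doc_tokens, dim).
    token_similarities = query_vectors @ doc_vectors.T  # (q_tokens, d_tokens)
    # Best-matching document token for each query token, summed over the query.
    return token_similarities.max(dim=1).values.sum()
```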
|
|
|
## Evaluation Results |
|
|
|
### Retrieval Benchmarks |
|
|
|
#### BEIR |
|
|
|
| **NDCG@10**        | **jina-colbert-v2** | **jina-colbert-v1** | **ColBERTv2.0** | **BM25** |
|--------------------|---------------------|---------------------|-----------------|----------|
| **avg**            | 0.531               | 0.502               | 0.496           | 0.440    |
| **nfcorpus**       | 0.346               | 0.338               | 0.337           | 0.325    |
| **fiqa**           | 0.408               | 0.368               | 0.354           | 0.236    |
| **trec-covid**     | 0.834               | 0.750               | 0.726           | 0.656    |
| **arguana**        | 0.366               | 0.494               | 0.465           | 0.315    |
| **quora**          | 0.887               | 0.823               | 0.855           | 0.789    |
| **scidocs**        | 0.186               | 0.169               | 0.154           | 0.158    |
| **scifact**        | 0.678               | 0.701               | 0.689           | 0.665    |
| **webis-touche**   | 0.274               | 0.270               | 0.260           | 0.367    |
| **dbpedia-entity** | 0.471               | 0.413               | 0.452           | 0.313    |
| **fever**          | 0.805               | 0.795               | 0.785           | 0.753    |
| **climate-fever**  | 0.239               | 0.196               | 0.176           | 0.213    |
| **hotpotqa**       | 0.766               | 0.656               | 0.675           | 0.603    |
| **nq**             | 0.640               | 0.549               | 0.524           | 0.329    |
|
|
|
|
|
|
|
#### MS MARCO Passage Retrieval |
|
|
|
| **MRR@10**  | **jina-colbert-v2** | **jina-colbert-v1** | **ColBERTv2.0** | **BM25** |
|-------------|---------------------|---------------------|-----------------|----------|
| **MSMARCO** | 0.396               | 0.390               | 0.397           | 0.187    |
|
|
|
|
|
### Multilingual Benchmarks |
|
|
|
#### MIRACL
|
|
|
| **NDCG@10** | **jina-colbert-v2** | **mDPR (zero-shot)** |
|-------------|---------------------|----------------------|
| **avg**     | 0.627               | 0.427                |
| **ar**      | 0.753               | 0.499                |
| **bn**      | 0.750               | 0.443                |
| **de**      | 0.504               | 0.490                |
| **es**      | 0.538               | 0.478                |
| **en**      | 0.570               | 0.394                |
| **fa**      | 0.563               | 0.480                |
| **fi**      | 0.740               | 0.472                |
| **fr**      | 0.541               | 0.435                |
| **hi**      | 0.600               | 0.383                |
| **id**      | 0.547               | 0.272                |
| **ja**      | 0.632               | 0.439                |
| **ko**      | 0.671               | 0.419                |
| **ru**      | 0.643               | 0.407                |
| **sw**      | 0.499               | 0.299                |
| **te**      | 0.742               | 0.356                |
| **th**      | 0.772               | 0.358                |
| **yo**      | 0.623               | 0.396                |
| **zh**      | 0.523               | 0.512                |
|
|
|
#### mMARCO |
|
|
|
| **MRR@10** | **jina-colbert-v2** | **BM25** | **ColBERT-XM** |
|------------|---------------------|----------|----------------|
| **avg**    | 0.313               | 0.141    | 0.254          |
| **ar**     | 0.272               | 0.111    | 0.195          |
| **de**     | 0.331               | 0.136    | 0.270          |
| **nl**     | 0.330               | 0.140    | 0.275          |
| **es**     | 0.341               | 0.158    | 0.285          |
| **fr**     | 0.335               | 0.155    | 0.269          |
| **hi**     | 0.309               | 0.134    | 0.238          |
| **id**     | 0.319               | 0.149    | 0.263          |
| **it**     | 0.337               | 0.153    | 0.265          |
| **ja**     | 0.276               | 0.141    | 0.241          |
| **pt**     | 0.337               | 0.152    | 0.276          |
| **ru**     | 0.298               | 0.124    | 0.251          |
| **vi**     | 0.287               | 0.136    | 0.226          |
| **zh**     | 0.302               |          | 0.246          |
|
|
|
|
|
|
|
### Matryoshka Representation Benchmarks |
|
|
|
#### BEIR |
|
|
|
| **NDCG@10**    | **dim=128** | **dim=96** | **dim=64** |
|----------------|-------------|------------|------------|
| **avg**        | 0.599       | 0.591      | 0.589      |
| **nfcorpus**   | 0.346       | 0.340      | 0.347      |
| **fiqa**       | 0.408       | 0.404      | 0.404      |
| **trec-covid** | 0.834       | 0.808      | 0.805      |
| **hotpotqa**   | 0.766       | 0.764      | 0.756      |
| **nq**         | 0.640       | 0.640      | 0.635      |
|
|
|
|
|
#### MSMARCO |
|
|
|
| **MRR@10**  | **dim=128** | **dim=96** | **dim=64** |
|-------------|-------------|------------|------------|
| **msmarco** | 0.396       | 0.391      | 0.388      |
|
|
|
## Other Models |
|
|
|
Additionally, we provide the following embedding models, which you can also use for retrieval.
|
|
|
- [`jina-embeddings-v2-base-en`](https://huggingface.co/jinaai/jina-embeddings-v2-base-en): 137 million parameters. |
|
- [`jina-embeddings-v2-base-zh`](https://huggingface.co/jinaai/jina-embeddings-v2-base-zh): 161 million parameters, Chinese-English bilingual model.

- [`jina-embeddings-v2-base-de`](https://huggingface.co/jinaai/jina-embeddings-v2-base-de): 161 million parameters, German-English bilingual model.

- [`jina-embeddings-v2-base-es`](https://huggingface.co/jinaai/jina-embeddings-v2-base-es): 161 million parameters, Spanish-English bilingual model.
|
|
|
## Contact |
|
|
|
Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas. |