---
license: apache-2.0
language:
- en
inference: false
---
<br><br>
<p align="center">
<img src="https://github.com/jina-ai/finetuner/blob/main/docs/_static/finetuner-logo-ani.svg?raw=true" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
</p>
<p align="center">
<b>The text embedding suite trained by Jina AI's Finetuner team.</b>
</p>
## Intended Usage & Model Info
`jina-embedding-s-en-v1` is a language model trained on Jina AI's Linnaeus-Clean dataset.
This dataset consists of 380 million sentence pairs, including query-document pairs,
drawn from a variety of domains and selected through a thorough cleaning process.
The Linnaeus-Full dataset, from which Linnaeus-Clean is derived, originally contained 1.6 billion sentence pairs.
The model supports a range of use cases, including information retrieval, semantic textual similarity, and text reranking.
With a compact size of just 35 million parameters,
it enables lightning-fast inference while still delivering impressive performance.
Additionally, we provide the following options:
- `jina-embedding-s-en-v1`: 35 million parameters **(you are here)**.
- `jina-embedding-b-en-v1`: 110 million parameters.
- `jina-embedding-l-en-v1`: 330 million parameters.
- `jina-embedding-1b-en-v1`: 1.2 billion parameters, 10x BERT-base size (coming soon).
- `jina-embedding-6b-en-v1`: 6 billion parameters, 30x BERT-base size (coming soon).
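
As a quick illustration of the semantic textual similarity use case, here is a minimal sketch that scores a sentence pair with cosine similarity, reusing the `encode` call from the Usage section below (the example sentences are placeholders):

```python
# Minimal sketch: semantic textual similarity via cosine similarity.
# Assumes `finetuner[text]` is installed (see Usage below); the sentences
# are placeholder examples, not from the Linnaeus datasets.
import numpy as np
import finetuner

model = finetuner.get_model('jinaai/jina-embedding-s-en-v1')
embeddings = model.encode([
    'How is the weather today?',
    'What is the current weather like today?',
])

# Cosine similarity between the two sentence embeddings
a, b = embeddings[0], embeddings[1]
cos_sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f'cosine similarity: {cos_sim:.3f}')  # closer to 1.0 = more similar
```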
## Data & Parameters
More information will be released together with the technical report.
## Metrics
We compared the model against `all-minilm-l6-v2`/`all-mpnet-base-v2` from SBERT and `text-embedding-ada-002` from OpenAI:

|Model|Parameters|Context length|
|------------------------------|-----|------|
|all-minilm-l6-v2|33M|128|
|all-mpnet-base-v2|110M|128|
|text-embedding-ada-002|unknown (API-based)|8192|
|jina-embedding-s-en-v1|35M|512|
|jina-embedding-b-en-v1|110M|512|
|jina-embedding-l-en-v1|330M|512|

Scores on a selection of semantic textual similarity (STS) and retrieval tasks (higher is better):

|Model|STS12|STS13|STS14|STS15|STS16|STS17|TRECOVID|Quora|SciFact|
|------------------------------|-----|-----|-----|-----|-----|-----|--------|-----|-----|
|all-minilm-l6-v2|0.724|0.806|0.756|0.854|0.790|0.876|0.473|0.876|0.645|
|all-mpnet-base-v2|0.726|0.835|0.780|0.857|0.800|0.906|0.513|0.875|0.656|
|text-embedding-ada-002|0.698|0.833|0.761|0.861|0.860|0.903|0.685|0.876|0.726|
|jina-embedding-s-en-v1|0.738|0.781|0.732|0.833|0.785|0.859|0.471|0.852|0.567|
|jina-embedding-b-en-v1|0.736|0.804|0.745|0.844|0.793|0.873|0.481|0.870|0.616|
|jina-embedding-l-en-v1|0.735|0.829|0.759|0.844|0.800|0.888|0.465|0.876|0.645|
For more tasks and metrics, please check out the [MTEB](https://huggingface.co/spaces/mteb/leaderboard) benchmark.
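Individual scores can in principle be reproduced with the MTEB toolkit. The sketch below is an assumption-laden example (it presumes `mteb` and `sentence-transformers` are installed and that this checkpoint loads through `sentence-transformers`), not an official evaluation script:

```python
# Hedged sketch: running one MTEB task locally.
# Assumptions: pip install mteb sentence-transformers, and the checkpoint
# is sentence-transformers compatible (not confirmed by this card).
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('jinaai/jina-embedding-s-en-v1')
evaluation = MTEB(tasks=['STS12'])  # pick any task from the leaderboard
evaluation.run(model, output_folder='results/jina-embedding-s-en-v1')
```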
## Usage [WIP]
```python
# Install the dependency first: pip install "finetuner[text]"
import finetuner

# Load the pre-trained embedding model from the Hugging Face Hub
model = finetuner.get_model('jinaai/jina-embedding-s-en-v1')

# Encode a list of sentences into embedding vectors
embeddings = model.encode(['sentence 1', 'sentence 2'])
```
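If you already work with `sentence-transformers`, the checkpoint may also load directly through it; this is a hedged sketch that assumes the repository ships a sentence-transformers configuration:

```python
from sentence_transformers import SentenceTransformer

# Assumption: the Hub repository is sentence-transformers compatible.
model = SentenceTransformer('jinaai/jina-embedding-s-en-v1')
embeddings = model.encode(['sentence 1', 'sentence 2'])
```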
## Fine-tuning [WIP]
Please consider using [Finetuner](https://github.com/jina-ai/finetuner) to fine-tune this model on your own data.
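As a rough, assumption-laden sketch of what a Finetuner run could look like (argument names follow the public Finetuner docs; the CSV path and loss choice are placeholders, not recommendations from this card):

```python
import finetuner

finetuner.login()  # Finetuner executes fine-tuning runs in the cloud

# Placeholder data path and loss; consult the Finetuner documentation
# for the authoritative `fit` signature and supported options.
run = finetuner.fit(
    model='jinaai/jina-embedding-s-en-v1',
    train_data='my-train-data.csv',
    loss='TripletMarginLoss',
)
print(run.name)  # handle for monitoring the run
```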