michaelfeil committed on
Commit 3614af0 · 1 Parent(s): a4d1c67

upload model with 1024

1_Pooling/config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false
+ }
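
This is the standard sentence-transformers pooling configuration: mean pooling over 768-dimensional token embeddings. Below is a minimal sketch of the equivalent module built by hand, assuming the `sentence-transformers` package; the keyword arguments mirror the keys in the JSON above.

```python
from sentence_transformers.models import Pooling

# Mean pooling over 768-dimensional token embeddings, mirroring 1_Pooling/config.json.
pooling = Pooling(
    word_embedding_dimension=768,
    pooling_mode_cls_token=False,
    pooling_mode_mean_tokens=True,
    pooling_mode_max_tokens=False,
    pooling_mode_mean_sqrt_len_tokens=False,
)
print(pooling.get_pooling_mode_str())  # "mean"
```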
README.md ADDED
@@ -0,0 +1,216 @@
+ ---
+ tags:
+ - sentence-transformers
+ - feature-extraction
+ - sentence-similarity
+ - mteb
+ - transformers
+ - transformers.js
+ datasets:
+ - allenai/c4
+ language: en
+ inference: false
+ license: apache-2.0
+ ---
+ <!-- TODO: add evaluation results here -->
+ <br><br>
+
+ <p align="center">
+ <img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
+ </p>
+
+ <p align="center">
+ <b>The text embedding set trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
+ </p>
+
+ ## Quick Start
+
+ The easiest way to start using `jina-embeddings-v2-base-code` is to use Jina AI's [Embedding API](https://jina.ai/embeddings/).
+
+ ## Intended Usage & Model Info
33
+
34
+ `jina-embeddings-v2-base-code` is an multilingual **embedding model** speaks **English and 30 widely used programming languages**.
35
+ Same as other jina-embeddings-v2 series, it supports **8192** sequence length.
36
+
37
+ `jina-embeddings-v2-base-code` is based on a Bert architecture (JinaBert) that supports the symmetric bidirectional variant of [ALiBi](https://arxiv.org/abs/2108.12409) to allow longer sequence length.
38
+ The backbone `jina-bert-v2-base-code` is pretrained on the [github-code](https://huggingface.co/datasets/codeparrot/github-code) dataset.
39
+ The model is further trained on Jina AI's collection of more than 150 millions of coding question answer and docstring source code pairs.
40
+ These pairs were obtained from various domains and were carefully selected through a thorough cleaning process.
41
+
42
+ The embedding model was trained using 512 sequence length, but extrapolates to 8k sequence length (or even longer) thanks to ALiBi.
43
+ This makes our model useful for a range of use cases, especially when processing long documents is needed, including technical question answering and code search.
44
+
45
+ This model has 161 million parameters, which enables fast and memory efficient inference, while delivering impressive performance.
46
+ Additionally, we provide the following embedding models:
47
+
48
+ - [`jina-embeddings-v2-small-en`](https://huggingface.co/jinaai/jina-embeddings-v2-small-en): 33 million parameters.
49
+ - [`jina-embeddings-v2-base-en`](https://huggingface.co/jinaai/jina-embeddings-v2-base-en): 137 million parameters.
50
+ - [`jina-embeddings-v2-base-zh`](https://huggingface.co/jinaai/jina-embeddings-v2-base-zh): Chinese-English Bilingual embeddings.
51
+ - [`jina-embeddings-v2-base-de`](https://huggingface.co/jinaai/jina-embeddings-v2-base-de): German-English Bilingual embeddings.
52
+ - [`jina-embeddings-v2-base-es`](https://huggingface.co/jinaai/jina-embeddings-v2-base-es): Spanish-English Bilingual embeddings (soon).
53
+ - [`jina-embeddings-v2-base-code`](https://huggingface.co/jinaai/jina-embeddings-v2-base-code): 161 million parameters code embeddings.
54
+
55
+ **<details><summary>Supported (Programming) Languages</summary>**
56
+ <p>
57
+
58
+ - English
59
+ - Assembly
60
+ - Batchfile
61
+ - C
62
+ - C#
63
+ - C++
64
+ - CMake
65
+ - CSS
66
+ - Dockerfile
67
+ - FORTRAN
68
+ - GO
69
+ - Haskell
70
+ - HTML
71
+ - Java
72
+ - JavaScript
73
+ - Julia
74
+ - Lua
75
+ - Makefile
76
+ - Markdown
77
+ - PHP
78
+ - Perl
79
+ - PowerShell
80
+ - Python
81
+ - Ruby
82
+ - Rust
83
+ - SQL
84
+ - Scala
85
+ - Shell
86
+ - TypeScript
87
+ - TeX
88
+ - Visual Basic
89
+ </p>
90
+ </details>
91
+
92
+ ## Data & Parameters
93
+
94
+ Jina Embeddings V2 [technical report](https://arxiv.org/abs/2310.19923)
95
+
96
+ ## Usage
97
+
98
+ **<details><summary>Please apply mean pooling when integrating the model.</summary>**
99
+ <p>
100
+
101
+ ### Why mean pooling?
102
+
103
+ `mean poooling` takes all token embeddings from model output and averaging them at sentence/paragraph level.
104
+ It has been proved to be the most effective way to produce high-quality sentence embeddings.
105
+ We offer an `encode` function to deal with this.
106
+
107
+ However, if you would like to do it without using the default `encode` function:
108
+
109
+ ```python
110
+ import torch
111
+ import torch.nn.functional as F
112
+ from transformers import AutoTokenizer, AutoModel
113
+
114
+ def mean_pooling(model_output, attention_mask):
115
+ token_embeddings = model_output[0]
116
+ input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
117
+ return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
118
+
119
+ sentences = [
120
+ 'How do I access the index while iterating over a sequence with a for loop?',
121
+ '# Use the built-in enumerator\nfor idx, x in enumerate(xs):\n print(idx, x)',
122
+ ]
123
+
124
+ tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-code')
125
+ model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-code', trust_remote_code=True)
126
+
127
+ encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
128
+
129
+ with torch.no_grad():
130
+ model_output = model(**encoded_input)
131
+
132
+ embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
133
+ embeddings = F.normalize(embeddings, p=2, dim=1)
134
+ ```
135
+
136
+ </p>
137
+ </details>
138
+
+ You can use Jina Embedding models directly from the `transformers` package:
+ ```python
+ !pip install transformers
+ from transformers import AutoModel
+ from numpy.linalg import norm
+
+ cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))
+ model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-code', trust_remote_code=True)
+ embeddings = model.encode(
+     [
+         'How do I access the index while iterating over a sequence with a for loop?',
+         '# Use the built-in enumerator\nfor idx, x in enumerate(xs):\n print(idx, x)',
+     ]
+ )
+ print(cos_sim(embeddings[0], embeddings[1]))
+ >>> tensor([[0.7282]])
+ ```
+
+ If you only want to handle shorter sequences, such as 2k tokens, pass the `max_length` parameter to the `encode` function:
+
+ ```python
+ embeddings = model.encode(
+     ['Very long ... code'],
+     max_length=2048
+ )
+ ```
+
+ As of its latest release (v2.3.0), sentence-transformers also supports Jina embeddings (please make sure that you are logged in to Hugging Face as well):
+
+ ```python
+ !pip install -U sentence-transformers
+ from sentence_transformers import SentenceTransformer
+ from sentence_transformers.util import cos_sim
+
+ model = SentenceTransformer(
+     "jinaai/jina-embeddings-v2-base-code",
+     trust_remote_code=True
+ )
+
+ # control your input sequence length up to 8192
+ model.max_seq_length = 1024
+
+ embeddings = model.encode([
+     'How do I access the index while iterating over a sequence with a for loop?',
+     '# Use the built-in enumerator\nfor idx, x in enumerate(xs):\n print(idx, x)',
+ ])
+ print(cos_sim(embeddings[0], embeddings[1]))
+ ```
+
+ You can also use the [Transformers.js](https://huggingface.co/docs/transformers.js) library to compute embeddings in JavaScript.
+ ```js
+ // npm i @xenova/transformers
+ import { pipeline, cos_sim } from '@xenova/transformers';
+
+ const extractor = await pipeline('feature-extraction', 'jinaai/jina-embeddings-v2-base-code', {
+     quantized: false, // Comment out this line to use the 8-bit quantized version
+ });
+
+ const texts = [
+     'How do I access the index while iterating over a sequence with a for loop?',
+     '# Use the built-in enumerator\nfor idx, x in enumerate(xs):\n print(idx, x)',
+ ];
+ const embeddings = await extractor(texts, { pooling: 'mean' });
+
+ const score = cos_sim(embeddings[0].data, embeddings[1].data);
+ console.log(score);
+ // 0.7281748759529421
+ ```
+
+ ## Plans
+
+ 1. Bilingual embedding models supporting more European & Asian languages, including Spanish, French, Italian and Japanese.
+ 2. Multimodal embedding models to enable multimodal RAG applications.
+ 3. High-performance rerankers.
+
+ ## Contact
+
+ Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.
config.json ADDED
@@ -0,0 +1,36 @@
+ {
+   "_name_or_path": "jinaai/jina-bert-v2-qk-post-norm",
+   "architectures": [
+     "JinaBertForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.0,
+   "attn_implementation": "torch",
+   "auto_map": {
+     "AutoConfig": "jinaai/jina-bert-v2-qk-post-norm--configuration_bert.JinaBertConfig",
+     "AutoModel": "jinaai/jina-bert-v2-qk-post-norm--modeling_bert.JinaBertModel",
+     "AutoModelForMaskedLM": "jinaai/jina-bert-v2-qk-post-norm--modeling_bert.JinaBertForMaskedLM",
+     "AutoModelForSequenceClassification": "jinaai/jina-bert-v2-qk-post-norm--modeling_bert.JinaBertForSequenceClassification"
+   },
+   "classifier_dropout": null,
+   "emb_pooler": "mean",
+   "feed_forward_type": "geglu",
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.0,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 8192,
+   "model_max_length": 1024,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "alibi",
+   "torch_dtype": "float16",
+   "transformers_version": "4.35.2",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 61056
+ }
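
For quick reference, the long-context settings above can be inspected programmatically; a minimal sketch, assuming Hub access and that `trust_remote_code=True` is acceptable (the custom JinaBert config class is resolved remotely, as in the README examples):

```python
from transformers import AutoConfig

# Small illustrative check of the values in the config above; the JinaBert config class
# is fetched from the Hub, hence trust_remote_code=True.
config = AutoConfig.from_pretrained("jinaai/jina-embeddings-v2-base-code", trust_remote_code=True)
print(config.position_embedding_type)                 # "alibi"
print(config.max_position_embeddings)                 # 8192
print(getattr(config, "model_max_length", None))      # 1024
```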
generation_config.json ADDED
@@ -0,0 +1,5 @@
+ {
+   "_from_model_config": true,
+   "pad_token_id": 0,
+   "transformers_version": "4.31.0"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8b53bfd4ae2cd586004a6ca4a16551b630a2a1b1d655ff1ee9be1286a1781c5b
+ size 321767312
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
onnx/model.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:63363fc178428b74620c6f3780cbc7191883fa5c7f84c0945c45eb5c4256733b
+ size 641517466
onnx/model_fp16.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1aafc4fcd63d2e6899e88402ff731e7c646c2e435048294a3cbc908a40d45d7c
+ size 321072580
onnx/model_quantized.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:ed45870251c9f0cf656e78aab0d37a23489066df8a222bb1c8caf8a45f2cb16d
+ size 161895621
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1c28fb0a8bc930d79b2b29091674a8a0ce0e983489e88b0e863efb1ad4444b01
+ size 321787514
sentence_bert_config.json ADDED
@@ -0,0 +1,5 @@
+ {
+   "max_seq_length": 8192,
+   "do_lower_case": false,
+   "model_args": {"trust_remote_code": true}
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "bos_token": "<s>",
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "unk_token": "<unk>"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,22 @@
+ {
+   "add_prefix_space": false,
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "errors": "replace",
+   "mask_token": {
+     "__type": "AddedToken",
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "model_max_length": 8192,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "tokenizer_class": "RobertaTokenizer",
+   "trim_offsets": true,
+   "unk_token": "<unk>"
+ }
train_results.json ADDED
@@ -0,0 +1,8 @@
+ {
+   "epoch": 1.0,
+   "train_loss": 0.8820991159725189,
+   "train_runtime": 81002.545,
+   "train_samples": 100000,
+   "train_samples_per_second": 1264.158,
+   "train_steps_per_second": 1.235
+ }
trainer_state.json ADDED
The diff for this file is too large to render. See raw diff
 
vocab.json ADDED
The diff for this file is too large to render. See raw diff