am-azadi committed · verified
Commit 8ea63ae · 1 Parent(s): 29b4edc

Upload folder using huggingface_hub

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 1024,
+   "pooling_mode_cls_token": true,
+   "pooling_mode_mean_tokens": false,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
README.md ADDED
@@ -0,0 +1,398 @@
+ ---
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:25743
+ - loss:MultipleNegativesRankingLoss
+ base_model: WhereIsAI/UAE-Large-V1
+ widget:
+ - source_sentence: 2:00 PM Facebook ... 0.0KB/sill Arief Smansa Fadhillah Jun 9 at
+     9:44 am. 89 111 60 = If in the next 2 weeks the people America who violates PSBB
+     will not happen corpses scattered on the streets, then I sure that the fear of
+     Corona is just a scam created by WHO and in Support by Mass Media.
+   sentences:
+   - MPs are entitled to a full pension after six months in office
+   - Photos of anti-racism demonstrations in the United States
+   - Wisconsin has more votes cast than registered voters.
+ - source_sentence: A religious festival in Jaffna... Radical Otulabban, who opposes
+     the ordination of children, has nothing to do with this...
+   sentences:
+   - A genuine article on Olympic female weightlifter suffering testicle injury?
+   - This video shows pilots demonstrating against Covid vaccines
+   - Photo shows distressed children at a religious ritual in Sri Lanka
+ - source_sentence: '← 42 CHANNEL Markus Hain... * 107.4K subscribers Pinned message
+     If you like my work for our freedom... 74% 22:32 KANAL Markus Haintz, Lawyer &
+     Fre... forwarded message By Vicky_TheRedSparrow BREAKING NEWS: The Supreme Court
+     of Justice in the United States decided that the Covid vaccination no vaccine
+     is unsafe and um must be avoided at all costs - Big Pharma and Anthony Fauci have
+     lost a lawsuit by Robert F. Kennedy Jr. and a group of scientists has been submitted! /breaking-news-the-supreme-court
+     -in-the-us-has-ruled-that-the-covid -pathogen-is-not-a-vaccine-is-unsafe -and-must-be-avoided-at-all-costs-big
+     -pharma-and-anthony-fauci-have-lost -a-lawsuit-filed-by-r/ Truth To Power BREAKING
+     NEWS: The Supreme Court In The US Has Ruled That The Covid Dathanen in Distress
+     & Vanaina la Llunafn MUTE OFF X 138'
+   sentences:
+   - 'USA: Supreme Court rules against corona vaccinations'
+   - Pakistani government appoints former army general to head medical regulatory body
+   - '"In Denmark, the law obliges owners of large agricultural land to plant 5% of
+     their land flowers for bees. In Portugal?"'
+ - source_sentence: MEXICO, Failed extortion in Celaya… and he came back to throw a
+     grenade ….
+   sentences:
+   - Attack on people in a cafe in Celaya, Mexico
+   - UNICEF issued guidelines for the prevention of coronavirus infections
+   - Image shows a road in Sri Lanka
+ - source_sentence: The ELN movement supported with 80 thousand dollars! That is little
+     money. What's wrong with that? For us, nor the FARC nor the ELN they are groups
+     terrorists ” revores Arauz PRISI ANDRES ARAUZLela campaign with funds from drug
+     traffickers and terrorists
+   sentences:
+   - Andrés Arauz said that he accepted financing from the ELN and that neither the
+     ELN nor the FARC are armed groups
+   - Holy communion banned in Toronto
+   - Myanmar leader gives three-fingered salute in support of Thai protesters?
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ ---
+
+ # SentenceTransformer based on WhereIsAI/UAE-Large-V1
+
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [WhereIsAI/UAE-Large-V1](https://huggingface.co/WhereIsAI/UAE-Large-V1). It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [WhereIsAI/UAE-Large-V1](https://huggingface.co/WhereIsAI/UAE-Large-V1) <!-- at revision f4264cd240f4e46a527f9f57a70cda6c2a12d248 -->
+ - **Maximum Sequence Length:** 512 tokens
+ - **Output Dimensionality:** 1024 dimensions
+ - **Similarity Function:** Cosine Similarity
+ <!-- - **Training Dataset:** Unknown -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+
+ ### Full Model Architecture
+
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
+   (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+ )
+ ```
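+
+ That is, the encoder is a 24-layer BERT and the sentence embedding is the `[CLS]` token vector (no mean pooling and no normalization module). As a hedged illustration, not code shipped in this repo, the same embedding can be reproduced with plain `transformers`; the `model_id` below points at the base model as a stand-in, so swap in this repo's id once published:
+
+ ```python
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ model_id = "WhereIsAI/UAE-Large-V1"  # assumption: use this fine-tuned repo's id instead
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModel.from_pretrained(model_id)
+
+ batch = tokenizer(["Holy communion banned in Toronto"], padding=True,
+                   truncation=True, max_length=512, return_tensors="pt")
+ with torch.no_grad():
+     # last_hidden_state: (batch, seq_len, 1024); CLS pooling keeps position 0
+     embedding = model(**batch).last_hidden_state[:, 0]
+ print(embedding.shape)  # torch.Size([1, 1024])
+ ```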
+
+ ## Usage
+
+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("sentence_transformers_model_id")
+ # Run inference
+ sentences = [
+     "The ELN movement supported with 80 thousand dollars! That is little money. What's wrong with that? For us, nor the FARC nor the ELN they are groups terrorists ” revores Arauz PRISI ANDRES ARAUZLela campaign with funds from drug traffickers and terrorists",
+     'Andrés Arauz said that he accepted financing from the ELN and that neither the ELN nor the FARC are armed groups',
+     'Holy communion banned in Toronto',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 1024]
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # [3, 3]
+ ```
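+
+ The widget examples above pair noisy social-media posts with fact-check claims, so a natural downstream use is ranking candidate claims for a new post. A minimal sketch reusing the model loaded above; the post and claims are sample strings from this card, not a benchmark:
+
+ ```python
+ claims = [
+     "Attack on people in a cafe in Celaya, Mexico",
+     "Holy communion banned in Toronto",
+ ]
+ post = "MEXICO, Failed extortion in Celaya… and he came back to throw a grenade …."
+
+ # Encode the query and candidates, then rank candidates by cosine similarity
+ scores = model.similarity(model.encode([post]), model.encode(claims))  # shape [1, 2]
+ print(claims[int(scores.argmax())])  # the Celaya claim should rank highest
+ ```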
+
+ <!--
+ ### Direct Usage (Transformers)
+
+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Dataset
+
+ #### Unnamed Dataset
+
+ * Size: 25,743 training samples
+ * Columns: <code>sentence_0</code>, <code>sentence_1</code>, and <code>label</code>
+ * Approximate statistics based on the first 1000 samples:
+   |         | sentence_0 | sentence_1 | label |
+   |:--------|:-----------|:-----------|:------|
+   | type    | string     | string     | float |
+   | details | <ul><li>min: 2 tokens</li><li>mean: 109.01 tokens</li><li>max: 512 tokens</li></ul> | <ul><li>min: 5 tokens</li><li>mean: 18.19 tokens</li><li>max: 131 tokens</li></ul> | <ul><li>min: 1.0</li><li>mean: 1.0</li><li>max: 1.0</li></ul> |
+ * Samples:
+   | sentence_0 | sentence_1 | label |
+   |:-----------|:-----------|:------|
+   | <code>In the coming weeks and months, You will see the bananas with more pints of normal, due to the effect of the ashes of the volcano! Don't stop buying them! It only affects the image not the taste! Crops need to be harvested so that the banana trees can come out ahead! alamy a a alam alamy</code> | <code>Canary bananas are going to have more spots than normal due to the effect of the ashes of the volcano</code> | <code>1.0</code> |
+   | <code>Are they canceling Title of those who are over 70 years old!? Negative certificate Electoral registry office, says I owe nothing. But at the bottom of the page. it says "unsubscribed"! Over 70s must check that everything is in order with their title. Millions of retirees can vote for Bolsonaro.</code> | <code>Population over 70 is having the voter registration canceled in 2022</code> | <code>1.0</code> |
+   | <code>VIN dti PHILIPPINES FDA APPROVED Honey-C H52% 18:43 itine Appemess Vinity Resistance Bus KONTRA CORONA VIRUS Let's boost our immune system!</code> | <code>Government-approved immunity booster for COVID-19 sold online</code> | <code>1.0</code> |
+ * Loss: [<code>MultipleNegativesRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativesrankingloss) with these parameters:
+   ```json
+   {
+       "scale": 20.0,
+       "similarity_fct": "cos_sim"
+   }
+   ```
+
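+ For reference, a hedged sketch of how a comparable run could be set up with the Sentence Transformers trainer API, using the loss above and the non-default hyperparameters listed in the next section. The two-pair dataset is a stand-in for the actual 25,743 training pairs, which are not included here:
+
+ ```python
+ from datasets import Dataset
+ from sentence_transformers import (SentenceTransformer, SentenceTransformerTrainer,
+                                    SentenceTransformerTrainingArguments)
+ from sentence_transformers.losses import MultipleNegativesRankingLoss
+
+ model = SentenceTransformer("WhereIsAI/UAE-Large-V1")
+
+ # Stand-in (sentence_0, sentence_1) pairs; in-batch negatives come from the other rows
+ train_dataset = Dataset.from_dict({
+     "sentence_0": ["post text A", "post text B"],
+     "sentence_1": ["matching claim A", "matching claim B"],
+ })
+
+ loss = MultipleNegativesRankingLoss(model)  # defaults: scale=20.0, cos_sim
+ args = SentenceTransformerTrainingArguments(
+     output_dir="uae-large-v1-finetuned",  # hypothetical output path
+     per_device_train_batch_size=2,
+     num_train_epochs=1,
+ )
+ SentenceTransformerTrainer(model=model, args=args,
+                            train_dataset=train_dataset, loss=loss).train()
+ ```
+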
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `per_device_train_batch_size`: 2
+ - `per_device_eval_batch_size`: 2
+ - `num_train_epochs`: 1
+ - `multi_dataset_batch_sampler`: round_robin
+
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: no
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 2
+ - `per_device_eval_batch_size`: 2
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 5e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1
+ - `num_train_epochs`: 1
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.0
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`:
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `dispatch_batches`: None
+ - `split_batches`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: round_robin
+
+ </details>
+
+ ### Training Logs
+ | Epoch  | Step  | Training Loss |
+ |:------:|:-----:|:-------------:|
+ | 0.0388 | 500   | 0.0473        |
+ | 0.0777 | 1000  | 0.0264        |
+ | 0.1165 | 1500  | 0.0258        |
+ | 0.1554 | 2000  | 0.0322        |
+ | 0.1942 | 2500  | 0.0225        |
+ | 0.2331 | 3000  | 0.0318        |
+ | 0.2719 | 3500  | 0.036         |
+ | 0.3108 | 4000  | 0.0254        |
+ | 0.3496 | 4500  | 0.0166        |
+ | 0.3884 | 5000  | 0.0231        |
+ | 0.4273 | 5500  | 0.0268        |
+ | 0.4661 | 6000  | 0.0293        |
+ | 0.5050 | 6500  | 0.0315        |
+ | 0.5438 | 7000  | 0.0292        |
+ | 0.5827 | 7500  | 0.0308        |
+ | 0.6215 | 8000  | 0.0206        |
+ | 0.6603 | 8500  | 0.0329        |
+ | 0.6992 | 9000  | 0.0379        |
+ | 0.7380 | 9500  | 0.0133        |
+ | 0.7769 | 10000 | 0.0255        |
+ | 0.8157 | 10500 | 0.0138        |
+ | 0.8546 | 11000 | 0.0414        |
+ | 0.8934 | 11500 | 0.015         |
+ | 0.9323 | 12000 | 0.0234        |
+ | 0.9711 | 12500 | 0.0274        |
+
+
+ ### Framework Versions
+ - Python: 3.11.11
+ - Sentence Transformers: 3.4.1
+ - Transformers: 4.48.3
+ - PyTorch: 2.5.1+cu124
+ - Accelerate: 1.3.0
+ - Datasets: 3.3.1
+ - Tokenizers: 0.21.0
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "_name_or_path": "WhereIsAI/UAE-Large-V1",
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.48.3",
+   "type_vocab_size": 2,
+   "use_cache": false,
+   "vocab_size": 30522
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.4.1",
+     "transformers": "4.48.3",
+     "pytorch": "2.5.1+cu124"
+   },
+   "prompts": {},
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:335fd2a740c13b5f6e729edbe01a092079dd9ec631f0f47bdff024235168e2ff
+ size 1340612432
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
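
Together with `1_Pooling/config.json` above, this module list is what `SentenceTransformer` reassembles at load time: a Transformer module at the repo root and a CLS-pooling module under `1_Pooling`. As a hedged sketch (not code shipped in this repo), the equivalent manual composition with the sentence-transformers `models` API would look like:

```python
from sentence_transformers import SentenceTransformer, models

# Module 0: BERT encoder (base model id used as a stand-in for this repo)
word = models.Transformer("WhereIsAI/UAE-Large-V1", max_seq_length=512)
# Module 1: CLS-token pooling, matching 1_Pooling/config.json
pool = models.Pooling(word.get_word_embedding_dimension(), pooling_mode="cls")

model = SentenceTransformer(modules=[word, pool])
print(model)  # mirrors the "Full Model Architecture" section of the README
```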
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff