sulcan committed
Commit 5a78253 · verified · 1 Parent(s): 9dfc7a3

Upload 11 files


Uncased model trained for 4 epochs on PA_ARXIV, PA_BOOKS and PA_JACOW, with all equations, tables, special symbols and numbers removed.

The preprocessing considerably improves results: the model is now roughly on par with general-purpose sentence-transformers models overall, with minor improvements on tokens specific to the particle accelerator (PA) community, such as BPM.
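As a rough illustration of the cleaning step described above, here is a minimal Python sketch. The real logic lives in the prepare_mmd_eqations_and_tables_for_simcse function in PA_LOGBOOKS/code/mmd.py, which is not shown on this page; the regex patterns below are assumptions, not the actual implementation.

```python
import re

def strip_mmd_artifacts(text: str) -> str:
    """Sketch of removing equations, tables, headings, numbers and
    special symbols from MMD-converted text before training."""
    # Drop display equations (\[...\]) and inline math ($...$).
    text = re.sub(r"\\\[.*?\\\]", " ", text, flags=re.DOTALL)
    text = re.sub(r"\$[^$]+\$", " ", text)
    # Drop markdown table rows (pipe-delimited lines).
    text = re.sub(r"^\|.*$", " ", text, flags=re.MULTILINE)
    # Drop MMD headings (lines starting with '#').
    text = re.sub(r"^#+ .*$", " ", text, flags=re.MULTILINE)
    # Drop numbers, then any remaining special symbols.
    text = re.sub(r"\d+(\.\d+)?", " ", text)
    text = re.sub(r"[^a-zA-Z\s.,;:'()-]", " ", text)
    # Normalize whitespace.
    return re.sub(r"\s+", " ", text).strip()

print(strip_mmd_artifacts("## Optics\nThe BPM at $s_0$ read 3.2 mm."))
# -> "The BPM at read mm."
```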

Files changed (2)
  1. README.md +5 -15
  2. model.safetensors +1 -1
README.md CHANGED

@@ -9,16 +9,11 @@ tags:
 
 ---
 
-# PACuna Embedding: Fine-Tuned Embedding for Particle Accelerator Science
+# {MODEL_NAME}
 
 This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
 
-This model is a fine-tuned word embedding model optimized for applications in particle accelerator science.
-It was trained on a large corpus of scientific literature and papers related to particle accelerators.
-
-This fine-tuned embedding can be used as input to downstream natural language processing tasks relevant to particle accelerator research and operations,
-such as information retrieval from logbooks.
-
+<!--- Describe your model here -->
 
 ## Usage (Sentence-Transformers)
 
@@ -89,12 +84,9 @@ For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*
 ## Training
 The model was trained with the parameters:
 
-### Dataset
-The dataset used is PA_JACOW+PA_BOOKS+PA_ARXIV. Equations, tables, MMD headings (\#), numbers and any special symbols were removed from the training input data (see the prepare_mmd_eqations_and_tables_for_simcse function in PA_LOGBOOKS/code/mmd.py).
-
 **DataLoader**:
 
-`torch.utils.data.dataloader.DataLoader` of length 28836 with parameters:
+`torch.utils.data.dataloader.DataLoader` of length 25444 with parameters:
 ```
 {'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
 ```
@@ -106,8 +98,6 @@ The dataset used is PA_JACOW+PA_BOOKS+PA_ARXIV. Equations, tables, MMD headings
 {'scale': 20.0, 'similarity_fct': 'cos_sim'}
 ```
 
-The scaling parameter follows the paper's suggested temperature of 0.05, i.e. cos_sim(a, b) / 0.05, which gives scale = 1/0.05 = 20.
-
 Parameters of the fit()-Method:
 ```
 {
@@ -121,8 +111,8 @@ Parameters of the fit()-Method:
 },
 "scheduler": "WarmupLinear",
 "steps_per_epoch": null,
-"warmup_steps": 46137,
-"weight_decay": 0.0
+"warmup_steps": 0.0,
+"weight_decay": 0.01
 }
 ```
model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:dcb1c51c373a276e30ef690b58943419621c9b9d5ee65e9c1cc38a5430f0917d
+oid sha256:ff2d9a7b55a2a8465c79a0f39a49e753080836960026685a3abdd8fbeb16fa25
 size 439776096
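Putting the recorded parameters together: the DataLoader, loss and fit() settings in the README diff above match the standard sentence-transformers v2 training loop. A minimal sketch, assuming a bert-base-uncased starting checkpoint (suggested by "uncased model" in the commit message) and SimCSE-style duplicated-sentence pairs; neither is confirmed by this page:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, util

# Assumed starting checkpoint; mean pooling is added automatically.
model = SentenceTransformer("bert-base-uncased")

# SimCSE-style pairs: the same cleaned sentence twice, with dropout acting
# as augmentation. The actual pair construction is an assumption.
cleaned_sentences = ["the beam position monitor showed an orbit drift"]
train_examples = [InputExample(texts=[s, s]) for s in cleaned_sentences]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# scale=20.0 is the 1/0.05 temperature recorded in the README.
train_loss = losses.MultipleNegativesRankingLoss(
    model, scale=20.0, similarity_fct=util.cos_sim
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=4,
    scheduler="WarmupLinear",
    warmup_steps=0,
    weight_decay=0.01,
)
```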