sulcan committed
Commit 5a78253 · verified · 1 Parent(s): 9dfc7a3

Upload 11 files


Uncased model trained for 4 epochs on PA_ARXIV, PA_BOOKS and PA_JACOW, with all equations, tables, special symbols and numbers removed.

The preprocessing considerably improves results: the model is now roughly on par with general-purpose sentence-transformers models overall, with minor improvements on tokens specific to the particle accelerator (PA) community, such as BPM.
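As a rough illustration of the cleaning step described above, here is a minimal Python sketch. The real logic lives in the prepare_mmd_eqations_and_tables_for_simcse function in PA_LOGBOOKS/code/mmd.py, which is not shown on this page; the regex patterns below are assumptions, not the actual implementation.

```python
import re

def strip_mmd_artifacts(text: str) -> str:
    """Sketch of removing equations, tables, headings, numbers and
    special symbols from MMD-converted text before training."""
    # Drop display equations (\[...\]) and inline math ($...$).
    text = re.sub(r"\\\[.*?\\\]", " ", text, flags=re.DOTALL)
    text = re.sub(r"\$[^$]+\$", " ", text)
    # Drop markdown table rows (pipe-delimited lines).
    text = re.sub(r"^\|.*$", " ", text, flags=re.MULTILINE)
    # Drop MMD headings (lines starting with '#').
    text = re.sub(r"^#+ .*$", " ", text, flags=re.MULTILINE)
    # Drop numbers, then any remaining special symbols.
    text = re.sub(r"\d+(\.\d+)?", " ", text)
    text = re.sub(r"[^a-zA-Z\s.,;:'()-]", " ", text)
    # Normalize whitespace.
    return re.sub(r"\s+", " ", text).strip()

print(strip_mmd_artifacts("## Optics\nThe BPM at $s_0$ read 3.2 mm."))
# -> "The BPM at read mm."
```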

Files changed (2)
  1. README.md +5 -15
  2. model.safetensors +1 -1
README.md CHANGED

@@ -9,16 +9,11 @@ tags:
 
 ---
 
-# PACuna Embedding: Fine-Tuned Embedding for Particle Accelerator Science
+# {MODEL_NAME}
 
 This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.
 
-This model is a fine-tuned word embedding model optimized for applications in particle accelerator science.
-It was trained on a large corpus of scientific literature and papers related to particle accelerators.
-
-This fine-tuned embedding can be used as input to downstream natural language processing tasks relevant to particle accelerator research and operations,
-such as information retrieval from logbooks.
-
+<!--- Describe your model here -->
 
 ## Usage (Sentence-Transformers)
 
@@ -89,12 +84,9 @@ For an automated evaluation of this model, see the *Sentence Embeddings Benchmark*
 ## Training
 The model was trained with the parameters:
 
-### Dataset
-The dataset used is PA_JACOW+PA_BOOKS+PA_ARXIV. Equations, tables, MMD headings (\#), numbers and any special symbols were removed from the training input data (see the prepare_mmd_eqations_and_tables_for_simcse function in PA_LOGBOOKS/code/mmd.py).
-
 **DataLoader**:
 
-`torch.utils.data.dataloader.DataLoader` of length 28836 with parameters:
+`torch.utils.data.dataloader.DataLoader` of length 25444 with parameters:
 ```
 {'batch_size': 16, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
 ```
@@ -106,8 +98,6 @@ The dataset used is PA_JACOW+PA_BOOKS+PA_ARXIV. Equations, tables, MMD headings
 {'scale': 20.0, 'similarity_fct': 'cos_sim'}
 ```
 
-The scaling parameter follows the paper's suggested temperature of 0.05, i.e. cos_sim(a, b) / 0.05, which gives scale = 1/0.05 = 20.
-
 Parameters of the fit()-Method:
 ```
 {
@@ -121,8 +111,8 @@ Parameters of the fit()-Method:
 },
 "scheduler": "WarmupLinear",
 "steps_per_epoch": null,
-"warmup_steps": 46137,
-"weight_decay": 0.0
+"warmup_steps": 0.0,
+"weight_decay": 0.01
 }
 ```
model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:dcb1c51c373a276e30ef690b58943419621c9b9d5ee65e9c1cc38a5430f0917d
+oid sha256:ff2d9a7b55a2a8465c79a0f39a49e753080836960026685a3abdd8fbeb16fa25
 size 439776096
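Putting the recorded parameters together: the DataLoader, loss and fit() settings in the README diff above match the standard sentence-transformers v2 training loop. A minimal sketch, assuming a bert-base-uncased starting checkpoint (suggested by "uncased model" in the commit message) and SimCSE-style duplicated-sentence pairs; neither is confirmed by this page:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, util

# Assumed starting checkpoint; mean pooling is added automatically.
model = SentenceTransformer("bert-base-uncased")

# SimCSE-style pairs: the same cleaned sentence twice, with dropout acting
# as augmentation. The actual pair construction is an assumption.
cleaned_sentences = ["the beam position monitor showed an orbit drift"]
train_examples = [InputExample(texts=[s, s]) for s in cleaned_sentences]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# scale=20.0 is the 1/0.05 temperature recorded in the README.
train_loss = losses.MultipleNegativesRankingLoss(
    model, scale=20.0, similarity_fct=util.cos_sim
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=4,
    scheduler="WarmupLinear",
    warmup_steps=0,
    weight_decay=0.01,
)
```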