---
language: en
license: cc-by-4.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- bert
- accelerator-physics
- physics
- scientific-literature
- embeddings
- domain-specific
library_name: sentence-transformers
pipeline_tag: feature-extraction
base_model: thellert/physbert_cased
model-index:
- name: AccPhysBERT
  results:
  - task:
      type: feature-extraction
      name: Feature Extraction
    dataset:
      name: Accelerator Physics Publications
      type: accelerator-physics
    metrics:
    - type: cosine_accuracy
      value: 0.91
      name: Citation Classification
    - type: v_measure
      value: 0.637
      name: Category Clustering (main)
    - type: ndcg_at_10
      value: 0.663
      name: Information Retrieval
datasets:
- inspire-hep
---

# AccPhysBERT

**AccPhysBERT** is a specialized sentence-embedding model fine-tuned for **accelerator physics**, capturing semantic nuances in this technical domain. It delivers state-of-the-art performance in tasks such as semantic search, citation classification, reviewer matching, and clustering of accelerator-physics literature.

---
## Model Description

- **Architecture**: BERT-based, fine-tuned from [PhysBERT (cased)](https://huggingface.co/thellert/physbert_cased/tree/main) using Supervised Contrastive Learning (SimCSE).
- **Optimized For**: Titles, abstracts, proposals, and full text from the accelerator-physics community.
- **Notable Features**:
  - Trained on 109k accelerator-physics publications from INSPIRE HEP
  - Leverages 690k citation pairs and 2M synthetic query–source pairs
  - Trained via SentenceTransformers to produce dense, semantically rich embeddings

**Developed by**: Thorsten Hellert, João Montenegro, Marco Venturini, Andrea Pollastro
**Funded by**: US Department of Energy, Lawrence Berkeley National Laboratory
**Model Type**: Sentence embedding (BERT-based, SimCSE fine-tuned)
**Language**: English
**License**: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
**Paper**: *Domain-specific text embedding model for accelerator physics*, Phys. Rev. Accel. Beams 28, 044601 (2025)
[https://doi.org/10.1103/PhysRevAccelBeams.28.044601](https://doi.org/10.1103/PhysRevAccelBeams.28.044601)

---

## Training Data

- **Core Corpus**:
  - 109,000 accelerator-physics publications (INSPIRE HEP category: "Accelerators")
  - Over 1 GB of full-text markdown-style text (via OCR/Nougat)
- **Annotation Sources**:
  - 690,000 citation pairs
  - 49 semantic categories labeled via ChatGPT-4o
  - 2,000,000 synthetic query–source pairs generated with LLaMA3-70B

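Each of these supervision signals ultimately reduces to a positive text pair. A hypothetical sketch of what such pairs might look like; the field layout and texts are illustrative, not the authors' actual schema:

```python
# Hypothetical examples of the two kinds of positive pairs described above.
# Texts are made up for illustration.
citation_pair = (
    "Abstract of the citing paper on beam-based alignment ...",
    "Abstract of the cited paper it references ...",
)
query_source_pair = (
    "How is emittance growth mitigated in a plasma-wakefield stage?",  # LLM-generated query
    "Source paragraph from the publication that answers it ...",
)
```
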
---

## Training Procedure

- **Fine-tuning Method**: SimCSE (contrastive loss)
- **Hyperparameters**:
  - Batch size: 512
  - Learning rate: 2e-4
  - Temperature: 0.05
  - Weight decay: 0.01
  - Optimizer: Adam
  - Epochs: 2
- **Infrastructure**: 32 × NVIDIA A100 GPUs @ NERSC
- **Framework**: SentenceTransformers

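The training script itself is not part of this card; below is a minimal sketch of how the setup above could be wired together in SentenceTransformers, assuming positive pairs shaped like those in Training Data. `MultipleNegativesRankingLoss` with in-batch negatives and `scale = 1/temperature = 1/0.05 = 20` stands in for the temperature-scaled contrastive objective; the pair data is a placeholder.

```python
# Minimal sketch only; not the authors' actual training script.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("thellert/physbert_cased")  # base model per this card

# Placeholder positive pairs (citation pairs or synthetic query-source pairs)
train_examples = [
    InputExample(texts=["query or citing abstract ...", "matching source text ..."]),
    # ... one InputExample per positive pair
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=512)

# In-batch negatives with temperature-scaled cosine similarity (scale = 1/0.05 = 20)
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=2,
    optimizer_params={"lr": 2e-4},
    weight_decay=0.01,
)
```
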
---

## Evaluation Results

| Task                    | Metric               | Score       |
|-------------------------|----------------------|-------------|
| Citation Classification | Cosine Accuracy      | 91.0%       |
| Category Clustering     | V-measure (main/sub) | 63.7 / 77.2 |
| Information Retrieval   | nDCG@10              | 66.3        |

AccPhysBERT outperforms BERT, SciBERT, and large general-purpose embedding models in all accelerator-specific benchmarks.

---

## Usage

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("thellert/accphysbert")
model = AutoModel.from_pretrained("thellert/accphysbert")

text = "We report on beam instabilities observed in the LCLS-II injector."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)

# Use mean pooling (excluding [CLS] and [SEP])
token_embeddings = outputs.last_hidden_state[:, 1:-1, :]
sentence_embedding = token_embeddings.mean(dim=1)
```
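
The resulting embeddings can be compared with cosine similarity for semantic search. A small follow-on sketch continuing from the snippet above; the second sentence is a made-up example:

```python
import torch.nn.functional as F

# Embed a second passage exactly as above (made-up example text)
text2 = "Transverse instabilities driven by wakefields in a photoinjector."
inputs2 = tokenizer(text2, return_tensors="pt")
embedding2 = model(**inputs2).last_hidden_state[:, 1:-1, :].mean(dim=1)

# Cosine similarity between the two sentence embeddings (higher = more related)
print(F.cosine_similarity(sentence_embedding, embedding2).item())
```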
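
Since the card lists `sentence-transformers` as the library, the model may also load directly through SentenceTransformers, assuming the repository ships the corresponding pooling configuration (untested sketch):

```python
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer("thellert/accphysbert")
embeddings = st_model.encode([
    "Emittance growth in the booster ring",
    "Beam-based alignment of storage-ring quadrupoles",
])
print(embeddings.shape)  # (2, embedding_dim)
```
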
---

## Citation

If you use AccPhysBERT, please cite:

```bibtex
@article{Hellert_2025,
  title     = {Domain-specific text embedding model for accelerator physics},
  author    = {Hellert, Thorsten and Montenegro, João and Venturini, Marco and Pollastro, Andrea},
  journal   = {Physical Review Accelerators and Beams},
  volume    = {28},
  number    = {4},
  pages     = {044601},
  year      = {2025},
  publisher = {American Physical Society},
  doi       = {10.1103/PhysRevAccelBeams.28.044601},
  url       = {https://doi.org/10.1103/PhysRevAccelBeams.28.044601}
}
```

---

## Contact

Thorsten Hellert
Lawrence Berkeley National Laboratory

---

## Acknowledgments

This model builds on PhysBERT and was trained using NERSC resources. Thanks to Alex Hexemer, Fernando Sannibale, and Antonin Sulc for their support and discussions.