---
language: en
license: cc-by-4.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- bert
- accelerator-physics
- physics
- scientific-literature
- embeddings
- domain-specific
library_name: sentence-transformers
pipeline_tag: feature-extraction
base_model: thellert/physbert_cased
model-index:
- name: AccPhysBERT
  results:
  - task:
      type: feature-extraction
      name: Feature Extraction
    dataset:
      name: Accelerator Physics Publications
      type: accelerator-physics
    metrics:
    - type: cosine_accuracy
      value: 0.91
      name: Citation Classification
    - type: v_measure
      value: 0.637
      name: Category Clustering (main)
    - type: ndcg_at_10
      value: 0.663
      name: Information Retrieval
datasets:
- inspire-hep
---

# AccPhysBERT

**AccPhysBERT** is a specialized sentence-embedding model fine-tuned for **accelerator physics**, capturing semantic nuances in this technical domain. It delivers state-of-the-art performance in tasks such as semantic search, citation classification, reviewer matching, and clustering of accelerator-physics literature.

---

## Model Description

- **Architecture**: BERT-based, fine-tuned from [PhysBERT (cased)](https://huggingface.co/thellert/physbert_cased) using supervised contrastive learning (SimCSE).
- **Optimized For**: Titles, abstracts, proposals, and full text from the accelerator-physics community.
- **Notable Features**:
  - Trained on 109 k accelerator-physics publications from INSPIRE HEP
  - Leverages 690 k citation pairs and 2 M synthetic query–source pairs
  - Trained via SentenceTransformers to produce dense, semantically rich embeddings

**Developed by**: Thorsten Hellert, João Montenegro, Marco Venturini, Andrea Pollastro  
**Funded by**: US Department of Energy, Lawrence Berkeley National Laboratory  
**Model Type**: Sentence embedding (BERT-based, SimCSE fine-tuned)  
**Language**: English  
**License**: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)  
**Paper**: *Domain-specific text embedding model for accelerator physics*, Phys. Rev. Accel. Beams 28, 044601 (2025)  
[https://doi.org/10.1103/PhysRevAccelBeams.28.044601](https://doi.org/10.1103/PhysRevAccelBeams.28.044601)

---

## Training Data

- **Core Corpus**:  
  - 109,000 accelerator-physics publications (INSPIRE HEP category: "Accelerators")  
  - Over 1 GB of full text converted to markdown-style plain text with the Nougat OCR model

- **Annotation Sources**:  
  - 690,000 citation pairs  
  - 49 semantic categories labeled with GPT-4o  
  - 2,000,000 synthetic query–source pairs generated with Llama-3-70B

---

## Training Procedure

- **Fine-tuning Method**: SimCSE (contrastive loss; a minimal sketch follows below)
- **Hyperparameters**:
  - Batch size: 512  
  - Learning rate: 2e-4  
  - Temperature: 0.05  
  - Weight decay: 0.01  
  - Optimizer: Adam  
  - Epochs: 2  
- **Infrastructure**: 32 × NVIDIA A100 GPUs @ NERSC  
- **Framework**: SentenceTransformers
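
To make the objective concrete, here is a minimal sketch of the SimCSE-style, temperature-scaled contrastive (InfoNCE) loss with in-batch negatives. This illustrates the technique only; the function name and tensor shapes are assumptions, not the actual training code.

```python
import torch
import torch.nn.functional as F

def simcse_contrastive_loss(anchor_emb: torch.Tensor,
                            positive_emb: torch.Tensor,
                            temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE loss over (batch, dim) embeddings of paired texts,
    e.g. citation pairs or query-source pairs. Row i of each tensor
    forms a positive pair; all other rows act as in-batch negatives."""
    a = F.normalize(anchor_emb, dim=-1)   # unit-normalize so dot product = cosine
    p = F.normalize(positive_emb, dim=-1)
    sim = a @ p.T / temperature           # (batch, batch) similarity matrix
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)   # diagonal entries are the targets
```

With a batch size of 512, each positive pair is contrasted against 511 in-batch negatives, which is why large batches are beneficial for this objective.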

---

## Evaluation Results

| Task                     | Metric               | Score       |
|--------------------------|----------------------|-------------|
| Citation Classification  | Cosine Accuracy      | 91.0%       |
| Category Clustering      | V-measure (main/sub) | 63.7 / 77.2 |
| Information Retrieval    | nDCG@10              | 66.3        |

AccPhysBERT outperforms BERT, SciBERT, and large general-purpose embedding models on all accelerator-specific benchmarks.
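
For intuition, cosine accuracy on citation classification can be read as: embed both texts, score the pair by cosine similarity, and threshold. A toy sketch, with invented abstracts and an invented threshold:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thellert/accphysbert")

# Hypothetical pair of related accelerator-physics statements
a = "Emittance growth from coherent synchrotron radiation in bunch compressors."
b = "We analyze CSR-driven emittance dilution in a four-dipole chicane."

score = util.cos_sim(model.encode(a), model.encode(b)).item()
is_related = score > 0.5  # in practice the threshold is tuned on labeled pairs
```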

---

## Example Usage

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("thellert/accphysbert")
model = AutoModel.from_pretrained("thellert/accphysbert")
model.eval()

text = "We report on beam instabilities observed in the LCLS-II injector."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling over token embeddings, excluding [CLS] and [SEP].
# Note: this slicing assumes a single, unpadded sequence; for batched,
# padded input, average using inputs["attention_mask"] instead.
token_embeddings = outputs.last_hidden_state[:, 1:-1, :]
sentence_embedding = token_embeddings.mean(dim=1)
```
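
Since the repository is packaged for SentenceTransformers (see `library_name` above), the model can presumably also be loaded directly through that library, which applies its stored pooling configuration automatically. The documents and query below are invented for illustration:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thellert/accphysbert")

docs = [
    "We report on beam instabilities observed in the LCLS-II injector.",
    "Touschek lifetime optimization in a fourth-generation storage ring.",
]
query = "injector beam instability"

doc_emb = model.encode(docs)             # (2, dim) document embeddings
query_emb = model.encode(query)          # (dim,) query embedding
print(util.cos_sim(query_emb, doc_emb))  # rank documents by similarity
```

This is the pattern behind the semantic-search and reviewer-matching use cases mentioned above: embed once, then rank by cosine similarity.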


---

## Citation

If you use AccPhysBERT, please cite:

```bibtex
@article{Hellert_2025,
  title     = {Domain-specific text embedding model for accelerator physics},
  author    = {Hellert, Thorsten and Montenegro, João and Venturini, Marco and Pollastro, Andrea},
  journal   = {Physical Review Accelerators and Beams},
  volume    = {28},
  number    = {4},
  pages     = {044601},
  year      = {2025},
  publisher = {American Physical Society},
  doi       = {10.1103/PhysRevAccelBeams.28.044601},
  url       = {https://doi.org/10.1103/PhysRevAccelBeams.28.044601}
}
```

---

## Contact

Thorsten Hellert  
Lawrence Berkeley National Laboratory  
📧 [email protected]

---

## Acknowledgments

This model builds on PhysBERT and was trained using NERSC resources. Thanks to Alex Hexemer, Fernando Sannibale, and Antonin Sulc for their support and discussions.