---
language: en
license: cc-by-4.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- bert
- accelerator-physics
- physics
- scientific-literature
- embeddings
- domain-specific
library_name: sentence-transformers
pipeline_tag: feature-extraction
base_model: thellert/physbert_cased
model-index:
- name: AccPhysBERT
results:
- task:
type: feature-extraction
name: Feature Extraction
dataset:
name: Accelerator Physics Publications
type: accelerator-physics
metrics:
- type: cosine_accuracy
value: 0.91
name: Citation Classification
- type: v_measure
value: 0.637
name: Category Clustering (main)
- type: ndcg_at_10
value: 0.663
name: Information Retrieval
datasets:
- inspire-hep
---
# AccPhysBERT
**AccPhysBERT** is a specialized sentence-embedding model fine-tuned for **accelerator physics**, capturing semantic nuances in this technical domain. It delivers state-of-the-art performance in tasks such as semantic search, citation classification, reviewer matching, and clustering of accelerator-physics literature.
---
## Model Description
- **Architecture**: BERT-based, fine-tuned from [PhysBERT (cased)](https://huggingface.co/thellert/physbert_cased) using supervised contrastive learning (SimCSE).
- **Optimized For**: Titles, abstracts, proposals, and full text from the accelerator-physics community.
- **Notable Features**:
- Trained on 109,000 accelerator-physics publications from INSPIRE HEP
- Leverages 690,000 citation pairs and 2,000,000 synthetic query–source pairs
- Trained via SentenceTransformers to produce dense, semantically rich embeddings
**Developed by**: Thorsten Hellert, João Montenegro, Marco Venturini, Andrea Pollastro
**Funded by**: US Department of Energy, Lawrence Berkeley National Laboratory
**Model Type**: Sentence embedding (BERT-based, SimCSE fine-tuned)
**Language**: English
**License**: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
**Paper**: *Domain-specific text embedding model for accelerator physics*, Phys. Rev. Accel. Beams 28, 044601 (2025)
[https://doi.org/10.1103/PhysRevAccelBeams.28.044601](https://doi.org/10.1103/PhysRevAccelBeams.28.044601)
---
## Training Data
- **Core Corpus**:
- 109,000 accelerator-physics publications (INSPIRE HEP category: "Accelerators")
- Over 1 GB of full text in markdown-style format, extracted with the Nougat OCR tool
- **Annotation Sources**:
- 690,000 citation pairs
- 49 semantic categories labeled with GPT-4o
- 2,000,000 synthetic query–source pairs generated with Llama-3-70B (a hypothetical pair format is sketched after this list)
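
For illustration, a single positive pair drawn from these sources might be represented as follows. This is a purely hypothetical record shape; the card does not specify field names or the storage format:

```python
# Hypothetical shape of one contrastive training pair; field names
# and format are illustrative, not taken from the paper or this card.
train_pair = {
    "anchor":   "We apply orbit-response-matrix analysis to correct the storage-ring optics ...",
    "positive": "Linear optics characterization from closed-orbit data in storage rings ...",
    "origin":   "citation",  # or "synthetic_query" for LLM-generated pairs
}
```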
---
## Training Procedure
- **Fine-tuning Method**: SimCSE (contrastive loss)
- **Hyperparameters**:
- Batch size: 512
- Learning rate: 2e-4
- Temperature: 0.05
- Weight decay: 0.01
- Optimizer: Adam
- Epochs: 2
- **Infrastructure**: 32 × NVIDIA A100 GPUs @ NERSC
- **Framework**: SentenceTransformers
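
The training script itself is not part of this card, but a minimal SentenceTransformers sketch using the hyperparameters above could look like the following. `MultipleNegativesRankingLoss` is the library's in-batch-negatives contrastive loss, which mirrors the supervised-SimCSE setup; its `scale` of 20 corresponds to the stated temperature of 0.05 (scale = 1/temperature). The training pairs are placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the PhysBERT (cased) base model
model = SentenceTransformer("thellert/physbert_cased")

# Placeholder positive pairs (citation pairs, synthetic query-source pairs, ...)
train_examples = [
    InputExample(texts=["citing or query text", "matching source text"]),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=512)

# In-batch-negatives contrastive loss; scale = 1 / 0.05 = 20
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=2,
    optimizer_params={"lr": 2e-4},
    weight_decay=0.01,
)
```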
---
## Evaluation Results
| Task | Metric | Score |
|------|--------|-------|
| Citation Classification | Cosine accuracy | 0.910 |
| Category Clustering (main / sub) | V-measure | 0.637 / 0.772 |
| Information Retrieval | nDCG@10 | 0.663 |
On these accelerator-physics benchmarks, AccPhysBERT outperforms BERT, SciBERT, and much larger general-purpose embedding models.
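
To try retrieval-style usage of the kind behind the nDCG@10 score on your own data, a minimal sketch (with a made-up query and mini-corpus) could look like:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thellert/accphysbert")

query = "emittance growth from space charge in photoinjectors"
corpus = [
    "Space-charge-induced emittance growth in high-brightness photoinjectors.",
    "Superconducting cavity design considerations for CW linacs.",
    "Beam-based alignment of storage-ring quadrupoles.",
]

query_emb = model.encode(query, convert_to_tensor=True)
corpus_emb = model.encode(corpus, convert_to_tensor=True)

# Rank corpus entries by cosine similarity to the query
hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]
for hit in hits:
    print(f"{hit['score']:.3f}  {corpus[hit['corpus_id']]}")
```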
---
## Example Usage
```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("thellert/accphysbert")
model = AutoModel.from_pretrained("thellert/accphysbert")

text = "We report on beam instabilities observed in the LCLS-II injector."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# Mean pooling over content tokens, dropping [CLS] and [SEP]
# (valid for a single, unpadded sequence)
token_embeddings = outputs.last_hidden_state[:, 1:-1, :]
sentence_embedding = token_embeddings.mean(dim=1)
```
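
Since the card lists `sentence-transformers` as its library, the model should also load directly through that API, which applies the configured pooling for you (assuming the hosted checkpoint ships a SentenceTransformers pooling config):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thellert/accphysbert")

sentences = [
    "We report on beam instabilities observed in the LCLS-II injector.",
    "Instability thresholds in the injector were measured versus bunch charge.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarity between the two sentences
print(util.cos_sim(embeddings[0], embeddings[1]).item())
```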
---
## Citation
If you use AccPhysBERT, please cite:
```bibtex
@article{Hellert_2025,
title = {Domain-specific text embedding model for accelerator physics},
author = {Hellert, Thorsten and Montenegro, João and Venturini, Marco and Pollastro, Andrea},
journal = {Physical Review Accelerators and Beams},
volume = {28},
number = {4},
pages = {044601},
year = {2025},
publisher = {American Physical Society},
doi = {10.1103/PhysRevAccelBeams.28.044601},
url = {https://doi.org/10.1103/PhysRevAccelBeams.28.044601}
}
```
---
## Contact
Thorsten Hellert
Lawrence Berkeley National Laboratory
📧 [email protected]
---
## Acknowledgments
This model builds on PhysBERT and was trained using NERSC resources. Thanks to Alex Hexemer, Fernando Sannibale, and Antonin Sulc for their support and discussions. |