|
---
language: en
license: cc-by-4.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- bert
- accelerator-physics
- physics
- scientific-literature
- embeddings
- domain-specific
library_name: sentence-transformers
pipeline_tag: feature-extraction
base_model: thellert/physbert_cased
model-index:
- name: AccPhysBERT
  results:
  - task:
      type: feature-extraction
      name: Feature Extraction
    dataset:
      name: Accelerator Physics Publications
      type: accelerator-physics
    metrics:
    - type: cosine_accuracy
      value: 0.91
      name: Citation Classification
    - type: v_measure
      value: 0.637
      name: Category Clustering (main)
    - type: ndcg_at_10
      value: 0.663
      name: Information Retrieval
datasets:
- inspire-hep
---
|
|
|
# AccPhysBERT |
|
|
|
**AccPhysBERT** is a specialized sentence-embedding model fine-tuned for **accelerator physics**, capturing semantic nuances in this technical domain. It delivers state-of-the-art performance in tasks such as semantic search, citation classification, reviewer matching, and clustering of accelerator-physics literature. |
|
|
|
--- |
|
|
|
## Model Description |
|
|
|
- **Architecture**: BERT-based, fine-tuned from [PhysBERT (cased)](https://huggingface.co/thellert/physbert_cased) with supervised SimCSE-style contrastive learning.
|
- **Optimized For**: Titles, abstracts, proposals, and full text from the accelerator-physics community. |
|
- **Notable Features**: |
|
- Trained on 109k accelerator-physics publications from INSPIRE HEP

- Leverages 690k citation pairs and 2M synthetic query–source pairs

- Trained with SentenceTransformers to produce dense, semantically rich embeddings
|
|
|
**Developed by**: Thorsten Hellert, João Montenegro, Marco Venturini, Andrea Pollastro |
|
**Funded by**: US Department of Energy, Lawrence Berkeley National Laboratory |
|
**Model Type**: Sentence embedding (BERT-based, SimCSE fine-tuned) |
|
**Language**: English |
|
**License**: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) |
|
**Paper**: *Domain-specific text embedding model for accelerator physics*, Phys. Rev. Accel. Beams 28, 044601 (2025) |
|
[https://doi.org/10.1103/PhysRevAccelBeams.28.044601](https://doi.org/10.1103/PhysRevAccelBeams.28.044601) |
|
|
|
--- |
|
|
|
## Training Data |
|
|
|
- **Core Corpus**: |
|
- 109,000 accelerator-physics publications (INSPIRE HEP category: "Accelerators") |
|
- Over 1 GB of full text in markdown format, extracted from PDFs with the Nougat OCR model
|
|
|
- **Annotation Sources**: |
|
- 690,000 citation pairs |
|
- 49 semantic categories labeled via GPT-4o

- 2,000,000 synthetic query–source pairs generated with Llama 3 70B
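
The citation and query–source pairs above serve as positive pairs for contrastive fine-tuning. As a rough illustration of how such pairs can be fed to SentenceTransformers (the file name and JSON schema below are hypothetical, not the authors' actual data format):

```python
import json

from sentence_transformers import InputExample

# Hypothetical file: one JSON object per line holding the texts of two
# papers linked by a citation, e.g. {"text_a": "...", "text_b": "..."}
train_examples = []
with open("citation_pairs.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # Each citing/cited pair becomes a positive pair; in SimCSE-style
        # training, the other examples in a batch act as negatives.
        train_examples.append(InputExample(texts=[record["text_a"], record["text_b"]]))
```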
|
|
|
--- |
|
|
|
## Training Procedure |
|
|
|
- **Fine-tuning Method**: SimCSE (contrastive loss) |
|
- **Hyperparameters**: |
|
- Batch size: 512 |
|
- Learning rate: 2e-4 |
|
- Temperature: 0.05 |
|
- Weight decay: 0.01 |
|
- Optimizer: Adam |
|
- Epochs: 2 |
|
- **Infrastructure**: 32 × NVIDIA A100 GPUs @ NERSC |
|
- **Framework**: SentenceTransformers |
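
For illustration, a SimCSE-style run with these hyperparameters can be sketched with the SentenceTransformers `fit` API. This is a minimal single-GPU sketch, not the authors' distributed training code; `MultipleNegativesRankingLoss` implements the InfoNCE objective with in-batch negatives, and its `scale` parameter is the inverse temperature (1 / 0.05 = 20):

```python
import torch
from torch.utils.data import DataLoader

from sentence_transformers import InputExample, SentenceTransformer, losses

# Start from the PhysBERT base model named in this card; loading a plain
# BERT checkpoint this way adds a default mean-pooling layer on top.
model = SentenceTransformer("thellert/physbert_cased")

# Placeholder positive pairs; in practice these would be the citation and
# synthetic query–source pairs described in the Training Data section.
train_examples = [
    InputExample(texts=["abstract of citing paper", "abstract of cited paper"]),
    InputExample(texts=["synthetic query", "matching source passage"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=512)

# InfoNCE-style contrastive loss with in-batch negatives;
# scale = 1 / temperature = 1 / 0.05 = 20
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=2,
    optimizer_class=torch.optim.Adam,  # the card lists Adam
    optimizer_params={"lr": 2e-4},
    weight_decay=0.01,
)
```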
|
|
|
--- |
|
|
|
## Evaluation Results |
|
|
|
| Task                    | Metric                 | Score         |
|-------------------------|------------------------|---------------|
| Citation Classification | Cosine Accuracy        | 0.91          |
| Category Clustering     | V-measure (main / sub) | 0.637 / 0.772 |
| Information Retrieval   | nDCG@10                | 0.663         |
|
|
|
AccPhysBERT outperforms BERT, SciBERT, and large general-purpose embedding models across all of these accelerator-specific benchmarks.
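
As one example of how such a score can be obtained, SentenceTransformers ships a `BinaryClassificationEvaluator` that reports cosine-similarity accuracy on labeled sentence pairs. The sketch below uses hypothetical placeholder data rather than the paper's benchmark set:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import BinaryClassificationEvaluator

model = SentenceTransformer("thellert/accphysbert")

# Hypothetical placeholder pairs: label 1 = linked by a citation, 0 = unrelated
sentences1 = ["abstract of paper A", "abstract of paper C"]
sentences2 = ["abstract of paper B, cited by A", "abstract of an unrelated paper D"]
labels = [1, 0]

evaluator = BinaryClassificationEvaluator(sentences1, sentences2, labels)
# Depending on the library version, this returns a single accuracy score
# or a dict of metrics (including cosine accuracy)
print(evaluator(model))
```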
|
|
|
--- |
|
|
|
## Example Usage |
|
|
|
```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("thellert/accphysbert")
model = AutoModel.from_pretrained("thellert/accphysbert")
model.eval()

text = "We report on beam instabilities observed in the LCLS-II injector."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings, excluding the [CLS] and [SEP] tokens
# (valid here because a single, unpadded sequence is encoded)
token_embeddings = outputs.last_hidden_state[:, 1:-1, :]
sentence_embedding = token_embeddings.mean(dim=1)  # shape: (1, hidden_size)
```
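
Since the card lists `library_name: sentence-transformers`, the model should also load directly through the SentenceTransformers API (assuming the repository includes the usual sentence-transformers configuration), which handles pooling internally:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thellert/accphysbert")

sentences = [
    "We report on beam instabilities observed in the LCLS-II injector.",
    "Emittance growth in the booster was traced to a misaligned quadrupole.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two sentence embeddings
print(util.cos_sim(embeddings[0], embeddings[1]))
```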
|
|
|
|
|
--- |
|
|
|
## Citation |
|
|
|
If you use AccPhysBERT, please cite: |
|
|
|
```bibtex
@article{Hellert_2025,
  title     = {Domain-specific text embedding model for accelerator physics},
  author    = {Hellert, Thorsten and Montenegro, João and Venturini, Marco and Pollastro, Andrea},
  journal   = {Physical Review Accelerators and Beams},
  volume    = {28},
  number    = {4},
  pages     = {044601},
  year      = {2025},
  publisher = {American Physical Society},
  doi       = {10.1103/PhysRevAccelBeams.28.044601},
  url       = {https://doi.org/10.1103/PhysRevAccelBeams.28.044601}
}
```
|
|
|
--- |
|
|
|
## Contact |
|
|
|
Thorsten Hellert |
|
Lawrence Berkeley National Laboratory |
|
📧 [email protected] |
|
|
|
--- |
|
|
|
## Acknowledgments |
|
|
|
This model builds on PhysBERT and was trained using NERSC resources. Thanks to Alex Hexemer, Fernando Sannibale, and Antonin Sulc for their support and discussions. |