|
---
language: en
license: cc-by-4.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- bert
- accelerator-physics
- physics
- scientific-literature
- embeddings
- domain-specific
library_name: sentence-transformers
pipeline_tag: feature-extraction
base_model: thellert/physbert_cased
model-index:
- name: AccPhysBERT
  results:
  - task:
      type: feature-extraction
      name: Feature Extraction
    dataset:
      name: Accelerator Physics Publications
      type: accelerator-physics
    metrics:
    - type: cosine_accuracy
      value: 0.91
      name: Citation Classification
    - type: v_measure
      value: 0.637
      name: Category Clustering (main)
    - type: ndcg_at_10
      value: 0.663
      name: Information Retrieval
datasets:
- inspire-hep
---
|
|
|
# AccPhysBERT |
|
|
|
**AccPhysBERT** is a specialized sentence-embedding model fine-tuned for **accelerator physics**, capturing semantic nuances in this technical domain. It delivers state-of-the-art performance in tasks such as semantic search, citation classification, reviewer matching, and clustering of accelerator-physics literature. |
|
|
|
--- |
|
|
|
## Model Description |
|
|
|
- **Architecture**: BERT-based, fine-tuned from [PhysBERT (cased)](https://huggingface.co/thellert/physbert_cased) with supervised SimCSE-style contrastive learning.
|
- **Optimized For**: Titles, abstracts, proposals, and full text from the accelerator-physics community. |
|
- **Notable Features**: |
|
- Trained on 109k accelerator-physics publications from INSPIRE HEP

- Leverages 690k citation pairs and 2M synthetic query–source pairs

- Trained with SentenceTransformers to produce dense, semantically rich embeddings
|
|
|
**Developed by**: Thorsten Hellert, João Montenegro, Marco Venturini, Andrea Pollastro |
|
**Funded by**: US Department of Energy, Lawrence Berkeley National Laboratory |
|
**Model Type**: Sentence embedding (BERT-based, SimCSE fine-tuned) |
|
**Language**: English |
|
**License**: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/) |
|
**Paper**: *Domain-specific text embedding model for accelerator physics*, Phys. Rev. Accel. Beams 28, 044601 (2025) |
|
[https://doi.org/10.1103/PhysRevAccelBeams.28.044601](https://doi.org/10.1103/PhysRevAccelBeams.28.044601) |
|
|
|
--- |
|
|
|
## Training Data |
|
|
|
- **Core Corpus**: |
|
- 109,000 accelerator-physics publications (INSPIRE HEP category: "Accelerators") |
|
- Over 1 GB of full text in markdown format, extracted from PDFs with the Nougat OCR model
|
|
|
- **Annotation Sources**: |
|
- 690,000 citation pairs |
|
- 49 semantic categories labeled via GPT-4o

- 2,000,000 synthetic query–source pairs generated with Llama 3 70B
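
The citation and query–source pairs above serve as positive pairs for contrastive fine-tuning. As a rough illustration of how such pairs can be fed to SentenceTransformers (the file name and JSON schema below are hypothetical, not the authors' actual data format):

```python
import json

from sentence_transformers import InputExample

# Hypothetical file: one JSON object per line holding the texts of two
# papers linked by a citation, e.g. {"text_a": "...", "text_b": "..."}
train_examples = []
with open("citation_pairs.jsonl") as f:
    for line in f:
        record = json.loads(line)
        # Each citing/cited pair becomes a positive pair; in SimCSE-style
        # training, the other examples in a batch act as negatives.
        train_examples.append(InputExample(texts=[record["text_a"], record["text_b"]]))
```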
|
|
|
--- |
|
|
|
## Training Procedure |
|
|
|
- **Fine-tuning Method**: SimCSE (contrastive loss) |
|
- **Hyperparameters**: |
|
- Batch size: 512 |
|
- Learning rate: 2e-4 |
|
- Temperature: 0.05 |
|
- Weight decay: 0.01 |
|
- Optimizer: Adam |
|
- Epochs: 2 |
|
- **Infrastructure**: 32 × NVIDIA A100 GPUs @ NERSC |
|
- **Framework**: SentenceTransformers |
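
For illustration, a SimCSE-style run with these hyperparameters can be sketched with the SentenceTransformers `fit` API. This is a minimal single-GPU sketch, not the authors' distributed training code; `MultipleNegativesRankingLoss` implements the InfoNCE objective with in-batch negatives, and its `scale` parameter is the inverse temperature (1 / 0.05 = 20):

```python
import torch
from torch.utils.data import DataLoader

from sentence_transformers import InputExample, SentenceTransformer, losses

# Start from the PhysBERT base model named in this card; loading a plain
# BERT checkpoint this way adds a default mean-pooling layer on top.
model = SentenceTransformer("thellert/physbert_cased")

# Placeholder positive pairs; in practice these would be the citation and
# synthetic query–source pairs described in the Training Data section.
train_examples = [
    InputExample(texts=["abstract of citing paper", "abstract of cited paper"]),
    InputExample(texts=["synthetic query", "matching source passage"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=512)

# InfoNCE-style contrastive loss with in-batch negatives;
# scale = 1 / temperature = 1 / 0.05 = 20
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=2,
    optimizer_class=torch.optim.Adam,  # the card lists Adam
    optimizer_params={"lr": 2e-4},
    weight_decay=0.01,
)
```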
|
|
|
--- |
|
|
|
## Evaluation Results |
|
|
|
| Task                    | Metric                 | Score         |
|-------------------------|------------------------|---------------|
| Citation Classification | Cosine Accuracy        | 0.91          |
| Category Clustering     | V-measure (main / sub) | 0.637 / 0.772 |
| Information Retrieval   | nDCG@10                | 0.663         |
|
|
|
AccPhysBERT outperforms BERT, SciBERT, and large general-purpose embedding models across all of these accelerator-specific benchmarks.
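
As one example of how such a score can be obtained, SentenceTransformers ships a `BinaryClassificationEvaluator` that reports cosine-similarity accuracy on labeled sentence pairs. The sketch below uses hypothetical placeholder data rather than the paper's benchmark set:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import BinaryClassificationEvaluator

model = SentenceTransformer("thellert/accphysbert")

# Hypothetical placeholder pairs: label 1 = linked by a citation, 0 = unrelated
sentences1 = ["abstract of paper A", "abstract of paper C"]
sentences2 = ["abstract of paper B, cited by A", "abstract of an unrelated paper D"]
labels = [1, 0]

evaluator = BinaryClassificationEvaluator(sentences1, sentences2, labels)
# Depending on the library version, this returns a single accuracy score
# or a dict of metrics (including cosine accuracy)
print(evaluator(model))
```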
|
|
|
--- |
|
|
|
## Example Usage |
|
|
|
```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("thellert/accphysbert")
model = AutoModel.from_pretrained("thellert/accphysbert")
model.eval()

text = "We report on beam instabilities observed in the LCLS-II injector."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token embeddings, excluding the [CLS] and [SEP] tokens
# (valid here because a single, unpadded sequence is encoded)
token_embeddings = outputs.last_hidden_state[:, 1:-1, :]
sentence_embedding = token_embeddings.mean(dim=1)  # shape: (1, hidden_size)
```
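
Since the card lists `library_name: sentence-transformers`, the model should also load directly through the SentenceTransformers API (assuming the repository includes the usual sentence-transformers configuration), which handles pooling internally:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thellert/accphysbert")

sentences = [
    "We report on beam instabilities observed in the LCLS-II injector.",
    "Emittance growth in the booster was traced to a misaligned quadrupole.",
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Cosine similarity between the two sentence embeddings
print(util.cos_sim(embeddings[0], embeddings[1]))
```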
|
|
|
|
|
--- |
|
|
|
## Citation |
|
|
|
If you use AccPhysBERT, please cite: |
|
|
|
```bibtex
@article{Hellert_2025,
  title     = {Domain-specific text embedding model for accelerator physics},
  author    = {Hellert, Thorsten and Montenegro, João and Venturini, Marco and Pollastro, Andrea},
  journal   = {Physical Review Accelerators and Beams},
  volume    = {28},
  number    = {4},
  pages     = {044601},
  year      = {2025},
  publisher = {American Physical Society},
  doi       = {10.1103/PhysRevAccelBeams.28.044601},
  url       = {https://doi.org/10.1103/PhysRevAccelBeams.28.044601}
}
```
|
|
|
--- |
|
|
|
## Contact |
|
|
|
Thorsten Hellert |
|
Lawrence Berkeley National Laboratory |
|
📧 [email protected] |
|
|
|
--- |
|
|
|
## Acknowledgments |
|
|
|
This model builds on PhysBERT and was trained using NERSC resources. Thanks to Alex Hexemer, Fernando Sannibale, and Antonin Sulc for their support and discussions. |