---
language: en
license: cc-by-4.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- bert
- accelerator-physics
- physics
- scientific-literature
- embeddings
- domain-specific
library_name: sentence-transformers
pipeline_tag: feature-extraction
base_model: thellert/physbert_cased
model-index:
- name: AccPhysBERT
  results:
  - task:
      type: feature-extraction
      name: Feature Extraction
    dataset:
      name: Accelerator Physics Publications
      type: accelerator-physics
    metrics:
    - type: cosine_accuracy
      value: 0.91
      name: Citation Classification
    - type: v_measure
      value: 0.637
      name: Category Clustering (main)
    - type: ndcg_at_10
      value: 0.663
      name: Information Retrieval
datasets:
- inspire-hep
---

# AccPhysBERT

**AccPhysBERT** is a specialized sentence-embedding model fine-tuned for **accelerator physics**, capturing semantic nuances in this technical domain. It delivers state-of-the-art performance in tasks such as semantic search, citation classification, reviewer matching, and clustering of accelerator-physics literature.

---
 
## Model Description

- **Architecture**: BERT-based, fine-tuned from [PhysBERT (cased)](https://huggingface.co/thellert/physbert_cased/tree/main) using Supervised Contrastive Learning (SimCSE).
- **Optimized For**: Titles, abstracts, proposals, and full text from the accelerator-physics community.
- **Notable Features**:
  - Trained on 109k accelerator-physics publications from INSPIRE HEP
  - Leverages 690k citation pairs and 2M synthetic query–source pairs
  - Trained via SentenceTransformers to produce dense, semantically rich embeddings (see the quick-start sketch below this list)
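Since the front matter declares `library_name: sentence-transformers`, the model should also load directly through that library. A minimal quick-start sketch; the example sentences are invented:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("thellert/accphysbert")
embeddings = model.encode([
    "Beam-based alignment of quadrupoles in a storage ring.",
    "Dark current measurements in an X-band accelerating structure.",
])
print(embeddings.shape)  # (2, 768), assuming the BERT-base hidden size
```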
 
**Developed by**: Thorsten Hellert, João Montenegro, Marco Venturini, Andrea Pollastro  
**Funded by**: US Department of Energy, Lawrence Berkeley National Laboratory  
**Model Type**: Sentence embedding (BERT-based, SimCSE fine-tuned)  
**Language**: English  
**License**: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)  
**Paper**: *Domain-specific text embedding model for accelerator physics*, Phys. Rev. Accel. Beams 28, 044601 (2025), [doi:10.1103/PhysRevAccelBeams.28.044601](https://doi.org/10.1103/PhysRevAccelBeams.28.044601)

---
 
## Training Data

- **Core Corpus**:
  - 109,000 accelerator-physics publications (INSPIRE HEP category: "Accelerators")
  - Over 1 GB of full-text, markdown-style text extracted via OCR (Nougat)
- **Annotation Sources**:
  - 690,000 citation pairs
  - 49 semantic categories labeled via ChatGPT-4o
  - 2,000,000 synthetic query–source pairs generated with LLaMA3-70B
 
 
---

## Training Procedure

- **Fine-tuning Method**: SimCSE with a supervised contrastive (in-batch negatives) loss; see the sketch after this list
- **Hyperparameters**:
  - Batch size: 512
  - Learning rate: 2e-4
  - Temperature: 0.05
  - Weight decay: 0.01
  - Optimizer: Adam
  - Epochs: 2
- **Infrastructure**: 32 × NVIDIA A100 GPUs @ NERSC
- **Framework**: SentenceTransformers
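The card does not include the training script; the following is a minimal sketch of this setup using the SentenceTransformers `fit` API. The pair texts are hypothetical placeholders for the citation and query–source pairs described above, and `MultipleNegativesRankingLoss` is the library's in-batch-negatives contrastive objective, with `scale` playing the role of the inverse temperature (1 / 0.05 = 20).

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the PhysBERT base model, as described above
model = SentenceTransformer("thellert/physbert_cased")

# Hypothetical positive pairs standing in for the real citation
# and synthetic query–source pairs
train_examples = [
    InputExample(texts=["query or citing abstract", "matching source passage"]),
    InputExample(texts=["another anchor text", "its positive counterpart"]),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=512)

# In-batch-negatives contrastive loss; scale = 1 / temperature = 20
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=2,
    weight_decay=0.01,
    optimizer_params={"lr": 2e-4},
)
```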
 
---

## Evaluation Results

| Task                     | Metric               | Score       |
|--------------------------|----------------------|-------------|
| Citation Classification  | Cosine Accuracy      | 91.0%       |
| Category Clustering      | V-measure (main/sub) | 63.7 / 77.2 |
| Information Retrieval    | nDCG@10              | 66.3        |

AccPhysBERT outperforms BERT, SciBERT, and large general-purpose embedding models on all of these accelerator-specific benchmarks.
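For a sense of the retrieval task behind the nDCG@10 figure, here is a small semantic-search sketch. It assumes the repository loads as a SentenceTransformers model (as the front matter's `library_name` indicates); the query and abstracts are invented for illustration:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thellert/accphysbert")

query = "emittance growth from intrabeam scattering in storage rings"
docs = [
    "We analyze intrabeam scattering and its impact on transverse emittance.",
    "A new RF gun design for high-brightness electron beams is presented.",
    "Touschek lifetime measurements at a fourth-generation light source.",
]

query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(docs, convert_to_tensor=True)

# Rank documents by cosine similarity to the query
scores = util.cos_sim(query_emb, doc_embs)[0]
for score, doc in sorted(zip(scores.tolist(), docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```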
 
---

## Usage

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("thellert/accphysbert")
model = AutoModel.from_pretrained("thellert/accphysbert")

text = "We report on beam instabilities observed in the LCLS-II injector."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling over token embeddings (excluding [CLS] and [SEP])
token_embeddings = outputs.last_hidden_state[:, 1:-1, :]
sentence_embedding = token_embeddings.mean(dim=1)
```
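The slice in the example above assumes a single, unpadded sequence. For padded batches, a mask-aware mean pooling is the safer pattern; this sketch is not from the original card, and unlike the slice above it averages over all real tokens, special tokens included:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("thellert/accphysbert")
model = AutoModel.from_pretrained("thellert/accphysbert")

texts = [
    "Beam lifetime studies in the ALS storage ring.",
    "Wakefield effects in superconducting RF cavities.",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch)

# Zero out padding positions, then average over the real tokens only
mask = batch["attention_mask"].unsqueeze(-1).float()
summed = (out.last_hidden_state * mask).sum(dim=1)
embeddings = summed / mask.sum(dim=1).clamp(min=1e-9)
```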

---

## Citation

If you use AccPhysBERT, please cite:

```bibtex
@article{Hellert_2025,
  title     = {Domain-specific text embedding model for accelerator physics},
  author    = {Hellert, Thorsten and Montenegro, João and Venturini, Marco and Pollastro, Andrea},
  journal   = {Physical Review Accelerators and Beams},
  volume    = {28},
  number    = {4},
  pages     = {044601},
  year      = {2025},
  publisher = {American Physical Society},
  doi       = {10.1103/PhysRevAccelBeams.28.044601},
  url       = {https://doi.org/10.1103/PhysRevAccelBeams.28.044601}
}
```

---

## Contact

Thorsten Hellert  
Lawrence Berkeley National Laboratory

---

## Acknowledgments

This model builds on PhysBERT and was trained using NERSC resources. Thanks to Alex Hexemer, Fernando Sannibale, and Antonin Sulc for their support and discussions.