---
language: en
license: cc-by-4.0
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
- bert
- accelerator-physics
- physics
- scientific-literature
- embeddings
- domain-specific
library_name: sentence-transformers
pipeline_tag: feature-extraction
base_model: thellert/physbert_cased
model-index:
- name: AccPhysBERT
  results:
  - task:
      type: feature-extraction
      name: Feature Extraction
    dataset:
      name: Accelerator Physics Publications
      type: accelerator-physics
    metrics:
    - type: cosine_accuracy
      value: 0.91
      name: Citation Classification
    - type: v_measure
      value: 0.637
      name: Category Clustering (main)
    - type: ndcg_at_10
      value: 0.663
      name: Information Retrieval
datasets:
- inspire-hep
---

# AccPhysBERT

**AccPhysBERT** is a specialized sentence-embedding model fine-tuned for **accelerator physics**, capturing semantic nuances in this technical domain. It delivers state-of-the-art performance in tasks such as semantic search, citation classification, reviewer matching, and clustering of accelerator-physics literature.

---
 
## Model Description

- **Architecture**: BERT-based, fine-tuned from [PhysBERT (cased)](https://huggingface.co/thellert/physbert_cased/tree/main) using Supervised Contrastive Learning (SimCSE).
- **Optimized For**: Titles, abstracts, proposals, and full text from the accelerator-physics community.
- **Notable Features**:
  - Trained on 109k accelerator-physics publications from INSPIRE HEP
  - Leverages 690k citation pairs and 2M synthetic query–source pairs
  - Trained via SentenceTransformers to produce dense, semantically rich embeddings (see the quick-start sketch below this list)
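Since the front matter declares `library_name: sentence-transformers`, the model should also load directly through that library. A minimal quick-start sketch; the example sentences are invented:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("thellert/accphysbert")
embeddings = model.encode([
    "Beam-based alignment of quadrupoles in a storage ring.",
    "Dark current measurements in an X-band accelerating structure.",
])
print(embeddings.shape)  # (2, 768), assuming the BERT-base hidden size
```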
 
**Developed by**: Thorsten Hellert, João Montenegro, Marco Venturini, Andrea Pollastro  
**Funded by**: US Department of Energy, Lawrence Berkeley National Laboratory  
**Model Type**: Sentence embedding (BERT-based, SimCSE fine-tuned)  
**Language**: English  
**License**: [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)  
**Paper**: *Domain-specific text embedding model for accelerator physics*, Phys. Rev. Accel. Beams 28, 044601 (2025), [doi:10.1103/PhysRevAccelBeams.28.044601](https://doi.org/10.1103/PhysRevAccelBeams.28.044601)

---
 
## Training Data

- **Core Corpus**:
  - 109,000 accelerator-physics publications (INSPIRE HEP category: "Accelerators")
  - Over 1 GB of full-text, markdown-style text extracted via OCR (Nougat)
- **Annotation Sources**:
  - 690,000 citation pairs
  - 49 semantic categories labeled via ChatGPT-4o
  - 2,000,000 synthetic query–source pairs generated with LLaMA3-70B
 
 
---

## Training Procedure

- **Fine-tuning Method**: SimCSE with a supervised contrastive (in-batch negatives) loss; see the sketch after this list
- **Hyperparameters**:
  - Batch size: 512
  - Learning rate: 2e-4
  - Temperature: 0.05
  - Weight decay: 0.01
  - Optimizer: Adam
  - Epochs: 2
- **Infrastructure**: 32 × NVIDIA A100 GPUs @ NERSC
- **Framework**: SentenceTransformers
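The card does not include the training script; the following is a minimal sketch of this setup using the SentenceTransformers `fit` API. The pair texts are hypothetical placeholders for the citation and query–source pairs described above, and `MultipleNegativesRankingLoss` is the library's in-batch-negatives contrastive objective, with `scale` playing the role of the inverse temperature (1 / 0.05 = 20).

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from the PhysBERT base model, as described above
model = SentenceTransformer("thellert/physbert_cased")

# Hypothetical positive pairs standing in for the real citation
# and synthetic query–source pairs
train_examples = [
    InputExample(texts=["query or citing abstract", "matching source passage"]),
    InputExample(texts=["another anchor text", "its positive counterpart"]),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=512)

# In-batch-negatives contrastive loss; scale = 1 / temperature = 20
train_loss = losses.MultipleNegativesRankingLoss(model, scale=20.0)

model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=2,
    weight_decay=0.01,
    optimizer_params={"lr": 2e-4},
)
```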
 
---

## Evaluation Results

| Task                     | Metric               | Score       |
|--------------------------|----------------------|-------------|
| Citation Classification  | Cosine Accuracy      | 91.0%       |
| Category Clustering      | V-measure (main/sub) | 63.7 / 77.2 |
| Information Retrieval    | nDCG@10              | 66.3        |

AccPhysBERT outperforms BERT, SciBERT, and large general-purpose embedding models on all of these accelerator-specific benchmarks.
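For a sense of the retrieval task behind the nDCG@10 figure, here is a small semantic-search sketch. It assumes the repository loads as a SentenceTransformers model (as the front matter's `library_name` indicates); the query and abstracts are invented for illustration:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thellert/accphysbert")

query = "emittance growth from intrabeam scattering in storage rings"
docs = [
    "We analyze intrabeam scattering and its impact on transverse emittance.",
    "A new RF gun design for high-brightness electron beams is presented.",
    "Touschek lifetime measurements at a fourth-generation light source.",
]

query_emb = model.encode(query, convert_to_tensor=True)
doc_embs = model.encode(docs, convert_to_tensor=True)

# Rank documents by cosine similarity to the query
scores = util.cos_sim(query_emb, doc_embs)[0]
for score, doc in sorted(zip(scores.tolist(), docs), reverse=True):
    print(f"{score:.3f}  {doc}")
```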
 
---

## Usage

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("thellert/accphysbert")
model = AutoModel.from_pretrained("thellert/accphysbert")

text = "We report on beam instabilities observed in the LCLS-II injector."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling over token embeddings (excluding [CLS] and [SEP])
token_embeddings = outputs.last_hidden_state[:, 1:-1, :]
sentence_embedding = token_embeddings.mean(dim=1)
```
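The slice in the example above assumes a single, unpadded sequence. For padded batches, a mask-aware mean pooling is the safer pattern; this sketch is not from the original card, and unlike the slice above it averages over all real tokens, special tokens included:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("thellert/accphysbert")
model = AutoModel.from_pretrained("thellert/accphysbert")

texts = [
    "Beam lifetime studies in the ALS storage ring.",
    "Wakefield effects in superconducting RF cavities.",
]
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch)

# Zero out padding positions, then average over the real tokens only
mask = batch["attention_mask"].unsqueeze(-1).float()
summed = (out.last_hidden_state * mask).sum(dim=1)
embeddings = summed / mask.sum(dim=1).clamp(min=1e-9)
```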

---

## Citation

If you use AccPhysBERT, please cite:

```bibtex
@article{Hellert_2025,
  title     = {Domain-specific text embedding model for accelerator physics},
  author    = {Hellert, Thorsten and Montenegro, João and Venturini, Marco and Pollastro, Andrea},
  journal   = {Physical Review Accelerators and Beams},
  volume    = {28},
  number    = {4},
  pages     = {044601},
  year      = {2025},
  publisher = {American Physical Society},
  doi       = {10.1103/PhysRevAccelBeams.28.044601},
  url       = {https://doi.org/10.1103/PhysRevAccelBeams.28.044601}
}
```

---

## Contact

Thorsten Hellert  
Lawrence Berkeley National Laboratory

---

## Acknowledgments

This model builds on PhysBERT and was trained using NERSC resources. Thanks to Alex Hexemer, Fernando Sannibale, and Antonin Sulc for their support and discussions.