---
library_name: transformers
tags: [physics, NLP, embedding, sentence-transformer]
---

# Model Card for PhysBERT

PhysBERT is a specialized text embedding model for physics, designed to improve information retrieval, citation classification, and clustering of physics literature. Trained on 1.2 million physics papers, it outperforms general-purpose models in physics-specific tasks.

## Model Description

PhysBERT is a BERT-based text embedding model for physics, fine-tuned using SimCSE for optimized physics-specific performance. This model enables efficient retrieval, categorization, and analysis of physics literature, achieving higher relevance and accuracy on domain-specific NLP tasks. The uncased version can be found [here](https://huggingface.co/thellert/physbert_uncased).

- **Developed by:** Thorsten Hellert, João Montenegro, Andrea Pollastro
- **Funded by:** US Department of Energy, Lawrence Berkeley National Laboratory
- **Model type:** Text embedding model (BERT-based)
- **Language(s) (NLP):** English
- **Paper:** [PhysBERT: A Text Embedding Model for Physics Scientific Literature](https://doi.org/10.1063/5.0238090)



## Training Data

The model was trained on a 40 GB corpus of physics publications from arXiv, comprising 1.2 million documents curated for scientific accuracy.

## Training Procedure

The model was pre-trained using Masked Language Modeling (MLM) and fine-tuned with SimCSE for sentence embeddings.
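For context, SimCSE trains the encoder with a contrastive (InfoNCE) objective over in-batch pairs, pulling embeddings of matched sentences together while pushing apart the other sentences in the batch. The sketch below only illustrates that loss; the temperature value and pairing strategy shown are illustrative assumptions, not the exact settings used to train PhysBERT.

```python
import torch
import torch.nn.functional as F

def simcse_loss(emb_a: torch.Tensor, emb_b: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """SimCSE-style in-batch contrastive loss (illustrative sketch).

    emb_a and emb_b are (batch, dim) embeddings of paired sentences,
    e.g. the same sentences encoded twice with different dropout masks
    in the unsupervised variant.
    """
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    # Cosine similarity of every sentence in view A against every sentence in view B
    sim = emb_a @ emb_b.T / temperature  # shape: (batch, batch)
    # Matching pairs lie on the diagonal; all off-diagonal entries act as negatives
    labels = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim, labels)
```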

## Example of Usage
```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load PhysBERT tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("thellert/physbert_cased")
model = AutoModel.from_pretrained("thellert/physbert_cased")

# Sample text to embed
sample_text = "Electrons exhibit both particle and wave-like behavior."

# Tokenize the input text and run it through the model (no gradients needed at inference)
inputs = tokenizer(sample_text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Extract the token embeddings, shape (batch, seq_len, hidden_dim)
token_embeddings = outputs.last_hidden_state
# Drop the [CLS] and [SEP] tokens, then mean-pool the rest for the sentence embedding
token_embeddings = token_embeddings[:, 1:-1, :]
sentence_embedding = token_embeddings.mean(dim=1)
```
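Building on the snippet above, the resulting embeddings can be compared with cosine similarity, e.g. to rank passages against a query for retrieval. The `embed` helper below is a minimal sketch that simply reuses the mean-pooling recipe from the example; it is not part of any official API.

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("thellert/physbert_cased")
model = AutoModel.from_pretrained("thellert/physbert_cased")

def embed(text: str) -> torch.Tensor:
    # Same recipe as above: mean-pool token embeddings, excluding [CLS] and [SEP]
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    return hidden[:, 1:-1, :].mean(dim=1)

query = "wave-particle duality of electrons"
passages = [
    "Electrons exhibit both particle and wave-like behavior.",
    "The stock market closed higher on Friday.",
]
scores = [F.cosine_similarity(embed(query), embed(p)).item() for p in passages]
print(scores)  # the physics passage should score noticeably higher
```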

## Citation

If you find this work useful, please consider citing the following paper:

```
@article{10.1063/5.0238090,
    author = {Hellert, Thorsten and Montenegro, João and Pollastro, Andrea},
    title = "{PhysBERT: A text embedding model for physics scientific literature}",
    journal = {APL Machine Learning},
    volume = {2},
    number = {4},
    pages = {046105},
    year = {2024},
    month = {10},
    issn = {2770-9019},
    doi = {10.1063/5.0238090},
    url = {https://doi.org/10.1063/5.0238090},
    eprint = {https://pubs.aip.org/aip/aml/article-pdf/doi/10.1063/5.0238090/20227307/046105\_1\_5.0238090.pdf},
}
```

## Model Card Authors

Thorsten Hellert, João Montenegro, Andrea Pollastro

## Model Card Contact

Thorsten Hellert, Lawrence Berkeley National Laboratory, [email protected]