Update README.md
README.md
CHANGED
@@ -1,77 +1,114 @@
---
library_name: transformers
license: mit
base_model: FacebookAI/xlm-roberta-large
tags:
- generated_from_trainer
model-index:
- name: multiclass-classifier-patents
  results: []
---

<!-- This model card has been generated automatically according to the information the Trainer had access to. You
should probably proofread and complete it, then remove this comment. -->

# multiclass-classifier-patents

This model is a fine-tuned version of [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large).
It achieves the following results on the evaluation set:
- Loss: 0.0067
- F1 Micro: 0.7001
- Precision Micro: 0.8337
- Recall Micro: 0.6034
- Exact Match F1: 0.5296
- Exact Match Precision: 0.5296
- Exact Match Recall: 0.5296
- Any Match F1: 0.9079
- Any Match Precision: 0.9079
- Any Match Recall: 0.9079

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 64
- eval_batch_size: 64
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 10
- mixed_precision_training: Native AMP

### Training results

| Training Loss | Epoch | Step | Validation Loss | F1 Micro | Precision Micro | Recall Micro | Exact Match F1 | Exact Match Precision | Exact Match Recall | Any Match F1 | Any Match Precision | Any Match Recall |
|:-------------:|:-----:|:-----:|:---------------:|:--------:|:---------------:|:------------:|:--------------:|:---------------------:|:------------------:|:------------:|:-------------------:|:----------------:|
| 0.01 | 1.0 | 1292 | 0.0083 | 0.5977 | 0.8265 | 0.4681 | 0.4300 | 0.4300 | 0.4300 | 0.7675 | 0.7675 | 0.7675 |
| 0.0077 | 2.0 | 2584 | 0.0074 | 0.6595 | 0.8326 | 0.5460 | 0.4879 | 0.4879 | 0.4879 | 0.8636 | 0.8636 | 0.8636 |
| 0.007 | 3.0 | 3876 | 0.0071 | 0.6829 | 0.8173 | 0.5864 | 0.5035 | 0.5035 | 0.5035 | 0.8958 | 0.8958 | 0.8958 |
| 0.0063 | 4.0 | 5168 | 0.0069 | 0.6883 | 0.8317 | 0.5871 | 0.5140 | 0.5140 | 0.5140 | 0.8956 | 0.8956 | 0.8956 |
| 0.0058 | 5.0 | 6460 | 0.0068 | 0.6957 | 0.8337 | 0.5969 | 0.5182 | 0.5182 | 0.5182 | 0.9058 | 0.9058 | 0.9058 |
| 0.0053 | 6.0 | 7752 | 0.0069 | 0.6999 | 0.8366 | 0.6017 | 0.5271 | 0.5271 | 0.5271 | 0.9082 | 0.9082 | 0.9082 |
| 0.0048 | 7.0 | 9044 | 0.0069 | 0.7046 | 0.8159 | 0.6201 | 0.5225 | 0.5225 | 0.5225 | 0.9185 | 0.9185 | 0.9185 |
| 0.0046 | 8.0 | 10336 | 0.0069 | 0.7069 | 0.8100 | 0.6271 | 0.5241 | 0.5241 | 0.5241 | 0.9196 | 0.9196 | 0.9196 |
| 0.0042 | 9.0 | 11628 | 0.0070 | 0.7064 | 0.8208 | 0.6200 | 0.5282 | 0.5282 | 0.5282 | 0.9174 | 0.9174 | 0.9174 |
| 0.004 | 10.0 | 12920 | 0.0070 | 0.7064 | 0.8184 | 0.6214 | 0.5276 | 0.5276 | 0.5276 | 0.9177 | 0.9177 | 0.9177 |

---
language:
- en
base_model:
- FacebookAI/xlm-roberta-large
pipeline_tag: text-classification
library_name: transformers
---

# Patent Classification Model

### Model Description

**multilabel_patent_classifier** is a fine-tuned [XLM-RoBERTa-large](https://huggingface.co/FacebookAI/xlm-roberta-large) model trained on British patent class information for 1855-1883, made available [here](http://walkerhanlon.com/data_resources/british_patent_classification_database.zip).

It has been trained to recognize the 146 patent classes defined by the British Patent Office, which are listed [here](https://huggingface.co/matthewleechen/multiclass-classifier-patents/edit/main/BPO_classes.csv).

We take the original xlm-roberta-large [weights](https://huggingface.co/FacebookAI/xlm-roberta-large/blob/main/pytorch_model.bin) and fine-tune on our custom dataset for 10 epochs with a learning rate of 2e-05 and a batch size of 64.
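
For reference, the snippet below sketches a `TrainingArguments` configuration matching these hyperparameters; the output directory is a placeholder and any option not stated in this card (such as the per-device batch split) is an assumption, not the authors' exact setup.

```python
from transformers import TrainingArguments

# illustrative configuration only; output_dir is a placeholder and
# unstated options are assumptions
training_args = TrainingArguments(
    output_dir="multilabel_patent_classifier",
    learning_rate=2e-5,
    per_device_train_batch_size=64,   # the card reports a batch size of 64
    per_device_eval_batch_size=64,
    num_train_epochs=10,
    lr_scheduler_type="linear",
    seed=42,
    fp16=True,                        # mixed precision (Native AMP)
)
```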

### Usage

This model can be used with the Hugging Face Transformers pipeline API for text classification:

```python
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer

model_name = "matthewleechen/multilabel_patent_classifier"

# load the fine-tuned multi-label classifier and its tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

pipe = pipeline(
    task="text-classification",
    model=model,
    tokenizer=tokenizer,
    device=0,     # GPU 0; set device=-1 to run on CPU
    top_k=None,   # return scores for every class (replaces the deprecated return_all_scores=True)
)
```
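
A minimal usage sketch follows; the patent title below is an invented example rather than a row from the dataset, and the 0.5 cut-off mirrors the threshold described under Training Procedure.

```python
# hypothetical input title, for illustration only
titles = ["Improvements in apparatus for spinning cotton"]

# with top_k=None the pipeline returns one list of {"label", "score"} dicts per input
results = pipe(titles)

# keep every class whose score clears the 0.5 threshold
predicted = [entry["label"] for entry in results[0] if entry["score"] > 0.5]
print(predicted)
```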

### Training Data

Our training data consists of patent titles, each labelled with a binary (0/1) tag for every patent class. Labels were generated by the British Patent Office between 1855-1883, and our patent titles were extracted from the front pages of our specification texts using a patent title NER [model](https://huggingface.co/matthewleechen/patent_titles_ner).
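
Purely for illustration (the title, field names, and class index below are invented, not rows from the dataset), each training example pairs a title string with a 146-dimensional 0/1 label vector:

```python
NUM_CLASSES = 146  # number of British Patent Office classes

# hypothetical example of the multi-label format described above
example = {
    "text": "Improvements in apparatus for spinning cotton",  # invented title
    "labels": [0.0] * NUM_CLASSES,
}
example["labels"][12] = 1.0  # arbitrary class index marked positive, for illustration only
```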

### Training Procedure

We follow the standard multi-label classification setup with the HuggingFace Trainer API, but replace the default `BCEWithLogitsLoss` with a [focal loss](https://arxiv.org/pdf/1708.02002) (α=1, γ=2) to address class imbalance. Both during evaluation and at inference, we apply a sigmoid to each logit and use a 0.5 threshold to determine the positive labels for each class.
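
As a rough sketch of this setup (not the authors' actual training code; the class name, argument names, and defaults are assumptions), one way to swap the loss is to subclass `Trainer` and override `compute_loss` with a binary focal loss:

```python
import torch
import torch.nn.functional as F
from transformers import Trainer


class FocalLossTrainer(Trainer):
    """Illustrative Trainer that replaces BCEWithLogitsLoss with a binary focal loss (alpha=1, gamma=2)."""

    def __init__(self, *args, alpha=1.0, gamma=2.0, **kwargs):
        super().__init__(*args, **kwargs)
        self.alpha = alpha
        self.gamma = gamma

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels").float()
        outputs = model(**inputs)
        logits = outputs.logits

        # element-wise binary cross-entropy, one term per (example, class) pair
        bce = F.binary_cross_entropy_with_logits(logits, labels, reduction="none")
        # p_t is the model's probability for the true outcome of each label
        p_t = torch.exp(-bce)
        # the (1 - p_t)^gamma factor down-weights easy, well-classified labels
        loss = (self.alpha * (1.0 - p_t) ** self.gamma * bce).mean()

        return (loss, outputs) if return_outputs else loss
```

At evaluation and inference time, predictions then follow the thresholding described above: `torch.sigmoid(logits) > 0.5` per class.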

### Evaluation

We compute precision, recall, and F1 for each class (with a 0.5 sigmoid threshold), as well as exact match (scored only when the ground-truth and predicted class sets are identical) and any match (scored when there is any overlap between ground-truth and predicted classes) percentages.
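
A minimal sketch of how the two aggregate match metrics can be computed from binary prediction matrices (the function name and array layout are our own, not the card's evaluation code):

```python
import numpy as np

def match_rates(y_true: np.ndarray, y_pred: np.ndarray):
    """y_true, y_pred: (n_samples, n_classes) arrays of 0/1 labels."""
    exact = (y_true == y_pred).all(axis=1).mean()            # predicted set identical to ground truth
    any_match = ((y_true * y_pred).sum(axis=1) > 0).mean()   # at least one class in common
    return float(exact), float(any_match)
```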

These scores are aggregated for the test set below.

<table>
  <thead>
    <tr>
      <th>Metric Type</th>
      <th>Precision (Micro)</th>
      <th>Recall (Micro)</th>
      <th>F1 (Micro)</th>
      <th>Exact Match</th>
      <th>Any Match</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Micro Average</td>
      <td>83.4%</td>
      <td>60.3%</td>
      <td>70.0%</td>
      <td>52.9%</td>
      <td>90.8%</td>
    </tr>
  </tbody>
</table>

## References

```bibtex
@misc{hanlon2016,
  title  = {{British Patent Technology Classification Database: 1855–1882}},
  author = {Hanlon, Walker},
  year   = {2016},
  url    = {http://www.econ.ucla.edu/whanlon/}
}

@misc{lin2018focallossdenseobject,
  title         = {Focal Loss for Dense Object Detection},
  author        = {Tsung-Yi Lin and Priya Goyal and Ross Girshick and Kaiming He and Piotr Dollár},
  year          = {2018},
  eprint        = {1708.02002},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/1708.02002}
}
```

## Citation

If you use our model in your research, please cite our accompanying paper as follows:

```bibtex
@article{bct2025,
  title   = {300 Years of British Patents},
  author  = {Enrico Berkes and Matthew Lee Chen and Matteo Tranchero},
  journal = {arXiv preprint arXiv:2401.12345},
  year    = {2025},
  url     = {https://arxiv.org/abs/2401.12345}
}
```