matthewleechen committed on
Commit
8940132
·
verified ·
1 Parent(s): 5c60ce5

Update README.md

  1. README.md +94 -57
README.md CHANGED
@@ -1,77 +1,114 @@
1
  ---
 
 
 
 
 
2
  library_name: transformers
3
- license: mit
4
- base_model: FacebookAI/xlm-roberta-large
5
- tags:
6
- - generated_from_trainer
7
- model-index:
8
- - name: multiclass-classifier-patents
9
- results: []
10
  ---
11
 
12
- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
13
- should probably proofread and complete it, then remove this comment. -->
14
 
15
- # multiclass-classifier-patents
16
 
17
- This model is a fine-tuned version of [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) on the None dataset.
18
- It achieves the following results on the evaluation set:
19
- - Loss: 0.0067
20
- - F1 Micro: 0.7001
21
- - Precision Micro: 0.8337
22
- - Recall Micro: 0.6034
23
- - Exact Match F1: 0.5296
24
- - Exact Match Precision: 0.5296
25
- - Exact Match Recall: 0.5296
26
- - Any Match F1: 0.9079
27
- - Any Match Precision: 0.9079
28
- - Any Match Recall: 0.9079
29
 
30
- ## Model description
31
 
32
- More information needed
33
 
34
- ## Intended uses & limitations
35
 
36
- More information needed
37
 
38
- ## Training and evaluation data
 
39
 
40
- More information needed
41
 
42
- ## Training procedure
 
43
 
44
- ### Training hyperparameters
 
 
 
 
 
 
45
 
46
- The following hyperparameters were used during training:
47
- - learning_rate: 2e-05
48
- - train_batch_size: 64
49
- - eval_batch_size: 64
50
- - seed: 42
51
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
52
- - lr_scheduler_type: linear
53
- - num_epochs: 10
54
- - mixed_precision_training: Native AMP
55
 
56
- ### Training results
57
 
58
- | Training Loss | Epoch | Step | Validation Loss | F1 Micro | Precision Micro | Recall Micro | Exact Match F1 | Exact Match Precision | Exact Match Recall | Any Match F1 | Any Match Precision | Any Match Recall |
59
- |:-------------:|:-----:|:-----:|:---------------:|:--------:|:---------------:|:------------:|:--------------:|:---------------------:|:------------------:|:------------:|:-------------------:|:----------------:|
60
- | 0.01 | 1.0 | 1292 | 0.0083 | 0.5977 | 0.8265 | 0.4681 | 0.4300 | 0.4300 | 0.4300 | 0.7675 | 0.7675 | 0.7675 |
61
- | 0.0077 | 2.0 | 2584 | 0.0074 | 0.6595 | 0.8326 | 0.5460 | 0.4879 | 0.4879 | 0.4879 | 0.8636 | 0.8636 | 0.8636 |
62
- | 0.007 | 3.0 | 3876 | 0.0071 | 0.6829 | 0.8173 | 0.5864 | 0.5035 | 0.5035 | 0.5035 | 0.8958 | 0.8958 | 0.8958 |
63
- | 0.0063 | 4.0 | 5168 | 0.0069 | 0.6883 | 0.8317 | 0.5871 | 0.5140 | 0.5140 | 0.5140 | 0.8956 | 0.8956 | 0.8956 |
64
- | 0.0058 | 5.0 | 6460 | 0.0068 | 0.6957 | 0.8337 | 0.5969 | 0.5182 | 0.5182 | 0.5182 | 0.9058 | 0.9058 | 0.9058 |
65
- | 0.0053 | 6.0 | 7752 | 0.0069 | 0.6999 | 0.8366 | 0.6017 | 0.5271 | 0.5271 | 0.5271 | 0.9082 | 0.9082 | 0.9082 |
66
- | 0.0048 | 7.0 | 9044 | 0.0069 | 0.7046 | 0.8159 | 0.6201 | 0.5225 | 0.5225 | 0.5225 | 0.9185 | 0.9185 | 0.9185 |
67
- | 0.0046 | 8.0 | 10336 | 0.0069 | 0.7069 | 0.8100 | 0.6271 | 0.5241 | 0.5241 | 0.5241 | 0.9196 | 0.9196 | 0.9196 |
68
- | 0.0042 | 9.0 | 11628 | 0.0070 | 0.7064 | 0.8208 | 0.6200 | 0.5282 | 0.5282 | 0.5282 | 0.9174 | 0.9174 | 0.9174 |
69
- | 0.004 | 10.0 | 12920 | 0.0070 | 0.7064 | 0.8184 | 0.6214 | 0.5276 | 0.5276 | 0.5276 | 0.9177 | 0.9177 | 0.9177 |
70
 
 
71
 
72
- ### Framework versions
 
 
73
 
74
- - Transformers 4.45.2
75
- - Pytorch 2.0.1+cu117
76
- - Datasets 3.0.1
77
- - Tokenizers 0.20.3
1
  ---
2
+ language:
3
+ - en
4
+ base_model:
5
+ - FacebookAI/xlm-roberta-large
6
+ pipeline_tag: text-classification
7
  library_name: transformers
 
 
 
 
 
 
 
8
  ---
9
 
10
+ # Patent Classification Model
 
11
 
12
+ ### Model Description
13
 
14
+ **multilabel_patent_classifier** is a fine-tuned [XLM-RoBERTa-large](https://huggingface.co/FacebookAI/xlm-roberta-large) model trained on British patent class information from 1855 to 1883, made available [here](http://walkerhanlon.com/data_resources/british_patent_classification_database.zip).
 
 
 
 
 
 
 
 
 
 
 
15
 
16
+ It has been trained to assign patent titles to the 146 patent classes defined by the British Patent Office. The class list is available [here](https://huggingface.co/matthewleechen/multiclass-classifier-patents/edit/main/BPO_classes.csv).
17
 
18
+ We take the original xlm-roberta-large [weights](https://huggingface.co/FacebookAI/xlm-roberta-large/blob/main/pytorch_model.bin) and fine-tune them on our custom dataset for 10 epochs with a learning rate of 2e-05 and a batch size of 64.
19
 
20
+ ### Usage
21
 
22
+ This model can be used with the Hugging Face Transformers pipeline API for multi-label text classification:
23
 
24
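+ ```python
+ from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer
+
+ model_name = "matthewleechen/multilabel_patent_classifier"
+
+ # Load the tokenizer and the sequence-classification model (one logit per patent class)
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ pipe = pipeline(
+     task="text-classification",
+     model=model,
+     tokenizer=tokenizer,
+     device=0,  # first GPU; use device=-1 to run on CPU
+     top_k=None,  # return a score for every class (replaces the deprecated return_all_scores=True)
+     function_to_apply="sigmoid",  # independent per-class probabilities, matching the 0.5 threshold used in evaluation
+ )
+ ```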
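As a rough sketch of how the pipeline output might be post-processed: assuming the pipeline returns, for one input title, a list of `{"label", "score"}` dictionaries with one entry per class (some Transformers versions wrap this in an extra outer list), the 0.5 threshold can be applied as below. The helper name `select_labels`, the placeholder label names, and the example scores are hypothetical and for illustration only; treating "positive" as strictly above the threshold is also an assumption.

```python
from typing import Dict, List, Union

Score = Dict[str, Union[str, float]]


def select_labels(scores: List[Score], threshold: float = 0.5) -> List[str]:
    """Return the labels whose score exceeds the threshold, highest score first."""
    kept = [s for s in scores if float(s["score"]) > threshold]
    kept.sort(key=lambda s: float(s["score"]), reverse=True)
    return [str(s["label"]) for s in kept]


# Hypothetical scores in the shape a text-classification pipeline returns
# for one input; the label names are placeholders, not real class names.
example_scores = [
    {"label": "CLASS_A", "score": 0.91},
    {"label": "CLASS_B", "score": 0.07},
    {"label": "CLASS_C", "score": 0.64},
]

predicted = select_labels(example_scores)
print(predicted)
```

Because the problem is multi-label, a title can receive several classes or none at all, so callers should be prepared for an empty list.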
 
 
 
 
 
 
 
 
41
 
42
+ ### Training Data
43
 
44
+ Our training data consists of patent titles with a binary (0/1) label for each patent class. The labels were assigned by the British Patent Office between 1855 and 1883, and the patent titles were extracted from the front pages of our specification texts using a patent title NER [model](https://huggingface.co/matthewleechen/patent_titles_ner).
 
 
 
 
 
 
 
 
 
 
 
45
 
46
+ ### Training Procedure
47
 
48
+ We follow the standard multi-label classification setup with the Hugging Face Trainer API, but replace the default `BCEWithLogitsLoss` with a [focal loss](https://arxiv.org/pdf/1708.02002) function (α=1, γ=2) to address class imbalance. During both evaluation and inference, we apply a sigmoid to each logit and use a 0.5 threshold to determine the positive labels for each class.
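To illustrate why focal loss helps with class imbalance, here is a minimal, dependency-free sketch of the per-(title, class) loss with α=1 and γ=2. It is not the training code: the function names are hypothetical, it works on single floats rather than tensors, and it assumes the common formulation in which binary cross-entropy is scaled by `(1 - p_t) ** gamma`, where `p_t` is the probability the model assigns to the true label.

```python
import math


def bce_with_logits(logit: float, target: float) -> float:
    """Numerically stable binary cross-entropy computed from a raw logit."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def focal_loss(logit: float, target: float, alpha: float = 1.0, gamma: float = 2.0) -> float:
    """Focal loss for one (example, class) pair: alpha * (1 - p_t)^gamma * BCE."""
    bce = bce_with_logits(logit, target)
    p_t = math.exp(-bce)  # probability assigned to the true label (for 0/1 targets)
    return alpha * (1.0 - p_t) ** gamma * bce


# A confidently correct prediction is down-weighted far more than an uncertain one.
easy = focal_loss(logit=4.0, target=1.0)
hard = focal_loss(logit=0.0, target=1.0)
print(f"easy={easy:.6f} hard={hard:.6f}")
```

With 146 classes, most (title, class) pairs are easy negatives; the `(1 - p_t) ** gamma` factor shrinks their contribution so that the rarer, harder positives carry more of the gradient. Setting `gamma=0` recovers plain binary cross-entropy.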
49
+
50
+ ### Evaluation
51
 
52
+ We compute precision, recall, and F1 for each class (with a 0.5 sigmoid threshold), as well as exact match (the predicted classes are identical to the ground truth) and any match (the predicted classes share at least one class with the ground truth) percentages.
53
+
54
+ These scores are aggregated for the test set below.
55
+
56
+ <table>
57
+ <thead>
58
+ <tr>
59
+ <th>Metric Type</th>
60
+ <th>Precision (Micro)</th>
61
+ <th>Recall (Micro)</th>
62
+ <th>F1 (Micro)</th>
63
+ <th>Exact Match</th>
64
+ <th>Any Match</th>
65
+ </tr>
66
+ </thead>
67
+ <tbody>
68
+ <tr>
69
+ <td>Micro Average</td>
70
+ <td>83.4%</td>
71
+ <td>60.3%</td>
72
+ <td>70.0%</td>
73
+ <td>52.9%</td>
74
+ <td>90.8%</td>
75
+ </tr>
76
+ </tbody>
77
+ </table>
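The metrics reported above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the evaluation script: the function name `multilabel_metrics` and the toy labels are hypothetical, each title's classes are represented as a Python set, and "any match" is taken to mean a non-empty intersection between predicted and ground-truth classes.

```python
from typing import Dict, List, Set


def multilabel_metrics(y_true: List[Set[str]], y_pred: List[Set[str]]) -> Dict[str, float]:
    """Micro precision/recall/F1 plus exact-match and any-match rates."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    # Micro-averaging pools true/false positives and false negatives over all classes.
    tp = sum(len(t & p) for t, p in pairs)
    fp = sum(len(p - t) for t, p in pairs)
    fn = sum(len(t - p) for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "precision_micro": precision,
        "recall_micro": recall,
        "f1_micro": f1,
        "exact_match": sum(t == p for t, p in pairs) / len(pairs),
        "any_match": sum(bool(t & p) for t, p in pairs) / len(pairs),
    }


# Toy example with placeholder class names.
truth = [{"A", "B"}, {"C"}, {"A"}, {"B", "C"}]
preds = [{"A", "B"}, {"A"}, {"A", "C"}, {"B"}]
metrics = multilabel_metrics(truth, preds)
print(metrics)
```

Exact match is the strictest of these measures and any match the most lenient, which is why the two can differ widely on the same predictions.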
78
+
79
+
80
+ ## References
81
+
82
+ ```bibtex
83
+ @misc{hanlon2016,
84
+ title = {{British Patent Technology Classification Database: 1855–1882}},
85
+ author = {Hanlon, Walker},
86
+ year = {2016},
87
+ url = {http://www.econ.ucla.edu/whanlon/},
88
+ note = {Available at: \url{http://www.econ.ucla.edu/whanlon/}}
89
+ }
90
+
91
+ @misc{lin2018focallossdenseobject,
92
+ title={Focal Loss for Dense Object Detection},
93
+ author={Tsung-Yi Lin and Priya Goyal and Ross Girshick and Kaiming He and Piotr Dollár},
94
+ year={2018},
95
+ eprint={1708.02002},
96
+ archivePrefix={arXiv},
97
+ primaryClass={cs.CV},
98
+ url={https://arxiv.org/abs/1708.02002},
99
+ }
100
+ ```
101
+
102
+ ## Citation
103
+
104
+ If you use our model in your research, please cite our accompanying paper as follows:
105
+
106
+ ```bibtex
107
+ @article{bct2025,
108
+ title = {300 Years of British Patents},
109
+ author = {Enrico Berkes and Matthew Lee Chen and Matteo Tranchero},
110
+ journal = {arXiv preprint arXiv:2401.12345},
111
+ year = {2025},
112
+ url = {https://arxiv.org/abs/2401.12345}
113
+ }
114
+ ```