matthewleechen committed (verified) · commit b0f0bf1 · 1 parent: 9675365

Add files using upload-large-folder tool
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
.ipynb_checkpoints/data_split_test-checkpoint.csv ADDED
The diff for this file is too large to render. See raw diff
 
README.md ADDED
@@ -0,0 +1,214 @@
---
language:
- en
base_model:
- FacebookAI/xlm-roberta-large
pipeline_tag: token-classification
library_name: transformers
---

# Patent Entity Extraction Model

### Model Description

**patent_entities_ner** is a fine-tuned [XLM-RoBERTa-large](https://huggingface.co/FacebookAI/xlm-roberta-large) model, trained on a custom dataset of OCR'd front pages of patent specifications published by the British Patent Office and filed between 1617 and 1899.

It has been trained to recognize six classes of named entities:

- PER: full name of the inventor
- OCC: occupation of the inventor
- ADD: full (permanent) address of the inventor
- DATE: patent filing, submission, or approval dates
- FIRM: name of the firm affiliated with the inventor
- COMM: name of, and information mentioned about, the communicant

We take the original xlm-roberta-large [weights](https://huggingface.co/FacebookAI/xlm-roberta-large/blob/main/pytorch_model.bin) and fine-tune on our custom dataset for 29 epochs with a learning rate of 5e-05 and a batch size of 21. We chose the learning rate by tuning on the validation set.

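For reference, the sketch below shows how a comparable fine-tuning run can be set up with the Hugging Face `Trainer`. Only the hyperparameters (29 epochs, learning rate 5e-05, batch size 21) come from the description above; the dataset objects are placeholders and the remaining arguments are illustrative, not the authors' exact training script.

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

# IOB label set for the six entity classes, ordered to match this repo's config.json
classes = ["ADD", "COMM", "DATE", "FIRM", "OCC", "PER"]
labels = [f"B-{c}" for c in classes] + [f"I-{c}" for c in classes] + ["O"]

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")
model = AutoModelForTokenClassification.from_pretrained(
    "FacebookAI/xlm-roberta-large",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

training_args = TrainingArguments(
    output_dir="patent_entities_ner",
    learning_rate=5e-5,              # from the model card
    num_train_epochs=29,             # from the model card
    per_device_train_batch_size=21,  # from the model card
    per_device_eval_batch_size=21,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,   # placeholder: tokenized, IOB-aligned training split
    eval_dataset=tokenized_val,      # placeholder: tokenized, IOB-aligned validation split
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```
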
### Usage

This model can be used with the Hugging Face Transformers pipeline API for NER:

```python
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gbpatentdata/patent_entities_ner")
model = AutoModelForTokenClassification.from_pretrained("gbpatentdata/patent_entities_ner")


def custom_recognizer(text, model=model, tokenizer=tokenizer, device=0):

    # HF ner pipeline (token-level predictions)
    token_level_results = pipeline("ner", model=model, device=device, tokenizer=tokenizer)(text)

    # keep track of entities as we aggregate tokens
    entities = []
    current_entity = None

    for item in token_level_results:

        tag = item['entity']

        # replace '▁' with space for easier reading ('▁' is created by the XLM-RoBERTa tokenizer)
        word = item['word'].replace('▁', ' ')

        # aggregate I-O-B tagged entities
        if tag.startswith('B-'):

            if current_entity:
                entities.append(current_entity)

            current_entity = {'type': tag[2:], 'text': word.strip(), 'start': item['start'], 'end': item['end']}

        elif tag.startswith('I-'):

            if current_entity and tag[2:] == current_entity['type']:
                current_entity['text'] += word
                current_entity['end'] = item['end']

            else:

                if current_entity:
                    entities.append(current_entity)

                current_entity = {'type': tag[2:], 'text': word.strip(), 'start': item['start'], 'end': item['end']}

        else:
            # deal with O tag
            if current_entity:
                entities.append(current_entity)
            current_entity = None

    if current_entity:
        # add the last open entity
        entities.append(current_entity)

    # track entity merges
    merged_entities = []

    # merge adjacent entities of the same type
    for entity in entities:
        if merged_entities and merged_entities[-1]['type'] == entity['type'] and merged_entities[-1]['end'] == entity['start']:
            merged_entities[-1]['text'] += entity['text']
            merged_entities[-1]['end'] = entity['end']
        else:
            merged_entities.append(entity)

    # clean up extra spaces
    for entity in merged_entities:
        entity['text'] = ' '.join(entity['text'].split())

    # convert to list of dicts
    return [{'class': entity['type'],
             'entity_text': entity['text'],
             'start': entity['start'],
             'end': entity['end']} for entity in merged_entities]


example = """
Date of Application, 1st Aug., 1890-Accepted, 6th Sept., 1890
COMPLETE SPECIFICATION.
Improvements in Coin-freed Apparatus for the Sale of Goods.
I, CHARLES LOTINGA, of 33 Cambridge Street, Lower Grange, Cardiff, in the County of Glamorgan, Gentleman,
do hereby declare the nature of this invention and in what manner the same is to be performed,
to be particularly described and ascertained in and by the following statement
"""

ner_results = custom_recognizer(example)
print(ner_results)
```

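If the custom aggregation above is not needed, the pipeline's built-in entity grouping is often enough. A minimal alternative, reusing the `example` text above (its output format and subword merging differ slightly from the function above):

```python
from transformers import pipeline

ner = pipeline(
    "ner",
    model="gbpatentdata/patent_entities_ner",
    aggregation_strategy="simple",  # groups B-/I- tokens into whole entities
)
print(ner(example))
```
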
### Training Data

The custom dataset of front page texts of patent specifications was assembled in the following steps:

1. We fine-tuned a YOLO vision [model](https://huggingface.co/gbpatentdata/yolov8_patent_layouts) to detect bounding boxes around text, and used it to identify text regions on the front pages of patent specifications.
2. We used [Google Cloud Vision](https://cloud.google.com/vision?hl=en) to OCR the detected text regions, and then concatenated the OCR text (see the sketch after this list).
3. We randomly sampled 200 front page texts (plus another 201 oversampled from those that contain either firm or communicant information).

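The sketch below shows one way steps 1 and 2 could be reproduced, assuming the `ultralytics` package for the YOLO layout model and the `google-cloud-vision` client for OCR. The weights filename, crop handling, and region ordering are illustrative assumptions, not the authors' exact pipeline.

```python
import io

from google.cloud import vision
from PIL import Image
from ultralytics import YOLO

layout_model = YOLO("yolov8_patent_layouts.pt")  # assumed local copy of the fine-tuned layout weights
vision_client = vision.ImageAnnotatorClient()    # requires Google Cloud credentials


def ocr_front_page(image_path: str) -> str:
    """Detect text regions with YOLO, OCR each crop with Cloud Vision, and concatenate the text."""
    page = Image.open(image_path)
    detections = layout_model(image_path)[0]  # results for the single input image

    texts = []
    for box in detections.boxes.xyxy.tolist():  # each box is [x1, y1, x2, y2]
        crop = page.crop(tuple(int(v) for v in box))
        buffer = io.BytesIO()
        crop.save(buffer, format="PNG")
        response = vision_client.document_text_detection(
            image=vision.Image(content=buffer.getvalue())
        )
        texts.append(response.full_text_annotation.text)

    return "\n".join(texts)
```
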
Our custom dataset has accurate manual labels created jointly by an undergraduate student and an economics professor. The final dataset is split 60-20-20 (train-val-test). If a front page text is too long, we truncate it to the first 512 tokens (the model's maximum input length).

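One simple way to apply this kind of truncation (an assumption, not necessarily the authors' exact preprocessing) is to round-trip a text through the tokenizer loaded in the Usage section:

```python
# keep only the first 512 tokens of an over-long front page text
encoded = tokenizer(text, truncation=True, max_length=512)
truncated_text = tokenizer.decode(encoded["input_ids"], skip_special_tokens=True)
```
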
### Evaluation

Our evaluation metric is F1 at the full entity level. That is, we aggregated adjacent-indexed entities into full entities and computed F1 scores requiring an exact match. These scores for the test set are below.

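This aggregate-then-exact-match scoring is the same notion of entity-level F1 implemented by the `seqeval` library, so scores in this format can be reproduced from IOB tag sequences roughly as follows (seqeval here is an assumption; the authors' exact evaluation script is not shown):

```python
from seqeval.metrics import classification_report

# toy gold and predicted IOB sequences, one list per document
y_true = [["B-PER", "I-PER", "O", "B-ADD", "I-ADD", "I-ADD", "O", "B-DATE"]]
y_pred = [["B-PER", "I-PER", "O", "B-ADD", "I-ADD", "O", "O", "B-DATE"]]

# adjacent B-/I- tags are merged into full entities and scored on exact span + type match
print(classification_report(y_true, y_pred, digits=4))
```
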
<table>
  <thead>
    <tr>
      <th>Full Entity</th>
      <th>Precision</th>
      <th>Recall</th>
      <th>F1-Score</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>PER</td>
      <td>92.2%</td>
      <td>97.7%</td>
      <td>94.9%</td>
    </tr>
    <tr>
      <td>OCC</td>
      <td>93.8%</td>
      <td>93.8%</td>
      <td>93.8%</td>
    </tr>
    <tr>
      <td>ADD</td>
      <td>88.6%</td>
      <td>91.2%</td>
      <td>89.9%</td>
    </tr>
    <tr>
      <td>DATE</td>
      <td>93.7%</td>
      <td>98.7%</td>
      <td>96.1%</td>
    </tr>
    <tr>
      <td>FIRM</td>
      <td>64.0%</td>
      <td>94.1%</td>
      <td>76.2%</td>
    </tr>
    <tr>
      <td>COMM</td>
      <td>77.1%</td>
      <td>87.1%</td>
      <td>81.8%</td>
    </tr>
    <tr>
      <td>Overall (micro avg)</td>
      <td>89.9%</td>
      <td>95.3%</td>
      <td>92.5%</td>
    </tr>
    <tr>
      <td>Overall (macro avg)</td>
      <td>84.9%</td>
      <td>93.8%</td>
      <td>88.9%</td>
    </tr>
    <tr>
      <td>Overall (weighted avg)</td>
      <td>90.3%</td>
      <td>95.3%</td>
      <td>92.7%</td>
    </tr>
  </tbody>
</table>

## Citation

If you use our model or custom training/evaluation data in your research, please cite our accompanying paper as follows:

```bibtex
@article{bct2025,
  title   = {300 Years of British Patents},
  author  = {Enrico Berkes and Matthew Lee Chen and Matteo Tranchero},
  journal = {arXiv preprint arXiv:2401.12345},
  year    = {2025},
  url     = {https://arxiv.org/abs/2401.12345}
}
```
classification_report_lr_5.0000000000e-05_test.csv ADDED
@@ -0,0 +1,10 @@
Entity,Precision,Recall,F1-Score,Support
PER,0.9220,0.9774,0.9489,133
FIRM,0.6400,0.9412,0.7619,17
COMM,0.7714,0.8710,0.8182,31
DATE,0.9367,0.9867,0.9610,150
ADD,0.8857,0.9118,0.8986,102
OCC,0.9383,0.9383,0.9383,81
micro avg,0.8991,0.9533,0.9254,514
macro avg,0.8490,0.9377,0.8878,514
weighted avg,0.9032,0.9533,0.9267,514
classification_report_lr_5.0000000000e-05_val.csv ADDED
@@ -0,0 +1,10 @@
Entity,Precision,Recall,F1-Score,Support
OCC,0.9222,0.9540,0.9379,87
PER,0.9530,0.9793,0.9660,145
DATE,0.9732,1.0000,0.9864,145
ADD,0.8785,0.9216,0.8995,102
COMM,0.9062,0.9355,0.9206,31
FIRM,0.7368,0.8750,0.8000,16
micro avg,0.9286,0.9639,0.9459,526
macro avg,0.8950,0.9442,0.9184,526
weighted avg,0.9297,0.9639,0.9463,526
config.json ADDED
@@ -0,0 +1,58 @@
{
  "_name_or_path": "xlm-roberta-large",
  "architectures": [
    "XLMRobertaForTokenClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {
    "0": "B-ADD",
    "1": "B-COMM",
    "2": "B-DATE",
    "3": "B-FIRM",
    "4": "B-OCC",
    "5": "B-PER",
    "6": "I-ADD",
    "7": "I-COMM",
    "8": "I-DATE",
    "9": "I-FIRM",
    "10": "I-OCC",
    "11": "I-PER",
    "12": "O"
  },
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {
    "B-ADD": 0,
    "B-COMM": 1,
    "B-DATE": 2,
    "B-FIRM": 3,
    "B-OCC": 4,
    "B-PER": 5,
    "I-ADD": 6,
    "I-COMM": 7,
    "I-DATE": 8,
    "I-FIRM": 9,
    "I-OCC": 10,
    "I-PER": 11,
    "O": 12
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "xlm-roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.44.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 250002
}
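
The `id2label` / `label2id` maps above are what the token-classification pipeline uses to turn class indices into IOB tags. They can be inspected without downloading the model weights, for example:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("gbpatentdata/patent_entities_ner")
print(config.id2label)  # {0: 'B-ADD', 1: 'B-COMM', ..., 12: 'O'}
```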
data_split_test.csv ADDED
The diff for this file is too large to render. See raw diff
 
data_split_train.csv ADDED
The diff for this file is too large to render. See raw diff
 
data_split_val.csv ADDED
The diff for this file is too large to render. See raw diff
 
labelled_data.conll ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:23a649001c7829253d55b444ef96efb2748c2c1d6c9643972857373bf4a924a4
size 2235465156
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
size 5069051
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
{
  "bos_token": "<s>",
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": {
    "content": "<mask>",
    "lstrip": true,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "unk_token": "<unk>"
}
test_set_predictions.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.json ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3ffb37461c391f096759f4a9bbbc329da0f36952f88bab061fcf84940c022e98
size 17082999
tokenizer_config.json ADDED
@@ -0,0 +1,54 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "250001": {
      "content": "<mask>",
      "lstrip": true,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": true,
  "cls_token": "<s>",
  "eos_token": "</s>",
  "mask_token": "<mask>",
  "model_max_length": 512,
  "pad_token": "<pad>",
  "sep_token": "</s>",
  "tokenizer_class": "XLMRobertaTokenizer",
  "unk_token": "<unk>"
}