matthewleechen committed on
Commit 7eb7b5c · verified · 1 Parent(s): e05597e

Add files using upload-large-folder tool

.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ tokenizer.json filter=lfs diff=lfs merge=lfs -text
.ipynb_checkpoints/classification_report_lr_6.0000000000e-05_test-checkpoint.csv ADDED
@@ -0,0 +1,5 @@
+ Entity,Precision,Recall,F1-Score,Support
+ TITLE,0.9390,0.9747,0.9565,79
+ micro avg,0.9390,0.9747,0.9565,79
+ macro avg,0.9390,0.9747,0.9565,79
+ weighted avg,0.9390,0.9747,0.9565,79
README.md ADDED
@@ -0,0 +1,158 @@
+ ---
+ language:
+ - en
+ base_model:
+ - FacebookAI/xlm-roberta-large
+ pipeline_tag: token-classification
+ library_name: transformers
+ ---
+
+ # Patent Title Extraction Model
+
+ ### Model Description
+
+ **patent_titles_ner** is a fine-tuned [XLM-RoBERTa-large](https://huggingface.co/FacebookAI/xlm-roberta-large) model trained on a custom dataset of OCR'd front pages of patent specifications published by the British Patent Office and filed between 1617 and 1899. It recognizes the stated titles of inventions.
+
+ We take the original xlm-roberta-large [weights](https://huggingface.co/FacebookAI/xlm-roberta-large/blob/main/pytorch_model.bin) and fine-tune them on our custom dataset for 15 epochs with a learning rate of 6e-05 and a batch size of 21. The learning rate was chosen by tuning on the validation set.
+
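+ For reference, the snippet below is a minimal sketch of this fine-tuning setup using the Hugging Face `Trainer`. It is illustrative rather than our exact training script: the dataset objects are placeholders, and only the label set (from `config.json`), learning rate, batch size, and epoch count are taken from this model card.
+
+ ```python
+ from transformers import (AutoModelForTokenClassification, AutoTokenizer,
+                           DataCollatorForTokenClassification, Trainer, TrainingArguments)
+
+ label_list = ["B-TITLE", "I-TITLE", "O"]  # matches id2label in config.json
+
+ tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")
+ model = AutoModelForTokenClassification.from_pretrained(
+     "FacebookAI/xlm-roberta-large",
+     num_labels=len(label_list),
+     id2label=dict(enumerate(label_list)),
+     label2id={label: i for i, label in enumerate(label_list)},
+ )
+
+ args = TrainingArguments(
+     output_dir="patent_titles_ner",
+     learning_rate=6e-05,             # chosen by tuning on the validation set
+     per_device_train_batch_size=21,  # batch size reported above
+     num_train_epochs=15,
+     evaluation_strategy="epoch",
+     save_strategy="epoch",
+     load_best_model_at_end=True,
+ )
+
+ # placeholders: tokenized train/val splits carrying input_ids, attention_mask and labels
+ train_dataset, val_dataset = ..., ...
+
+ trainer = Trainer(
+     model=model,
+     args=args,
+     train_dataset=train_dataset,
+     eval_dataset=val_dataset,
+     data_collator=DataCollatorForTokenClassification(tokenizer),
+     tokenizer=tokenizer,
+ )
+ trainer.train()
+ ```
+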
+ ### Usage
+
+ This model can be used with the Hugging Face Transformers pipeline API for NER:
+
+ ```python
+ from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("gbpatentdata/patent_titles_ner")
+ model = AutoModelForTokenClassification.from_pretrained("gbpatentdata/patent_titles_ner")
+
+
+ def custom_recognizer(text, model=model, tokenizer=tokenizer, device=0):
+
+     # HF ner pipeline
+     token_level_results = pipeline("ner", model=model, device=device, tokenizer=tokenizer)(text)
+
+     # track aggregated entities
+     entities = []
+     current_entity = None
+
+     for item in token_level_results:
+
+         tag = item['entity']
+
+         # replace '▁' with space for easier reading ('▁' is produced by the XLM-RoBERTa SentencePiece tokenizer)
+         word = item['word'].replace('▁', ' ')
+
+         # aggregate I-O-B tagged entities
+         if tag.startswith('B-'):
+
+             if current_entity:
+                 entities.append(current_entity)
+
+             current_entity = {'type': tag[2:], 'text': word.strip(), 'start': item['start'], 'end': item['end']}
+
+         elif tag.startswith('I-'):
+
+             if current_entity and tag[2:] == current_entity['type']:
+                 current_entity['text'] += word
+                 current_entity['end'] = item['end']
+
+             else:
+
+                 if current_entity:
+                     entities.append(current_entity)
+
+                 current_entity = {'type': tag[2:], 'text': word.strip(), 'start': item['start'], 'end': item['end']}
+
+         else:
+             # deal with O tag
+             if current_entity:
+                 entities.append(current_entity)
+             current_entity = None
+
+     if current_entity:
+         # add the final entity
+         entities.append(current_entity)
+
+     # track entity merges
+     merged_entities = []
+
+     # merge adjacent entities of the same type
+     for entity in entities:
+         if merged_entities and merged_entities[-1]['type'] == entity['type'] and merged_entities[-1]['end'] == entity['start']:
+             merged_entities[-1]['text'] += entity['text']
+             merged_entities[-1]['end'] = entity['end']
+         else:
+             merged_entities.append(entity)
+
+     # clean up extra spaces
+     for entity in merged_entities:
+         entity['text'] = ' '.join(entity['text'].split())
+
+     # convert to list of dicts
+     return [{'class': entity['type'],
+              'entity_text': entity['text'],
+              'start': entity['start'],
+              'end': entity['end']} for entity in merged_entities]
+
+
+ example = """
+ Date of Application, 1st Aug., 1890-Accepted, 6th Sept., 1890
+ COMPLETE SPECIFICATION.
+ Improvements in Coin-freed Apparatus for the Sale of Goods.
+ I, CHARLES LOTINGA, of 33 Cambridge Street, Lower Grange, Cardiff, in the County of Glamorgan, Gentleman,
+ do hereby declare the nature of this invention and in what manner the same is to be performed,
+ to be particularly described and ascertained in and by the following statement
+ """
+
+ ner_results = custom_recognizer(example)
+ print(ner_results)
+ ```
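+
+ Each element of the returned list is a dict with `class`, `entity_text`, `start` and `end` keys, one per merged entity. For the example above, a correct prediction is a single `TITLE` entity covering "Improvements in Coin-freed Apparatus for the Sale of Goods." (exact character offsets depend on the whitespace in the input).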
+
+ ### Training Data
+
+ The custom dataset of front page texts of patent specifications was assembled in the following steps (a rough sketch of steps 1 and 2 is given after the list):
+
+ 1. We fine-tuned a YOLO vision [model](https://huggingface.co/gbpatentdata/yolov8_patent_layouts) to detect bounding boxes around text, and use it to identify text regions on the front pages of patent specifications.
+ 2. We use [Google Cloud Vision](https://cloud.google.com/vision?hl=en) to OCR the detected text regions, and then concatenate the OCR text.
+ 3. We randomly sample 200 front page texts (and another 201 oversampled from those that contain either firm or communicant information).
+
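+ The snippet below is a rough sketch of steps 1 and 2 (layout detection followed by OCR). It is illustrative only: the weights filename and the Google Cloud credential setup are assumptions, not our exact pipeline.
+
+ ```python
+ import io
+
+ from google.cloud import vision
+ from PIL import Image
+ from ultralytics import YOLO
+
+ # hypothetical local filename for the published YOLO layout weights
+ layout_model = YOLO("yolov8_patent_layouts.pt")
+ ocr_client = vision.ImageAnnotatorClient()  # requires Google Cloud credentials
+
+ def ocr_front_page(image_path):
+     """Detect text regions, OCR each crop with Cloud Vision, and concatenate the text."""
+     page = Image.open(image_path)
+     detections = layout_model(image_path)[0]
+     texts = []
+     for x1, y1, x2, y2 in detections.boxes.xyxy.tolist():
+         crop = page.crop((int(x1), int(y1), int(x2), int(y2)))
+         buffer = io.BytesIO()
+         crop.save(buffer, format="PNG")
+         response = ocr_client.document_text_detection(image=vision.Image(content=buffer.getvalue()))
+         texts.append(response.full_text_annotation.text)
+     return "\n".join(texts)
+ ```
+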
+ Our custom dataset has accurate manual labels generated by a graduate student. The final dataset is split 60-20-20 (train-val-test). Front page texts that exceed the model's context window are truncated to the first 512 tokens.
+
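+ The labelled data (`data_title.conll` in this repo) is tagged at the word level, so word tags need to be aligned to subword tokens before training. Below is a minimal sketch of one common alignment scheme (the first subword keeps the word's tag, later subwords get `-100`), with the 512-token truncation applied at the same step; it is illustrative rather than our exact preprocessing code.
+
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-large")
+ label2id = {"B-TITLE": 0, "I-TITLE": 1, "O": 2}
+
+ def encode_example(words, tags, max_length=512):
+     """Tokenize one word-tagged sequence and align tags to subword tokens."""
+     enc = tokenizer(words, is_split_into_words=True, truncation=True, max_length=max_length)
+     labels, previous_word_id = [], None
+     for word_id in enc.word_ids():
+         if word_id is None:                # special tokens (<s>, </s>)
+             labels.append(-100)
+         elif word_id != previous_word_id:  # first subword keeps the word's tag
+             labels.append(label2id[tags[word_id]])
+         else:                              # later subwords are ignored by the loss
+             labels.append(-100)
+         previous_word_id = word_id
+     enc["labels"] = labels
+     return enc
+
+ # toy usage with a made-up tagged fragment
+ words = ["Improvements", "in", "Coin-freed", "Apparatus"]
+ tags = ["B-TITLE", "I-TITLE", "I-TITLE", "I-TITLE"]
+ print(encode_example(words, tags)["labels"])
+ ```
+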
+ ### Evaluation
+
+ Our evaluation metric is F1 at the full entity level. That is, we aggregate adjacently indexed entities into full entities and compute F1 scores that require an exact match. Scores for the test set are below.
+
+ <table>
+ <thead>
+ <tr>
+ <th>Full Entity</th>
+ <th>Precision</th>
+ <th>Recall</th>
+ <th>F1-Score</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td>TITLE</td>
+ <td>93.9%</td>
+ <td>97.5%</td>
+ <td>95.7%</td>
+ </tr>
+ </tbody>
+ </table>
+
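+ Entity-level scores of this kind (including the micro/macro/weighted averages in the attached classification report CSVs) can be computed with [seqeval](https://github.com/chakki-works/seqeval), which aggregates IOB tags into full entities and only credits exact matches. The snippet below is an illustrative sketch, not our exact evaluation script.
+
+ ```python
+ from seqeval.metrics import classification_report, f1_score
+
+ # toy example: gold vs. predicted IOB2 tag sequences (one inner list per document)
+ y_true = [["B-TITLE", "I-TITLE", "I-TITLE", "O", "O"]]
+ y_pred = [["B-TITLE", "I-TITLE", "O", "O", "O"]]
+
+ # seqeval scores complete entities: the partial TITLE prediction counts as a miss
+ print(classification_report(y_true, y_pred, digits=4))
+ print("entity-level F1:", f1_score(y_true, y_pred))
+ ```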
+
+ ## Citation
+
+ If you use our model or custom training/evaluation data in your research, please cite our accompanying paper as follows:
+
+ ```bibtex
+ @article{bct2025,
+   title = {300 Years of British Patents},
+   author = {Enrico Berkes and Matthew Lee Chen and Matteo Tranchero},
+   journal = {arXiv preprint arXiv:2401.12345},
+   year = {2025},
+   url = {https://arxiv.org/abs/2401.12345}
+ }
+ ```
classification_report_lr_6.0000000000e-05_test.csv ADDED
@@ -0,0 +1,5 @@
+ Entity,Precision,Recall,F1-Score,Support
+ TITLE,0.9390,0.9747,0.9565,79
+ micro avg,0.9390,0.9747,0.9565,79
+ macro avg,0.9390,0.9747,0.9565,79
+ weighted avg,0.9390,0.9747,0.9565,79
classification_report_lr_6.0000000000e-05_val.csv ADDED
@@ -0,0 +1,5 @@
+ Entity,Precision,Recall,F1-Score,Support
+ TITLE,0.9625,0.9747,0.9686,79
+ micro avg,0.9625,0.9747,0.9686,79
+ macro avg,0.9625,0.9747,0.9686,79
+ weighted avg,0.9625,0.9747,0.9686,79
config.json ADDED
@@ -0,0 +1,38 @@
+ {
+   "_name_or_path": "xlm-roberta-large",
+   "architectures": [
+     "XLMRobertaForTokenClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "id2label": {
+     "0": "B-TITLE",
+     "1": "I-TITLE",
+     "2": "O"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "label2id": {
+     "B-TITLE": 0,
+     "I-TITLE": 1,
+     "O": 2
+   },
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "xlm-roberta",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "output_past": true,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.44.2",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 250002
+ }
data_title.conll ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4c63cafaa8bd55eb4c4d6557a65a9129199e65e0d61101262f015cc21fa0c3ce
+ size 2235424156
sentencepiece.bpe.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:cfc8146abe2a0488e9e2a0c56de7952f7c11ab059eca145a0a727afce0db2865
+ size 5069051
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "bos_token": "<s>",
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "unk_token": "<unk>"
+ }
test_set_predictions_titles.json ADDED
The diff for this file is too large to render. See raw diff
 
test_titles.csv ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.json ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:3ffb37461c391f096759f4a9bbbc329da0f36952f88bab061fcf84940c022e98
+ size 17082999
tokenizer_config.json ADDED
@@ -0,0 +1,54 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "250001": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": "<mask>",
+   "model_max_length": 512,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "tokenizer_class": "XLMRobertaTokenizer",
+   "unk_token": "<unk>"
+ }
train_titles.csv ADDED
The diff for this file is too large to render. See raw diff
 
val_titles.csv ADDED
The diff for this file is too large to render. See raw diff