---
language:
- en
base_model:
- FacebookAI/xlm-roberta-large
pipeline_tag: token-classification
library_name: transformers
---

# Patent Title Extraction Model

### Model Description

**patent_titles_ner** is a fine-tuned [XLM-RoBERTa-large](https://huggingface.co/FacebookAI/xlm-roberta-large) model trained on a custom dataset of OCR'd front pages of patent specifications published by the British Patent Office and filed between 1617 and 1899. It is trained to recognize the stated titles of inventions.

We start from the original xlm-roberta-large [weights](https://huggingface.co/FacebookAI/xlm-roberta-large/blob/main/pytorch_model.bin) and fine-tune on our custom dataset for 15 epochs with a learning rate of 6e-05 and a batch size of 21. We chose the learning rate by tuning on the validation set.
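
As a rough sketch, the hyperparameters above could be collected as follows. The key names mirror arguments of `transformers.TrainingArguments`. The card does not say whether the batch size of 21 is per device or global, so treating it as per-device is an assumption, as is the use of the `Trainer` API at all.

```python
# Hyperparameters reported above, keyed by the matching
# transformers.TrainingArguments argument names.
# Assumption: the batch size of 21 is per device; the card does not say.
training_config = {
    "learning_rate": 6e-05,
    "per_device_train_batch_size": 21,
    "num_train_epochs": 15,
}

# These could be unpacked into TrainingArguments together with an output directory.
for name, value in training_config.items():
    print(f"{name} = {value}")
```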

### Usage

This model can be used with the Hugging Face Transformers pipeline API for NER:

```python
from transformers import pipeline, AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gbpatentdata/patent_titles_ner")
model = AutoModelForTokenClassification.from_pretrained("gbpatentdata/patent_titles_ner")


def custom_recognizer(text, model=model, tokenizer=tokenizer, device=0):

    # HF ner pipeline (device=0 is the first GPU; pass device=-1 to run on CPU)
    token_level_results = pipeline("ner", model=model, device=device, tokenizer=tokenizer)(text)

    # keep entities tracked
    entities = []
    current_entity = None

    for item in token_level_results:

        tag = item['entity']

        # replace '▁' with a space for easier reading ('▁' is the word-boundary marker added by the XLM-RoBERTa tokenizer)
        word = item['word'].replace('▁', ' ')

        # aggregate I-O-B tagged entities
        if tag.startswith('B-'):

            if current_entity:
                entities.append(current_entity)

            current_entity = {'type': tag[2:], 'text': word.strip(), 'start': item['start'], 'end': item['end']}

        elif tag.startswith('I-'):

            if current_entity and tag[2:] == current_entity['type']:
                current_entity['text'] += word
                current_entity['end'] = item['end']

            else:

                if current_entity:
                    entities.append(current_entity)

                current_entity = {'type': tag[2:], 'text': word.strip(), 'start': item['start'], 'end': item['end']}

        else:
            # deal with O tag
            if current_entity:
                entities.append(current_entity)
            current_entity = None

    if current_entity:
        # add to entities
        entities.append(current_entity)

    # track entity merges
    merged_entities = []

    # merge entities of the same type
    for entity in entities:
        if merged_entities and merged_entities[-1]['type'] == entity['type'] and merged_entities[-1]['end'] == entity['start']:
            merged_entities[-1]['text'] += entity['text']
            merged_entities[-1]['end'] = entity['end']
        else:
            merged_entities.append(entity)

    # clean up extra spaces
    for entity in merged_entities:
        entity['text'] = ' '.join(entity['text'].split())

    # convert to list of dicts
    return [{'class': entity['type'],
             'entity_text': entity['text'],
             'start': entity['start'],
             'end': entity['end']} for entity in merged_entities]



example = """
Date of Application, 1st Aug., 1890-Accepted, 6th Sept., 1890
COMPLETE SPECIFICATION.
Improvements in Coin-freed Apparatus for the Sale of Goods.
I, CHARLES LOTINGA, of 33 Cambridge Street, Lower Grange, Cardiff, in the County of Glamorgan, Gentleman,
do hereby declare the nature of this invention and in what manner the same is to be performed,
to be particularly described and ascertained in and by the following statement
"""

ner_results = custom_recognizer(example)
print(ner_results)
```
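
The `start` and `end` values returned by `custom_recognizer` are character offsets into the input string, so the original span can be recovered by slicing. The following is a minimal sketch: `titles_from_spans` is a hypothetical helper, the `TITLE` label is taken from the evaluation table, and the sample text and offsets are hand-written for illustration rather than real model output.

```python
def titles_from_spans(text, entities, label="TITLE"):
    """Slice the original text with the character offsets of each entity."""
    return [text[e["start"]:e["end"]].strip()
            for e in entities if e["class"] == label]


# Hand-written stand-in for the output of custom_recognizer (not real model output).
sample_text = ("COMPLETE SPECIFICATION.\n"
               "Improvements in Coin-freed Apparatus.\n"
               "I, CHARLES LOTINGA")
title = "Improvements in Coin-freed Apparatus."
start = sample_text.index(title)
sample_entities = [{"class": "TITLE", "entity_text": title,
                    "start": start, "end": start + len(title)}]

print(titles_from_spans(sample_text, sample_entities))
```

Slicing by offset keeps the original spacing and punctuation of the OCR text, whereas `entity_text` is rebuilt from tokens and has its whitespace normalized.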

### Training Data

The custom dataset of front page texts of patent specifications was assembled in the following steps:

1. We fine-tune a YOLO vision [model](https://huggingface.co/gbpatentdata/yolov8_patent_layouts) to detect bounding boxes around text, and use it to identify text regions on the front pages of patent specifications.
2. We use [Google Cloud Vision](https://cloud.google.com/vision?hl=en) to OCR the detected text regions, and then concatenate the OCR text.
3. We randomly sample 200 front-page texts, plus another 201 oversampled from texts that contain either firm or communicant information.

Our custom dataset has accurate manual labels generated by a graduate student. The final dataset is split 60-20-20 (train-val-test). In the event that the front page text is too long, we restrict the text to the first 512 tokens.
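
A 60-20-20 split of this kind can be sketched as below. The helper name `split_60_20_20`, the seed, and the rounding rule (remainder assigned to the test set) are assumptions for illustration; the card does not describe how the split was actually drawn. The count of 401 is the sum of the 200 random and 201 oversampled texts.

```python
import random


def split_60_20_20(items, seed=0):
    """Shuffle and split into train/val/test (60/20/20); the remainder goes to test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(0.6 * len(shuffled))
    n_val = int(0.2 * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


doc_ids = range(401)  # 200 random + 201 oversampled front pages
train, val, test = split_60_20_20(doc_ids)
print(len(train), len(val), len(test))  # prints: 240 80 81
```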

### Evaluation 

Our evaluation metric is F1 at the full-entity level: we aggregate adjacent-indexed entities into full entities and compute F1 scores requiring an exact match. The scores for the test set are below.

<table>
  <thead>
    <tr>
      <th>Full Entity</th>
      <th>Precision</th>
      <th>Recall</th>
      <th>F1-Score</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>TITLE</td>
      <td>93.9%</td>
      <td>97.5%</td>
      <td>95.7%</td>
    </tr>
  </tbody>
</table>
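
For readers who want to reproduce this kind of score, the following is a minimal sketch of exact-match, entity-level precision, recall and F1. Representing each entity as a `(start, end, label)` tuple is an assumption, and `entity_prf` is a hypothetical helper, not the evaluation script used for the table. With several documents, a document identifier would also need to be part of each tuple.

```python
def entity_prf(gold, predicted):
    """Exact-match precision, recall and F1 over sets of (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy spans: the second prediction is one character short, so it does not count.
gold = [(0, 10, "TITLE"), (50, 80, "TITLE")]
pred = [(0, 10, "TITLE"), (50, 79, "TITLE")]
print(entity_prf(gold, pred))  # prints: (0.5, 0.5, 0.5)

# F1 is the harmonic mean of precision and recall; the table values agree with this.
p, r = 0.939, 0.975
print(round(2 * p * r / (p + r), 3))  # prints: 0.957
```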


## Citation 

If you use our model or custom training/evaluation data in your research, please cite our accompanying paper as follows:

```bibtex
@article{bct2025,
  title = {300 Years of British Patents},
  author = {Enrico Berkes and Matthew Lee Chen and Matteo Tranchero},
  journal = {arXiv preprint arXiv:2401.12345},
  year = {2025},
  url = {https://arxiv.org/abs/2401.12345}
}
```