---
language: 
  - grc
base_model:
  - UGARIT/grc-alignment
tags:
  - token-classification
license: mit
inference:
  parameters:
    aggregation_strategy: "first"
widget:
  - text: "ταῦτα εἴπας ὁ Ἀλέξανδρος παρίζει Πέρσῃ ἀνδρὶ ἄνδρα Μακεδόνα ὡς γυναῖκα τῷ λόγῳ · οἳ δέ , ἐπείτε σφέων οἱ Πέρσαι ψαύειν ἐπειρῶντο , διεργάζοντο αὐτούς ."
    example_title: "Example 1"
---
# Named Entity Recognition for Ancient Greek 

A pretrained NER tagging model for Ancient Greek.

# Data
We trained the models on available annotated corpora in Ancient Greek. 
There are only two sizeable annotated datasets in Ancient Greek, which are currently un- der release: the first one by Berti 2023, 
consists of a fully annotated text of Athenaeus’ Deipnosophists, developed in the context of the Digital Athenaeus project. 
The second one by Foka et al. 2020, is a fully annotated text of Pausanias’ Periegesis Hellados, developed in the context of the 
Digital Periegesis project. In addition, we used smaller corpora annotated by students and scholars on Recogito: 
the Odyssey annotated by Kemp 2021; a mixed corpus including excerpts from the Library attributed to Apollodorus and from Strabo’s Geography, 
annotated by Chiara Palladino; Book 1 of Xenophon’s Anabasis, created by Thomas Visser; and Demos- thenes’ Against Neaira, 
created by Rachel Milio.

### Training Dataset
Number of annotated entities per class:

|                | **Person** | **Location** | **NORP**   | **MISC**  |
|----------------|------------|--------------|------------|-----------|
| Odyssey        | 2,469      | 698          | 0          | 0         |
| Deipnosophists | 14,921     | 2,699        | 5,110      | 3,060     |
| Pausanias      | 10,205     | 8,670        | 4,972      | 0         |
| Other Datasets | 3,283      | 2,040        | 1,089      | 0         |
| **Total**      | **30,878** | **14,107**   | **11,171** | **3,060** |

---
### Validation Dataset
Number of annotated entities per class:

|          | **Person** | **Location** | **NORP** | **MISC** |
|----------|------------|--------------|----------|----------|
| Xenophon | 1,190      | 796          | 857      | 0        |

# Results
| Class   | Metric | Test | Validation |
|---------|-----------|--------|--------|
| **LOC**     | precision | 83.33% | 88.66% |
|         | recall    | 81.27% | 88.94% |
|         | f1        | 82.29% | 88.80% |
| **MISC**    | precision | 83.25% | 0      |
|         | recall    | 81.21% | 0      |
|         | f1        | 82.22% | 0      |
| **NORP**    | precision | 88.71% | 94.76% |
|         | recall    | 90.76% | 94.50% |
|         | f1        | 89.73% | 94.63% |
| **PER**     | precision | 91.72% | 94.22% |
|         | recall    | 94.42% | 96.06% |
|         | f1        | 93.05% | 95.13% |
| **Overall** | precision | 88.83% | 92.91% |
|         | recall    | 89.99% | 93.72% |
|         | f1        | 89.41% | 93.32% |
|         | Accuracy  | 97.50% | 98.87% |

MISC scores are zero on the validation set because the validation corpus (Book 1 of Xenophon’s Anabasis) contains no MISC entities, as shown in the validation dataset table above.
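
The model card does not state which tooling produced these span-level scores; a common choice for evaluating BIO-tagged NER output is the `seqeval` library. The sketch below is only illustrative: the gold and predicted tag sequences are invented, not taken from the actual test or validation data.

```python
# Illustrative sketch: span-level NER evaluation with seqeval (assumed tooling,
# not confirmed by the model card). The tag sequences below are made up.
from seqeval.metrics import classification_report

# One list of BIO tags per sentence, aligned token by token.
y_true = [["B-PER", "O", "B-NORP", "O"], ["B-LOC", "I-LOC", "O"]]
y_pred = [["B-PER", "O", "B-NORP", "O"], ["B-LOC", "O", "O"]]

# Per-class and overall precision, recall, and F1 over entity spans.
print(classification_report(y_true, y_pred, digits=4))
```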


# Usage
This [colab notebook](https://colab.research.google.com/drive/1K6ER_C8d_AxBm0Yrtr628P3weH1Rxhht?usp=sharing) contains the necessary code to use the model.
```python
from transformers import pipeline

# create a token-classification (NER) pipeline with first-subword aggregation
ner = pipeline("ner", model="UGARIT/grc-ner-xlmr", aggregation_strategy="first")
ner("ταῦτα εἴπας ὁ Ἀλέξανδρος παρίζει Πέρσῃ ἀνδρὶ ἄνδρα Μακεδόνα ὡς γυναῖκα τῷ λόγῳ · οἳ δέ , ἐπείτε σφέων οἱ Πέρσαι ψαύειν ἐπειρῶντο , διεργάζοντο αὐτούς .")
```

Output
```
[{'entity_group': 'PER',
  'score': 0.9999428,
  'word': '',
  'start': 13,
  'end': 14},
 {'entity_group': 'PER',
  'score': 0.99994195,
  'word': 'Ἀλέξανδρος',
  'start': 14,
  'end': 24},
 {'entity_group': 'NORP',
  'score': 0.9087087,
  'word': 'Πέρσῃ',
  'start': 32,
  'end': 38},
 {'entity_group': 'NORP',
  'score': 0.97572577,
  'word': 'Μακεδόνα',
  'start': 50,
  'end': 59},
 {'entity_group': 'NORP',
  'score': 0.9993412,
  'word': 'Πέρσαι',
  'start': 104,
  'end': 111}]
```
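
Each returned entity carries character offsets into the input string, so the surface form and label can be recovered directly from the text. Below is a minimal post-processing sketch, assuming the `ner` pipeline created above; the 0.5 score threshold is an arbitrary illustration, not a value prescribed by the model card.

```python
# Hypothetical post-processing: filter low-confidence predictions and recover
# each entity's surface form from its character offsets.
text = ("ταῦτα εἴπας ὁ Ἀλέξανδρος παρίζει Πέρσῃ ἀνδρὶ ἄνδρα Μακεδόνα ὡς γυναῖκα "
        "τῷ λόγῳ · οἳ δέ , ἐπείτε σφέων οἱ Πέρσαι ψαύειν ἐπειρῶντο , διεργάζοντο αὐτούς .")

min_score = 0.5  # illustrative threshold, not prescribed by the model card
for ent in ner(text):
    if ent["score"] >= min_score:
        surface = text[ent["start"]:ent["end"]].strip()
        print(f"{surface}\t{ent['entity_group']}\t{ent['score']:.3f}")
```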

# Citation
```
@inproceedings{palladino-yousef-2024-development,
    title = "Development of Robust {NER} Models and Named Entity Tagsets for {A}ncient {G}reek",
    author = "Palladino, Chiara  and
      Yousef, Tariq",
    editor = "Sprugnoli, Rachele  and
      Passarotti, Marco",
    booktitle = "Proceedings of the Third Workshop on Language Technologies for Historical and Ancient Languages (LT4HALA) @ LREC-COLING-2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lt4hala-1.11",
    pages = "89--97",
    abstract = "This contribution presents a novel approach to the development and evaluation of transformer-based models for Named Entity Recognition and Classification in Ancient Greek texts. We trained two models with annotated datasets by consolidating potentially ambiguous entity types under a harmonized set of classes. Then, we tested their performance with out-of-domain texts, reproducing a real-world use case. Both models performed very well under these conditions, with the multilingual model being slightly superior on the monolingual one. In the conclusion, we emphasize current limitations due to the scarcity of high-quality annotated corpora and to the lack of cohesive annotation strategies for ancient languages.",
}
```