Token Classification
spaCy
Tagalog
Commit 871b753 by ljvmiranda921 · verified · 1 parent: dadcd45

Update README.md

Files changed (1): README.md (+38 -2)
@@ -5,8 +5,25 @@ tags:
  language:
  - tl
  license: mit
+ datasets:
+ - UD-Filipino/UD_Tagalog-NewsCrawl
+ - ljvmiranda921/tlunified-ner
+ - SEACrowd/tlunified_ner
+ base_model:
+ - microsoft/mdeberta-v3-base
+ pipeline_tag: token-classification
+ library_name: spacy
  ---
- calamanCy: Tagalog NLP pipelines in spaCy
+
+ <img src="https://raw.githubusercontent.com/ljvmiranda921/calamanCy/refs/heads/master/logo.png" width="130" height="130" align="right" />
+
+ # calamanCy: Tagalog NLP pipelines in spaCy
+
+ This is the latest **transformer-based pipeline** for [calamanCy](https://arxiv.org/abs/2311.07171).
+ Compared to the 0.1.0 version, this pipeline is trained on a larger treebank ([UD-NewsCrawl](https://huggingface.co/datasets/UD-Filipino/UD_Tagalog-NewsCrawl)), with large improvements in dependency parsing, morphological annotation, and POS tagging.
+ This pipeline also implements a neural edit-tree lemmatizer, allowing better lemmatization than the previous model.
+ The training code can be found [in GitHub](https://github.com/ljvmiranda921/calamanCy/tree/master/models/v0.1.0).
+
 
  | Feature | Description |
  | --- | --- |
@@ -33,4 +50,23 @@ calamanCy: Tagalog NLP pipelines in spaCy
  | **`parser`** | `ROOT`, `acl`, `acl:relcl`, `advcl`, `advmod`, `amod`, `appos`, `case`, `cc`, `ccomp`, `compound`, `compound:redup`, `conj`, `dep`, `det`, `discourse`, `dislocated`, `fixed`, `flat`, `goeswith`, `list`, `mark`, `nmod`, `nmod:poss`, `nsubj`, `nummod`, `obj`, `obj:agent`, `obl`, `orphan`, `parataxis`, `punct`, `vocative`, `xcomp` |
  | **`ner`** | `LOC`, `ORG`, `PER` |
 
- </details>
+ </details>
+
+ ### Citation
+
+ If you're using this model, please cite:
+
+ ```
+ @inproceedings{miranda-2023-calamancy,
+ title = "calaman{C}y: A {T}agalog Natural Language Processing Toolkit",
+ author = "Miranda, Lester James",
+ booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
+ month = dec,
+ year = "2023",
+ address = "Singapore",
+ publisher = "Association for Computational Linguistics",
+ url = "https://aclanthology.org/2023.nlposs-1.1/",
+ doi = "10.18653/v1/2023.nlposs-1.1",
+ pages = "1--7",
+ }
+ ```
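
Most of the first hunk is model-card metadata (`datasets`, `base_model`, `pipeline_tag`, `library_name`). As a rough illustration of what the front matter looks like once the commit is applied, the sketch below reads it back with the standard library only. `parse_front_matter` is a hypothetical helper written for this note, not part of spaCy, calamanCy, or the Hub tooling, and it handles only scalars and flat lists rather than full YAML. The `tags:` block that sits above line 5 is not shown in the diff, so it is left out here.

```python
# Front matter as it would read after this commit. The `tags:` block above
# line 5 is not visible in the diff, so it is deliberately omitted.
FRONT_MATTER = """\
---
language:
- tl
license: mit
datasets:
- UD-Filipino/UD_Tagalog-NewsCrawl
- ljvmiranda921/tlunified-ner
- SEACrowd/tlunified_ner
base_model:
- microsoft/mdeberta-v3-base
pipeline_tag: token-classification
library_name: spacy
---
"""


def parse_front_matter(text):
    """Hypothetical helper: parse `key: value` scalars and flat `- item` lists.

    This is a simplified subset of YAML, enough for the fields in this diff.
    """
    metadata = {}
    current_key = None  # key whose list items are currently being collected
    inside = False      # True between the opening and closing `---`
    for raw in text.splitlines():
        line = raw.strip()
        if line == "---":
            if inside:
                break  # closing delimiter reached
            inside = True
            continue
        if not inside or not line:
            continue
        if line.startswith("- ") and current_key is not None:
            # List item belonging to the most recent `key:` line.
            metadata[current_key].append(line[2:].strip())
        elif ":" in line:
            key, _, value = line.partition(":")
            key, value = key.strip(), value.strip()
            if value:
                metadata[key] = value  # scalar field
                current_key = None
            else:
                metadata[key] = []     # list field, items follow
                current_key = key
    return metadata


metadata = parse_front_matter(FRONT_MATTER)
for field in ("datasets", "base_model", "pipeline_tag", "library_name"):
    print(field, "=", metadata[field])
```

For anything beyond a quick check, a real YAML parser would be the safer choice, since front matter can contain nested mappings and quoted strings that this sketch ignores.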