ljvmiranda921/tl_calamancy_trf

Token Classification

ljvmiranda921 committed 24 days ago

Commit 871b753 · verified · 1 Parent(s): dadcd45

Update README.md

Files changed (1)

README.md +38 -2

@@ -5,8 +5,25 @@ tags:
 language:
 - tl
 license: mit
+datasets:
+- UD-Filipino/UD_Tagalog-NewsCrawl
+- ljvmiranda921/tlunified-ner
+- SEACrowd/tlunified_ner
+base_model:
+- microsoft/mdeberta-v3-base
+pipeline_tag: token-classification
+library_name: spacy
 ---
-calamanCy: Tagalog NLP pipelines in spaCy
+<img src="https://raw.githubusercontent.com/ljvmiranda921/calamanCy/refs/heads/master/logo.png" width="130" height="130" align="right" />
+# calamanCy: Tagalog NLP pipelines in spaCy
+This is the latest **transformer-based pipeline** for [calamanCy](https://arxiv.org/abs/2311.07171).
+Compared to the 0.1.0 version, this pipeline is trained on a larger treebank ([UD-NewsCrawl](https://huggingface.co/datasets/UD-Filipino/UD_Tagalog-NewsCrawl)), with large improvements in dependency parsing, morphological annotation, and POS tagging.
+This pipeline also implements a neural edit-tree lemmatizer, allowing better lemmatization than the previous model.
+The training code can be found [in GitHub](https://github.com/ljvmiranda921/calamanCy/tree/master/models/v0.1.0).
 | Feature | Description |
 | --- | --- |
@@ -33,4 +50,23 @@ calamanCy: Tagalog NLP pipelines in spaCy
 | **`parser`** | `ROOT`, `acl`, `acl:relcl`, `advcl`, `advmod`, `amod`, `appos`, `case`, `cc`, `ccomp`, `compound`, `compound:redup`, `conj`, `dep`, `det`, `discourse`, `dislocated`, `fixed`, `flat`, `goeswith`, `list`, `mark`, `nmod`, `nmod:poss`, `nsubj`, `nummod`, `obj`, `obj:agent`, `obl`, `orphan`, `parataxis`, `punct`, `vocative`, `xcomp` |
 | **`ner`** | `LOC`, `ORG`, `PER` |
-</details>
+</details>
+### Citation
+If you're using this model, please cite:
+```
+@inproceedings{miranda-2023-calamancy,
+    title = "calaman{C}y: A {T}agalog Natural Language Processing Toolkit",
+    author = "Miranda, Lester James",
+    booktitle = "Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)",
+    month = dec,
+    year = "2023",
+    address = "Singapore",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2023.nlposs-1.1/",
+    doi = "10.18653/v1/2023.nlposs-1.1",
+    pages = "1--7",
+}
+```
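
The README above describes a spaCy pipeline whose `ner` component emits `LOC`, `ORG`, and `PER`. The following is a minimal sketch, not taken from the repository, of how the entities might be grouped by label once the pipeline is loaded. It assumes the pipeline has been installed as a spaCy package named `tl_calamancy_trf` (taken from the repository name; the actual package name and version may differ). The helper names `group_entities` and `load_pipeline` are hypothetical.

```python
from collections import defaultdict

# Entity labels listed in the `ner` row of the README's label table.
NER_LABELS = ("LOC", "ORG", "PER")


def group_entities(doc):
    """Group entity texts by label for any spaCy-style Doc.

    Only relies on `doc.ents`, where each entity exposes `.text` and `.label_`.
    Labels outside NER_LABELS are skipped.
    """
    grouped = defaultdict(list)
    for ent in doc.ents:
        if ent.label_ in NER_LABELS:
            grouped[ent.label_].append(ent.text)
    return dict(grouped)


def load_pipeline(name="tl_calamancy_trf"):
    """Load the pipeline with spaCy.

    Assumption: the package name matches the repository name and is
    already installed in the current environment.
    """
    import spacy  # imported lazily so group_entities has no dependencies

    return spacy.load(name)


# Usage, assuming spaCy and the pipeline package are installed:
#   nlp = load_pipeline()
#   doc = nlp("Pumunta si Juan sa Maynila.")
#   print(group_entities(doc))
```

Keeping the grouping helper independent of spaCy makes it easy to check against stub objects before downloading the transformer weights.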