---
language:
- km
license: apache-2.0
library_name: transformers
tags:
- generated_from_trainer
datasets:
- seanghay/khPOS
metrics:
- precision
- recall
- f1
- accuracy
widget:
- text: គាត់ផឹកទឹកនៅភ្នំពេញ
- text: តើលោកស្រីបានសាកសួរទៅគាត់ទេ?
- text: នេត្រា មិនដឹងសោះថាអ្នកជាមនុស្ស!
- text: លោក វណ្ណ ម៉ូលីវណ្ណ ជាបិតាស្ថាបត្យកម្មដ៏ល្បីល្បាញរបស់ប្រទេសកម្ពុជានៅក្នុងសម័យសង្គមរាស្ត្រនិយម។
pipeline_tag: token-classification
base_model: xlm-roberta-base
model-index:
- name: khmer-pos-roberta-10
  results:
  - task:
      type: token-classification
      name: Token Classification
    dataset:
      name: kh_pos
      type: kh_pos
      config: default
      split: train
      args: default
    metrics:
    - type: precision
      value: 0.9511876225757245
      name: Precision
    - type: recall
      value: 0.9526407682234832
      name: Recall
    - type: f1
      value: 0.9519136408243376
      name: F1
    - type: accuracy
      value: 0.9735370853522176
      name: Accuracy
---
# Khmer Part of Speech Tagging with XLM RoBERTa
This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) on the [khPOS](https://huggingface.co/datasets/seanghay/khPOS) dataset.
It achieves the following results on the evaluation set:
- Loss: 0.1063
- Precision: 0.9512
- Recall: 0.9526
- F1: 0.9519
- Accuracy: 0.9735
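
As a quick sanity check, the reported F1 can be recomputed from the full-precision precision and recall values in the card metadata. This is a sketch that assumes F1 here is the usual harmonic mean of the two; the card does not state the averaging mode, and `f1_score` is an illustrative helper.

```python
# Full-precision values copied from the model-index metadata of this card.
precision = 0.9511876225757245
recall = 0.9526407682234832
reported_f1 = 0.9519136408243376


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


recomputed_f1 = f1_score(precision, recall)
print(f"recomputed F1 = {recomputed_f1:.4f}, reported F1 = {reported_f1:.4f}")
```
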
## Model description
The [original paper](https://arxiv.org/pdf/2103.16801.pdf) reports 98.15% accuracy, while this model reaches 97.35% on its evaluation set, which is close. Note that the base model is multilingual, so it has a larger token vocabulary than the models in the original paper.
## Intended uses & limitations
This model assigns a part-of-speech tag to each token of Khmer text. The tags can serve as a building block for extracting useful information from Khmer text.
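
A minimal usage sketch with the `transformers` token-classification pipeline is given below. The repository ID in `MODEL_ID` is a placeholder, since this card does not state the full Hub ID, and the helpers `tag` and `format_tags` are hypothetical names introduced only for illustration.

```python
from typing import Dict, List

# Placeholder (assumption): replace with the actual Hub repository ID of this model.
MODEL_ID = "your-namespace/khmer-pos-roberta-10"


def format_tags(predictions: List[Dict]) -> str:
    """Render aggregated pipeline output as space-separated 'word/TAG' pairs."""
    return " ".join(f"{p['word']}/{p['entity_group']}" for p in predictions)


def tag(text: str, model_id: str = MODEL_ID) -> List[Dict]:
    """Run the token-classification pipeline on one Khmer sentence."""
    # Imported lazily so format_tags can be used without transformers installed.
    from transformers import pipeline

    tagger = pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",  # group subword pieces into spans
    )
    return tagger(text)
```

Calling `format_tags(tag("គាត់ផឹកទឹកនៅភ្នំពេញ"))` with a valid repository ID downloads the model on first use and returns the first widget sentence with its predicted tags. With the `"simple"` aggregation strategy, adjacent tokens that receive the same tag may be grouped into a single span.
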
## Training and evaluation data
The khPOS dataset was split into 90% for training and 10% for testing.
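
A 90/10 split can be sketched as follows. The card does not say how the split was made, so the shuffling, the seed of 42 (borrowed from the training seed), the corpus size in the example, and the helper name `split_indices` are all assumptions.

```python
import random
from typing import List, Tuple


def split_indices(
    n: int, test_fraction: float = 0.1, seed: int = 42
) -> Tuple[List[int], List[int]]:
    """Shuffle indices 0..n-1 and split them into (train, test) lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


# Example with a hypothetical corpus of 1,000 sentences.
train_idx, test_idx = split_indices(1000)
print(len(train_idx), len(test_idx))
```
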
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 24
- eval_batch_size: 16
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 10
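
With `lr_scheduler_type: linear`, the learning rate decays linearly from its peak to zero over the course of training. The sketch below reproduces that shape, assuming no warmup (none is listed above) and the 4,500 total steps shown in the training results; `linear_lr` is an illustrative helper, not a library function.

```python
PEAK_LR = 2e-05      # learning_rate from the list above
TOTAL_STEPS = 4500   # final step in the training results table
WARMUP_STEPS = 0     # assumption: the card lists no warmup


def linear_lr(step: int) -> float:
    """Learning rate at `step` under linear warmup, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / max(1, WARMUP_STEPS)
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / max(1, TOTAL_STEPS - WARMUP_STEPS)


# Learning rate at the start, the midpoint (end of epoch 5), and the end.
for step in (0, 2250, 4500):
    print(step, linear_lr(step))
```
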
### Training results
| Training Loss | Epoch | Step | Validation Loss | Precision | Recall | F1 | Accuracy |
|:-------------:|:-----:|:----:|:---------------:|:---------:|:------:|:------:|:--------:|
| No log | 1.0 | 450 | 0.1347 | 0.9314 | 0.9333 | 0.9324 | 0.9603 |
| 0.4834 | 2.0 | 900 | 0.1183 | 0.9407 | 0.9377 | 0.9392 | 0.9653 |
| 0.1323 | 3.0 | 1350 | 0.1026 | 0.9484 | 0.9482 | 0.9483 | 0.9699 |
| 0.095 | 4.0 | 1800 | 0.0986 | 0.9502 | 0.9490 | 0.9496 | 0.9712 |
| 0.0774 | 5.0 | 2250 | 0.0978 | 0.9494 | 0.9491 | 0.9493 | 0.9712 |
| 0.0616 | 6.0 | 2700 | 0.0991 | 0.9493 | 0.9507 | 0.9500 | 0.9715 |
| 0.0494 | 7.0 | 3150 | 0.0989 | 0.9529 | 0.9540 | 0.9534 | 0.9731 |
| 0.0414 | 8.0 | 3600 | 0.1037 | 0.9499 | 0.9501 | 0.9500 | 0.9722 |
| 0.0339 | 9.0 | 4050 | 0.1056 | 0.9516 | 0.9517 | 0.9516 | 0.9734 |
| 0.029 | 10.0 | 4500 | 0.1063 | 0.9512 | 0.9526 | 0.9519 | 0.9735 |
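
The last epoch is not the best on every metric. The following sketch picks out the best epochs from the values above; the tuples are copied from the table and the variable names are illustrative.

```python
# (epoch, validation_loss, f1, accuracy) copied from the table above.
results = [
    (1, 0.1347, 0.9324, 0.9603),
    (2, 0.1183, 0.9392, 0.9653),
    (3, 0.1026, 0.9483, 0.9699),
    (4, 0.0986, 0.9496, 0.9712),
    (5, 0.0978, 0.9493, 0.9712),
    (6, 0.0991, 0.9500, 0.9715),
    (7, 0.0989, 0.9529 and 0.9534, 0.9731)[:0] or (7, 0.0989, 0.9534, 0.9731),
    (8, 0.1037, 0.9500, 0.9722),
    (9, 0.1056, 0.9516, 0.9734),
    (10, 0.1063, 0.9519, 0.9735),
]

lowest_loss = min(results, key=lambda r: r[1])    # smallest validation loss
best_f1 = max(results, key=lambda r: r[2])        # highest F1
best_accuracy = max(results, key=lambda r: r[3])  # highest accuracy

print("lowest validation loss at epoch", lowest_loss[0])
print("highest F1 at epoch", best_f1[0])
print("highest accuracy at epoch", best_accuracy[0])
```

Validation loss bottoms out at epoch 5 and F1 peaks at epoch 7, while the figures reported at the top of this card come from epoch 10, which has the highest accuracy.
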
### Framework versions
- Transformers 4.30.2
- Pytorch 2.0.1+cu118
- Datasets 2.13.1
- Tokenizers 0.13.3