---
library_name: transformers
tags: []
pipeline_tag: text2text-generation
widget:
- text: "Dành cho <extra_id_0> hàng th <extra_id_1>iết khi mua xe tay ga và Super Cub (khách hàng mua xe <extra_id_2>1/2017).</s> 🍓 Mua góp lã <extra_id_3>ất <extra_id_4> dẫn c <extra_id_5> từ <extra_id_6></s> 🍓 Mua góp nhận <extra_id_7> vẹt gốc <extra_id_8></s>"
  example_title: Example 1
---

# 5CD-AI/visocial-T5-base

## Overview

<!-- Provide a quick summary of what the model is/does. -->

We trimmed the vocabulary to 50,589 tokens (a quick tokenizer check is sketched after the list below) and continually pretrained `google/mt5-base` [1] on a merged 20GB dataset. The training data includes:

- Crawled data (100M comments and 15M posts on Facebook)
- UIT data [2], the corpus used to pretrain `uitnlp/visobert` [2]
- mC4 e-commerce data
- 10.7M comments from the VOZ forum, via `tarudesu/VOZ-HSD` [7]
- 3.6M Amazon reviews [3] translated into Vietnamese, from `5CD-AI/Vietnamese-amazon_polarity-gg-translated`
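
As a quick sanity check of the trimmed vocabulary, you can load the tokenizer and inspect its size. A minimal sketch (the exact count may differ slightly depending on how added special tokens are counted):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("5CD-AI/visocial-T5-base")

# Base vocabulary plus any added special tokens (e.g. the <extra_id_*> sentinels)
print(len(tokenizer))  # expected to be around 50,589
```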

Here are the results on three downstream tasks on Vietnamese social media texts: Hate Speech Detection (UIT-HSD), Toxic Speech Detection (ViCTSD), and Hate Spans Detection (ViHOS). Average MF1 is the mean of the three tasks' MF1 scores; a sketch of how the metrics are computed follows the table.

<table>
<tr align="center">
<td rowspan=2><b>Model</b></td>
<td rowspan=2><b>Average MF1</b></td>
<td colspan=3><b>Hate Speech Detection</b></td>
<td colspan=3><b>Toxic Speech Detection</b></td>
<td colspan=3><b>Hate Spans Detection</b></td>
</tr>
<tr align="center">
<td><b>Acc</b></td>
<td><b>WF1</b></td>
<td><b>MF1</b></td>
<td><b>Acc</b></td>
<td><b>WF1</b></td>
<td><b>MF1</b></td>
<td><b>Acc</b></td>
<td><b>WF1</b></td>
<td><b>MF1</b></td>
</tr>
<tr align="center">
<td align="left">PhoBERT[4]</td>
<td>69.63</td>
<td>86.75</td>
<td>86.52</td>
<td>64.76</td>
<td>90.78</td>
<td>90.27</td>
<td>71.31</td>
<td>84.65</td>
<td>81.12</td>
<td>72.81</td>
</tr>
<tr align="center">
<td align="left">PhoBERT_v2[4]</td>
<td>70.50</td>
<td>87.42</td>
<td>87.33</td>
<td>66.60</td>
<td>90.23</td>
<td>89.78</td>
<td>71.39</td>
<td>84.92</td>
<td>81.51</td>
<td>73.51</td>
</tr>
<tr align="center">
<td align="left">viBERT[5]</td>
<td>67.80</td>
<td>86.33</td>
<td>85.79</td>
<td>62.85</td>
<td>88.81</td>
<td>88.17</td>
<td>67.65</td>
<td>84.63</td>
<td>81.28</td>
<td>72.91</td>
</tr>
<tr align="center">
<td align="left">ViSoBERT[6]</td>
<td>75.07</td>
<td>88.17</td>
<td>87.86</td>
<td>67.71</td>
<td>90.35</td>
<td>90.16</td>
<td>71.45</td>
<td>90.16</td>
<td>90.07</td>
<td>86.04</td>
</tr>
<tr align="center">
<td align="left">ViHateT5[7]</td>
<td>75.56</td>
<td>88.76</td>
<td>89.14</td>
<td>68.67</td>
<td>90.80</td>
<td>91.78</td>
<td>71.63</td>
<td>91.00</td>
<td>90.20</td>
<td>86.37</td>
</tr>
<tr align="center">
<td align="left"><b>visocial-T5-base (Ours)</b></td>
<td><b>78.01</b></td>
<td><b>89.51</b></td>
<td><b>89.78</b></td>
<td><b>71.19</b></td>
<td><b>92.20</b></td>
<td><b>93.47</b></td>
<td><b>73.81</b></td>
<td><b>92.57</b></td>
<td><b>92.20</b></td>
<td><b>89.04</b></td>
</tr>
</table>
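
Accuracy (Acc), weighted F1 (WF1), and macro F1 (MF1) follow their standard definitions. Below is a minimal scikit-learn sketch with made-up labels; the exact evaluation scripts of the cited papers may differ:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical gold labels and model predictions for a 3-class task
y_true = [0, 1, 2, 1, 0, 2, 1, 0]
y_pred = [0, 1, 2, 0, 0, 2, 1, 1]

print("Acc:", accuracy_score(y_true, y_pred))
print("WF1:", f1_score(y_true, y_pred, average="weighted"))  # per-class F1 weighted by support
print("MF1:", f1_score(y_true, y_pred, average="macro"))     # unweighted mean of per-class F1
```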

Visocial-T5-base versus other T5-based models on Vietnamese HSD-related tasks, measured by Macro F1-score:

<table border="1" cellspacing="0" cellpadding="5">
<tr align="center">
<td rowspan=2><b>Model</b></td>
<td colspan=3><b>MF1</b></td>
</tr>
<tr align="center">
<td><b>Hate Speech Detection</b></td>
<td><b>Toxic Speech Detection</b></td>
<td><b>Hate Spans Detection</b></td>
</tr>
<tr align="center">
<td align="left">mT5[1]</td>
<td>66.76</td>
<td>69.93</td>
<td>86.60</td>
</tr>
<tr align="center">
<td align="left">ViT5[8]</td>
<td>66.95</td>
<td>64.82</td>
<td>86.90</td>
</tr>
<tr align="center">
<td align="left">ViHateT5[7]</td>
<td>68.67</td>
<td>71.63</td>
<td>86.37</td>
</tr>
<tr align="center">
<td align="left"><b>visocial-T5-base (Ours)</b></td>
<td><b>71.19</b></td>
<td><b>73.81</b></td>
<td><b>89.04</b></td>
</tr>
</table>

## Usage (HuggingFace Transformers)

Install the `transformers` package:

    pip install transformers

Since this is a T5-style model pretrained with span corruption, you can ask it to fill masked spans marked with sentinel tokens (`<extra_id_0>`, `<extra_id_1>`, ...):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_path = "5CD-AI/visocial-T5-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# Mask a span with the <extra_id_0> sentinel and generate the fill
text = "shop làm ăn như cái <extra_id_0>"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```

Note that the decoded output contains sentinel tokens marking each filled span.

## Fine-tune Configuration

We fine-tuned `5CD-AI/visocial-T5-base` on the three downstream tasks with the `transformers` library, using the following configuration (mapped onto `Seq2SeqTrainingArguments` in the sketch after this list):

- seed: 42
- training_epochs: 4
- train_batch_size: 4
- gradient_accumulation_steps: 8
- learning_rate: 3e-4
- lr_scheduler_type: linear
- model_max_length: 256
- metric_for_best_model: eval_loss
- evaluation_strategy: steps
- eval_steps: 0.1
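
A rough illustration of how these settings map onto `transformers`, assuming tokenized datasets are prepared separately (the output directory name is hypothetical, not the authors' script):

```python
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments,
)

model_path = "5CD-AI/visocial-T5-base"
tokenizer = AutoTokenizer.from_pretrained(model_path, model_max_length=256)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

training_args = Seq2SeqTrainingArguments(
    output_dir="visocial-t5-finetuned",  # hypothetical
    seed=42,
    num_train_epochs=4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=3e-4,
    lr_scheduler_type="linear",
    evaluation_strategy="steps",
    eval_steps=0.1,  # interpreted as a fraction of total training steps
    metric_for_best_model="eval_loss",
)
# A Seq2SeqTrainer would then tie together the model, training_args,
# the tokenized train/eval datasets, and a DataCollatorForSeq2Seq.
```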

## References

[1] [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934)

[2] [ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing](https://aclanthology.org/2023.emnlp-main.315/)

[3] [The Amazon Polarity dataset](https://paperswithcode.com/dataset/amazon-polarity-1)

[4] [PhoBERT: Pre-trained language models for Vietnamese](https://aclanthology.org/2020.findings-emnlp.92/)

[5] [Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models](https://arxiv.org/abs/2006.15994)

[6] [ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing](https://aclanthology.org/2023.emnlp-main.315/)

[7] [ViHateT5: Enhancing Hate Speech Detection in Vietnamese With A Unified Text-to-Text Transformer Model](https://arxiv.org/abs/2405.14141)

[8] [ViT5: Pretrained Text-to-Text Transformer for Vietnamese Language Generation](https://aclanthology.org/2022.naacl-srw.18/)