---
library_name: transformers
tags: []
pipeline_tag: text2text-generation
widget:
- text: Dành cho <extra_id_0> hàng th <extra_id_1>iết khi mua xe tay ga  Super Cub (khách hàng mua xe <extra_id_2>1/2017).</s> 🍓 Mua góp  <extra_id_3>ất  <extra_id_4> dẫn c <extra_id_5> từ  <extra_id_6></s> 🍓 Mua góp nhận <extra_id_7> vẹt gốc <extra_id_8></s>
  example_title: Example 1
---

# 5CD-AI/visocial-T5-base
## Overview
We trimmed the vocabulary to 50,589 tokens and continually pretrained `google/mt5-base`[1] on a merged 20GB dataset (a quick vocabulary check is sketched after this list). The training data includes:
- Crawled data (100M comments and 15M posts on Facebook)
- UIT data[2], which was used to pretrain `uitnlp/visobert`[2]
- The e-commerce subset of mC4
- 10.7M comments on the VOZ Forum from `tarudesu/VOZ-HSD`[7]
- 3.6M Amazon reviews[3] translated into Vietnamese, from `5CD-AI/Vietnamese-amazon_polarity-gg-translated`
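
A minimal sketch for checking the trimmed vocabulary with the standard `transformers` auto classes:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_path = "5CD-AI/visocial-T5-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

# Both should reflect the trimmed ~50,589-entry vocabulary
print(len(tokenizer))
print(model.get_input_embeddings().weight.shape[0])
```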
 
Here are the results on 3 downstream tasks on Vietnamese social media texts, namely Hate Speech Detection (UIT-HSD), Toxic Speech Detection (ViCTSD), and Hate Spans Detection (ViHOS):
<table>
        <tr align="center">
            <td rowspan=2><b>Model</td>
            <td rowspan=2><b>Average MF1</td>
            <td colspan=3><b>Hate Speech Detection</td>
            <td colspan=3><b>Toxic Speech Detection</td>
            <td colspan=3><b>Hate Spans Detection</td>
        </tr>
        <tr align="center">
            <td><b>Acc</td>
            <td><b>WF1</td>
            <td><b>MF1</td>
            <td><b>Acc</td>
            <td><b>WF1</td>
            <td><b>MF1</td>
            <td><b>Acc</td>
            <td><b>WF1</td>
            <td><b>MF1</td>
        </tr>
        <tr align="center">
            <td align="left">PhoBERT[4]</td>
            <td>69.63</td>
            <td>86.75</td>
            <td>86.52</td>
            <td>64.76</td>
            <td>90.78</td>
            <td>90.27</td>
            <td>71.31</td>
            <td>84.65</td>
            <td>81.12</td>
            <td>72.81</td>
        </tr>
        <tr align="center">
            <td align="left">PhoBERT_v2[4]</td>
            <td>70.50</td>
            <td>87.42</td>
            <td>87.33</td>
            <td>66.60</td>
            <td>90.23</td>
            <td>89.78</td>
            <td>71.39</td>
            <td>84.92</td>
            <td>81.51</td>
            <td>73.51</td>
        </tr>
        <tr align="center">
            <td align="left">viBERT[5]</td>
            <td>67.80</td>
            <td>86.33</td>
            <td>85.79</td>
            <td>62.85</td>
            <td>88.81</td>
            <td>88.17</td>
            <td>67.65</td>
            <td>84.63</td>
            <td>81.28</td>
            <td>72.91</td>
        </tr>
        <tr align="center">
            <td align="left">ViSoBERT[6]</td>
            <td>75.07</td>
            <td>88.17</td>
            <td>87.86</td>
            <td>67.71</td>
            <td>90.35</td>
            <td>90.16</td>
            <td>71.45</td>
            <td>90.16</td>
            <td>90.07</td>
            <td>86.04</td>
        </tr>
        <tr align="center">
            <td align="left">ViHateT5[7]</td>
            <td>75.56</td>
            <td>88.76</td>
            <td>89.14</td>
            <td>68.67</td>
            <td>90.80</td>
            <td>91.78</td>
            <td>71.63</td>
            <td>91.00</td>
            <td>90.20</td>
            <td>86.37</td>
        </tr>
        <tr align="center">
            <td align="left"><b>visocial-T5-base(Ours)</b></td>
            <td><b>78.01</td>
            <td><b>89.51</td>
            <td><b>89.78</td>
            <td><b>71.19</td>
            <td><b>92.20</td>
            <td><b>93.47</td>
            <td><b>73.81</td>
            <td><b>92.57</td>
            <td><b>92.20</td>
            <td><b>89.04</td>
        </tr>
</table>
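
In both tables, Acc, WF1, and MF1 denote accuracy, weighted F1, and macro F1. A minimal sketch of these metrics with `scikit-learn`, using toy labels (illustrative only, not the authors' evaluation code):

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels for illustration only
y_true = [0, 1, 1, 2, 0, 1]
y_pred = [0, 1, 0, 2, 0, 1]

print("Acc:", accuracy_score(y_true, y_pred))
print("WF1:", f1_score(y_true, y_pred, average="weighted"))  # weighted F1
print("MF1:", f1_score(y_true, y_pred, average="macro"))     # macro F1
```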

visocial-T5-base versus other T5-based models on Vietnamese HSD-related tasks, compared by macro F1-score:

<table border="1" cellspacing="0" cellpadding="5">
    <tr align="center">
        <td rowspan=2><b>Model</b></td>
        <td colspan=3><b>MF1</b></td>
    </tr>
    <tr align="center">
        <td><b>Hate Speech Detection</b></td>
        <td><b>Toxic Speech Detection</b></td>
        <td><b>Hate Spans Detection</b></td>
    </tr>
    <tr align="center">
        <td align="left">mT5[1]</td>
        <td>66.76</td>
        <td>69.93</td>
        <td>86.60</td>
    </tr>
    <tr align="center">
        <td align="left">ViT5[8]</td>
        <td>66.95</td>
        <td>64.82</td>
        <td>86.90</td>
    </tr>
    <tr align="center">
        <td align="left">ViHateT5[7]</td>
        <td>68.67</td>
        <td>71.63</td>
        <td>86.37</td>
    </tr>
    <tr align="center">
        <td align="left"><b>visocial-T5-base(Ours)</td>
        <td><b>71.90</td>
        <td><b>73.81</td>
        <td><b>89.04</td>
    </tr>
</table>

## Usage (HuggingFace Transformers)

Install the `transformers` package:

    pip install transformers

The model was continually pretrained from `google/mt5-base` with sentinel-token span corruption, so you can prompt it to fill `<extra_id_*>` spans (as in the widget example above). A minimal sketch:

```python
from transformers import pipeline

model_path = "5CD-AI/visocial-T5-base"
generator = pipeline("text2text-generation", model=model_path)

# Fill the sentinel span; the input means roughly
# "this shop does business like a <extra_id_0>"
generator("shop làm ăn như cái <extra_id_0>", max_new_tokens=20)
```

## Fine-tune Configuration
We fine-tuned `5CD-AI/visocial-T5-base` on the 3 downstream tasks using the `transformers` library with the following configuration (see the sketch after this list):
- seed: 42
- training_epochs: 4
- train_batch_size: 4
- gradient_accumulation_steps: 8
- learning_rate: 3e-4
- lr_scheduler_type: linear
- model_max_length: 256
- metric_for_best_model: eval_loss
- evaluation_strategy: steps
- eval_steps: 0.1
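
For reference, a minimal sketch of how these hyperparameters map onto `Seq2SeqTrainingArguments`; dataset loading, preprocessing, and the `Seq2SeqTrainer` call are omitted, and `output_dir` is a placeholder:

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainingArguments,
    set_seed,
)

set_seed(42)

model_path = "5CD-AI/visocial-T5-base"
tokenizer = AutoTokenizer.from_pretrained(model_path, model_max_length=256)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

training_args = Seq2SeqTrainingArguments(
    output_dir="visocial-t5-base-finetuned",  # placeholder
    num_train_epochs=4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=3e-4,
    lr_scheduler_type="linear",
    metric_for_best_model="eval_loss",
    evaluation_strategy="steps",
    eval_steps=0.1,   # a float < 1 is a fraction of total training steps
    save_strategy="steps",
    save_steps=0.1,   # must align with eval_steps for best-model tracking
    load_best_model_at_end=True,
)
```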

## References
[1] [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934)

[2] [ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing](https://aclanthology.org/2023.emnlp-main.315/)

[3] [The Amazon Polarity dataset](https://paperswithcode.com/dataset/amazon-polarity-1)

[4] [PhoBERT: Pre-trained language models for Vietnamese](https://aclanthology.org/2020.findings-emnlp.92/)

[5] [Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models](https://arxiv.org/abs/2006.15994)

[6] [ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing](https://aclanthology.org/2023.emnlp-main.315/)

[7] [ViHateT5: Enhancing Hate Speech Detection in Vietnamese With A Unified Text-to-Text Transformer Model](https://arxiv.org/abs/2405.14141)

[8] [ViT5: Pretrained Text-to-Text Transformer for Vietnamese Language Generation](https://aclanthology.org/2022.naacl-srw.18/)