Update README.md
README.md
CHANGED
@@ -6,12 +6,12 @@ tags: []
|
|
6 |
# 5CD-AI/visocial-T5-base
|
7 |
## Overview
|
8 |
<!-- Provide a quick summary of what the model is/does. -->
|
9 |
-
We continually pretrain `google/mt5-base`
|
10 |
- Internal data (100M comments and 15M posts on Facebook)
|
11 |
-
- UIT data
|
12 |
- MC4 ecommerce
|
13 |
- 10.7M comments on VOZ Forum from `tarudesu/VOZ-HSD`
|
14 |
-
- 3.6M reviews from Amazon
|
15 |
|
16 |
Results on three downstream tasks for Vietnamese social media text: Hate Speech Detection (UIT-HSD), Toxic Speech Detection (ViCTSD), and Hate Spans Detection (ViHOS):
|
17 |
<table>
|
@@ -34,7 +34,7 @@ Here are the results on 3 downstream tasks on Vietnamese social media texts, inc
|
|
34 |
<td><b>MF1</b></td>
|
35 |
</tr>
|
36 |
<tr align="center">
|
37 |
-
<td align="left">PhoBERT</td>
|
38 |
<td>69.63</td>
|
39 |
<td>86.75</td>
|
40 |
<td>86.52</td>
|
@@ -47,7 +47,7 @@ Here are the results on 3 downstream tasks on Vietnamese social media texts, inc
|
|
47 |
<td>72.81</td>
|
48 |
</tr>
|
49 |
<tr align="center">
|
50 |
-
<td align="left">PhoBERT_v2</td>
|
51 |
<td>70.50</td>
|
52 |
<td>87.42</td>
|
53 |
<td>87.33</td>
|
@@ -60,7 +60,7 @@ Here are the results on 3 downstream tasks on Vietnamese social media texts, inc
|
|
60 |
<td>73.51</td>
|
61 |
</tr>
|
62 |
<tr align="center">
|
63 |
-
<td align="left">viBERT</td>
|
64 |
<td>67.80</td>
|
65 |
<td>86.33</td>
|
66 |
<td>85.79</td>
|
@@ -73,7 +73,7 @@ Here are the results on 3 downstream tasks on Vietnamese social media texts, inc
|
|
73 |
<td>72.91</td>
|
74 |
</tr>
|
75 |
<tr align="center">
|
76 |
-
<td align="left">ViSoBERT</td>
|
77 |
<td>75.07</td>
|
78 |
<td>88.17</td>
|
79 |
<td>87.86</td>
|
@@ -86,7 +86,7 @@ Here are the results on 3 downstream tasks on Vietnamese social media texts, inc
|
|
86 |
<td>86.04</td>
|
87 |
</tr>
|
88 |
<tr align="center">
|
89 |
-
<td align="left">ViHateT5</td>
|
90 |
<td>75.56</td>
|
91 |
<td>88.76</td>
|
92 |
<td>89.14</td>
|
@@ -149,4 +149,12 @@ We fine-tune `5CD-AI/visocial-T5-base` on 3 downstream tasks with `transformers`
|
|
149 |
|
150 |
[2][ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing](https://aclanthology.org/2023.emnlp-main.315/)
|
151 |
|
152 |
-
[3][The Amazon Polarity dataset](https://paperswithcode.com/dataset/amazon-polarity-1)
|
6 |
# 5CD-AI/visocial-T5-base
|
7 |
## Overview
|
8 |
<!-- Provide a quick summary of what the model is/does. -->
|
9 |
+
We continually pretrain `google/mt5-base`[1] on a merged 20GB dataset. The training data includes:
|
10 |
- Internal data (100M comments and 15M posts on Facebook)
|
11 |
+
- UIT data[2], which was used to pretrain `uitnlp/visobert`[2]
|
12 |
- MC4 ecommerce
|
13 |
- 10.7M comments on VOZ Forum from `tarudesu/VOZ-HSD`
|
14 |
+
- 3.6M Amazon reviews[3], translated into Vietnamese, from `5CD-AI/Vietnamese-amazon_polarity-gg-translated`
|
15 |
|
16 |
Results on three downstream tasks for Vietnamese social media text: Hate Speech Detection (UIT-HSD), Toxic Speech Detection (ViCTSD), and Hate Spans Detection (ViHOS):
|
17 |
<table>
|
|
|
34 |
<td><b>MF1</b></td>
|
35 |
</tr>
|
36 |
<tr align="center">
|
37 |
+
<td align="left">PhoBERT[4]</td>
|
38 |
<td>69.63</td>
|
39 |
<td>86.75</td>
|
40 |
<td>86.52</td>
|
|
|
47 |
<td>72.81</td>
|
48 |
</tr>
|
49 |
<tr align="center">
|
50 |
+
<td align="left">PhoBERT_v2[4]</td>
|
51 |
<td>70.50</td>
|
52 |
<td>87.42</td>
|
53 |
<td>87.33</td>
|
|
|
60 |
<td>73.51</td>
|
61 |
</tr>
|
62 |
<tr align="center">
|
63 |
+
<td align="left">viBERT[5]</td>
|
64 |
<td>67.80</td>
|
65 |
<td>86.33</td>
|
66 |
<td>85.79</td>
|
|
|
73 |
<td>72.91</td>
|
74 |
</tr>
|
75 |
<tr align="center">
|
76 |
+
<td align="left">ViSoBERT[6]</td>
|
77 |
<td>75.07</td>
|
78 |
<td>88.17</td>
|
79 |
<td>87.86</td>
|
|
|
86 |
<td>86.04</td>
|
87 |
</tr>
|
88 |
<tr align="center">
|
89 |
+
<td align="left">ViHateT5[7]</td>
|
90 |
<td>75.56</td>
|
91 |
<td>88.76</td>
|
92 |
<td>89.14</td>
|
|
|
149 |
|
150 |
[2][ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing](https://aclanthology.org/2023.emnlp-main.315/)
|
151 |
|
152 |
+
[3][The Amazon Polarity dataset](https://paperswithcode.com/dataset/amazon-polarity-1)
|
153 |
+
|
154 |
+
[4][PhoBERT: Pre-trained language models for Vietnamese](https://aclanthology.org/2020.findings-emnlp.92/)
|
155 |
+
|
156 |
+
[5][Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models](https://arxiv.org/abs/2006.15994)
|
157 |
+
|
158 |
+
[6][ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing](https://aclanthology.org/2023.emnlp-main.315/)
|
159 |
+
|
160 |
+
[7][ViHateT5: Enhancing Hate Speech Detection in Vietnamese With A Unified Text-to-Text Transformer Model](https://arxiv.org/abs/2405.14141)
|
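The results table in this README reports an `MF1` column for each model. Assuming `MF1` denotes macro-averaged F1 (the README does not spell the abbreviation out), the metric can be sketched as below. The function name `macro_f1` and the toy labels are illustrative assumptions, not part of the model's evaluation code, and the table values appear to be percentages, so the returned fraction would be multiplied by 100 to compare.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("at least one example is required")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels, purely illustrative (not taken from UIT-HSD, ViCTSD, or ViHOS).
y_true = ["clean", "clean", "hate", "hate", "offensive"]
y_pred = ["clean", "hate", "hate", "hate", "offensive"]
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {100 * score:.2f}")
```

Because every class contributes equally regardless of its frequency, macro F1 penalizes a model that neglects rare classes, which is a common reason to report it on imbalanced hate-speech data.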