Update README.md

# 5CD-AI/visocial-T5-base

## Overview

<!-- Provide a quick summary of what the model is/does. -->

We continually pretrain `google/mt5-base` [1] on a merged 20GB dataset that includes:
- Internal data (100M comments and 15M posts on Facebook)
- UIT data, which was used to pretrain `uitnlp/visobert` [2]
- MC4 ecommerce
- 10.7M comments on VOZ Forum from `tarudesu/VOZ-HSD`
- 3.6M Amazon reviews [3] translated into Vietnamese, from `5CD-AI/Vietnamese-amazon_polarity-gg-translated`
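Continual pretraining of an mT5 checkpoint uses T5's span-corruption objective: random spans of the input are replaced by sentinel tokens, and the decoder reconstructs them. The sketch below is illustrative only — `span_corrupt` and its fixed `spans` argument are simplifications of the real objective, which samples spans randomly over subword tokens:

```python
def span_corrupt(tokens, spans):
    """T5-style span corruption (illustrative): each masked span is
    replaced by a sentinel <extra_id_i> in the input; the target lists
    each sentinel followed by the original span, plus an end sentinel.
    `spans` is a sorted list of non-overlapping (start, end) pairs."""
    inp, tgt = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp.extend(tokens[cursor:start])  # keep unmasked prefix
        inp.append(sentinel)              # replace the span with a sentinel
        tgt.append(sentinel)              # target: sentinel + masked tokens
        tgt.extend(tokens[start:end])
        cursor = end
    inp.extend(tokens[cursor:])           # keep unmasked suffix
    tgt.append(f"<extra_id_{len(spans)}>")
    return inp, tgt

toks = "mình thấy sản phẩm này rất ok".split()
inp, tgt = span_corrupt(toks, [(2, 4)])  # mask "sản phẩm"
# inp → ['mình', 'thấy', '<extra_id_0>', 'này', 'rất', 'ok']
# tgt → ['<extra_id_0>', 'sản', 'phẩm', '<extra_id_1>']
```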

Here are the results on 3 downstream tasks on Vietnamese social media text, including Hate Speech Detection (UIT-HSD), Toxic Speech Detection (ViCTSD), and Hate Spans Detection (ViHOS):

<table>

We fine-tune `5CD-AI/visocial-T5-base` on 3 downstream tasks with `transformers`:
- model_max_length: 256
- metric_for_best_model: eval_loss
- evaluation_strategy: steps
- eval_steps: 0.1

## References

[1] [mT5: A massively multilingual pre-trained text-to-text transformer](https://arxiv.org/abs/2010.11934)

[2] [ViSoBERT: A Pre-Trained Language Model for Vietnamese Social Media Text Processing](https://aclanthology.org/2023.emnlp-main.315/)

[3] [The Amazon Polarity dataset](https://paperswithcode.com/dataset/amazon-polarity-1)