Update README.md
README.md
CHANGED
@@ -6,10 +6,12 @@ tags: []
 # 5CD-AI/visocial-T5-base
 ## Overview
 <!-- Provide a quick summary of what the model is/does. -->
-
 - Internal data (100M comments and 15M posts on Facebook)
 - UIT data, which is used to pretrain `uitnlp/visobert`
-- MC4 ecommerce

 Here are the results on 3 downstream tasks on Vietnamese social media texts, including Hate Speech Detection(UIT-HSD), Toxic Speech Detection(ViCTSD), Hate Spans Detection(ViHOS):
 <table>
@@ -127,21 +129,17 @@ model_path = "5CD-AI/visobert-14gb-corpus"
 mask_filler = pipeline("fill-mask", model_path)

 mask_filler("shop làm ăn như cái <mask>", top_k=10)
-```

 ## Fine-tune Configuration
-We fine-tune `5CD-AI/
 - seed: 42
--
--
--
--
--
--
-- metric_for_best_model:
--
-
-And different additional configurations for each task:
-| Emotion Recognition | Hate Speech Detection | Spam Reviews Detection | Hate Speech Spans Detection |
-| --------------------------------------------------------------------------------- | --------------------------------------------------------------------------------- | --------------------------------------------------------------------------------- | --------------------------------------------------------------------------------- |
-|\- train_batch_size: 64<br>\- lr_scheduler_type: linear | \- train_batch_size: 32<br>\- lr_scheduler_type: linear | \- train_batch_size: 32<br>\- lr_scheduler_type: cosine | \- train_batch_size: 32<br>\- lr_scheduler_type: cosine | -->

 # 5CD-AI/visocial-T5-base
 ## Overview
 <!-- Provide a quick summary of what the model is/does. -->
+We continually pretrain `google/mt5-base` on a merged 20GB dataset, the training dataset includes:
 - Internal data (100M comments and 15M posts on Facebook)
 - UIT data, which is used to pretrain `uitnlp/visobert`
+- MC4 ecommerce
+- 10.7M comments on VOZ Forum from `tarudesu/VOZ-HSD`
+- 3.6M reviews from Amazon translated into Vietnamese from `5CD-AI/Vietnamese-amazon_polarity-gg-translated`

 Here are the results on 3 downstream tasks on Vietnamese social media texts, including Hate Speech Detection(UIT-HSD), Toxic Speech Detection(ViCTSD), Hate Spans Detection(ViHOS):
 <table>

 mask_filler = pipeline("fill-mask", model_path)

 mask_filler("shop làm ăn như cái <mask>", top_k=10)
+``` -->

 ## Fine-tune Configuration
+We fine-tune `5CD-AI/visocial-T5-base` on 3 downstream tasks with `transformers` library with the following configuration:
 - seed: 42
+- training_epochs: 4
+- train_batch_size: 4
+- gradient_accumulation_steps: 8
+- learning_rate: 3e-4
+- lr_scheduler_type: linear
+- model_max_length: 256
+- metric_for_best_model: eval_loss
+- evaluation_strategy: steps
+- eval_steps=0.1
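
As a rough illustration of the fine-tune configuration added in this change, the sketch below copies the listed values into a plain Python dictionary and derives two quantities they imply: the effective batch size and the interval between evaluations. It assumes that `train_batch_size` is a per-device value, that training runs on a single device, and that a fractional `eval_steps` is read as a ratio of the total training steps, as the `transformers` `TrainingArguments` documentation describes. The helper names and the 24,000-example dataset size are hypothetical, and the rounding here may differ slightly from what `Trainer` does internally.

```python
import math

# Values copied from the "Fine-tune Configuration" list in the README diff.
finetune_config = {
    "seed": 42,
    "training_epochs": 4,
    "train_batch_size": 4,
    "gradient_accumulation_steps": 8,
    "learning_rate": 3e-4,
    "lr_scheduler_type": "linear",
    "model_max_length": 256,
    "metric_for_best_model": "eval_loss",
    "evaluation_strategy": "steps",
    "eval_steps": 0.1,
}


def effective_batch_size(cfg, num_devices=1):
    """Examples consumed per optimizer step (hypothetical helper).

    Assumes train_batch_size is per device.
    """
    return cfg["train_batch_size"] * cfg["gradient_accumulation_steps"] * num_devices


def schedule(cfg, num_train_examples, num_devices=1):
    """Return (total optimizer steps, optimizer steps between evaluations).

    An approximation: partial batches are rounded up, and an eval_steps value
    below 1 is treated as a fraction of the total number of steps.
    """
    per_step = effective_batch_size(cfg, num_devices)
    steps_per_epoch = math.ceil(num_train_examples / per_step)
    total_steps = steps_per_epoch * cfg["training_epochs"]

    eval_steps = cfg["eval_steps"]
    if eval_steps < 1:
        eval_every = max(1, math.ceil(total_steps * eval_steps))
    else:
        eval_every = int(eval_steps)
    return total_steps, eval_every


if __name__ == "__main__":
    # 24,000 is a made-up dataset size used only to show the arithmetic.
    total, every = schedule(finetune_config, num_train_examples=24_000)
    print("effective batch size:", effective_batch_size(finetune_config))
    print("total optimizer steps:", total)
    print("evaluate every", every, "steps")
```

Under these assumptions, a batch size of 4 with 8 accumulation steps behaves like a batch of 32 per optimizer step, and `eval_steps=0.1` yields roughly ten evaluations over the run, with the checkpoint of lowest `eval_loss` being the one selected.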