Update README.md
README.md CHANGED
@@ -57,9 +57,8 @@ metrics:
---
# RoBERTaLexPT-base

-

-This modelcard aims to be a base template for new models. It has been generated using [this raw template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md?plain=1).

## Model Details

@@ -86,7 +85,8 @@ This modelcard aims to be a base template for new models. It has been generated

### Training Procedure

-

#### Preprocessing [optional]

@@ -95,8 +95,25 @@

#### Training Hyperparameters

-
-

## Evaluation

---
# RoBERTaLexPT-base

+RoBERTaLexPT-base is pretrained from , using [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base), introduced by [Liu et al. (2019)](https://arxiv.org/abs/1907.11692).

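As a quick orientation for readers of the card, here is a minimal sketch of loading a RoBERTa-style checkpoint like this one with Hugging Face Transformers; it is not taken from the model card, and the repository id `user/RoBERTaLexPT-base` is a placeholder, not the model's actual Hub path.

```python
# Hypothetical usage sketch — "user/RoBERTaLexPT-base" is a placeholder repository id.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("user/RoBERTaLexPT-base")
model = AutoModelForMaskedLM.from_pretrained("user/RoBERTaLexPT-base")

# RoBERTa-style masked-language-model inference on a Portuguese sentence.
inputs = tokenizer("O contrato foi assinado pelas <mask>.", return_tensors="pt")
logits = model(**inputs).logits  # one vocabulary distribution per token position
```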

## Model Details


### Training Procedure

+The model was pretrained for 62,500 steps with a batch size of 2048 sequences, each containing a maximum of 512 tokens.
+This computational setup is similar to that of [BERTimbau](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28), exposing the model to approximately 65 billion tokens during training.
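The roughly 65 billion figure follows directly from those numbers; a quick back-of-the-envelope check (an upper bound, since 512 is the maximum, not the average, sequence length):

```python
# Token budget implied by the training setup described above.
steps = 62_500
batch_size = 2_048      # sequences per optimization step
max_seq_len = 512       # tokens per sequence (maximum)

tokens_seen = steps * batch_size * max_seq_len
print(f"{tokens_seen:,}")  # 65,536,000,000 -> roughly 65 billion tokens
```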

#### Preprocessing [optional]


#### Training Hyperparameters

+| **Hyperparameter**      | **RoBERTa-base** |
+|-------------------------|-----------------:|
+| Number of layers        |               12 |
+| Hidden size             |              768 |
+| FFN inner hidden size   |             3072 |
+| Attention heads         |               12 |
+| Attention head size     |               64 |
+| Dropout                 |              0.1 |
+| Attention dropout       |              0.1 |
+| Warmup steps            |               6k |
+| Peak learning rate      |             4e-4 |
+| Batch size              |             2048 |
+| Weight decay            |             0.01 |
+| Maximum training steps  |            62.5k |
+| Learning rate decay     |           Linear |
+| AdamW $$\epsilon$$      |             1e-6 |
+| AdamW $$\beta_1$$       |              0.9 |
+| AdamW $$\beta_2$$       |             0.98 |
+| Gradient clipping       |              0.0 |

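For readers who want to map the table onto code, a hedged sketch of an equivalent Transformers/PyTorch configuration is shown below. It only mirrors the values in the table; it is not the authors' actual training script, and the data pipeline and training loop are omitted.

```python
# Illustrative only: a config/optimizer setup that mirrors the hyperparameter table.
import torch
from transformers import RobertaConfig, RobertaForMaskedLM, get_linear_schedule_with_warmup

config = RobertaConfig(
    num_hidden_layers=12,              # Number of layers
    hidden_size=768,                   # Hidden size (12 heads x 64 per head)
    intermediate_size=3072,            # FFN inner hidden size
    num_attention_heads=12,            # Attention heads
    hidden_dropout_prob=0.1,           # Dropout
    attention_probs_dropout_prob=0.1,  # Attention dropout
)
model = RobertaForMaskedLM(config)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=4e-4,              # peak learning rate
    betas=(0.9, 0.98),    # AdamW beta_1, beta_2
    eps=1e-6,             # AdamW epsilon
    weight_decay=0.01,
)
# Linear decay after 6k warmup steps, over 62.5k total training steps.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=6_000, num_training_steps=62_500
)
# Gradient clipping is listed as 0.0, i.e. effectively disabled.
```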
## Evaluation