Update README.md
---
# RoBERTaLexPT-base
RoBERTaLexPT-base is pretrained following the [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base) architecture and pretraining procedure introduced by [Liu et al. (2019)](https://arxiv.org/abs/1907.11692).
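
Since the checkpoint is a standard RoBERTa masked-language model, it can be loaded with the `transformers` fill-mask pipeline. A minimal usage sketch follows; the repository id and the example sentence are placeholders, not taken from this card.

```python
# Minimal usage sketch (illustrative only): load the checkpoint as a standard
# RoBERTa masked-language model via transformers.
from transformers import pipeline

model_id = "RoBERTaLexPT-base"  # placeholder -- replace with the actual Hub repository id

fill_mask = pipeline("fill-mask", model=model_id)
# RoBERTa-style checkpoints use "<mask>" as the mask token.
print(fill_mask("O contrato foi <mask> pelas partes."))
```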
## Model Details
### Training Procedure
The model was pretrained for 62,500 steps with a batch size of 2,048 sequences, each containing at most 512 tokens.
This computational budget is similar to that of [BERTimbau](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28) and exposes the model to approximately 65 billion tokens during training.
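
As a back-of-the-envelope check, the token budget follows directly from steps × batch size × sequence length, assuming every sequence is packed to the 512-token maximum:

```python
# Back-of-the-envelope token budget for pretraining, assuming fully packed
# 512-token sequences (the true count is slightly lower if sequences are shorter).
steps = 62_500      # maximum training steps
batch_size = 2_048  # sequences per step
seq_len = 512       # maximum tokens per sequence

tokens = steps * batch_size * seq_len
print(f"{tokens:,} tokens")  # 65,536,000,000 -> roughly 65 billion tokens
```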
#### Preprocessing [optional]
#### Training Hyperparameters
| **Hyperparameter**      | **RoBERTa-base** |
|-------------------------|-----------------:|
| Number of layers        | 12               |
| Hidden size             | 768              |
| FFN inner hidden size   | 3072             |
| Attention heads         | 12               |
| Attention head size     | 64               |
| Dropout                 | 0.1              |
| Attention dropout       | 0.1              |
| Warmup steps            | 6k               |
| Peak learning rate      | 4e-4             |
| Batch size              | 2048             |
| Weight decay            | 0.01             |
| Maximum training steps  | 62.5k            |
| Learning rate decay     | Linear           |
| AdamW $$\epsilon$$      | 1e-6             |
| AdamW $$\beta_1$$       | 0.9              |
| AdamW $$\beta_2$$       | 0.98             |
| Gradient clipping       | 0.0              |
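
For readers who want to map the table onto code, the sketch below expresses the same architecture and optimizer settings with `transformers` and PyTorch. It is illustrative only, not the authors' actual pretraining script; the vocabulary size is an assumption that depends on the tokenizer used.

```python
# Illustrative sketch only: the hyperparameters above expressed as a
# transformers RobertaConfig plus an AdamW optimizer with linear decay.
import torch
from transformers import RobertaConfig, RobertaForMaskedLM, get_linear_schedule_with_warmup

config = RobertaConfig(
    vocab_size=50_265,                 # assumption: RoBERTa-base default vocabulary
    num_hidden_layers=12,              # Number of layers
    hidden_size=768,                   # Hidden size (12 heads x 64 per head)
    intermediate_size=3072,            # FFN inner hidden size
    num_attention_heads=12,            # Attention heads
    hidden_dropout_prob=0.1,           # Dropout
    attention_probs_dropout_prob=0.1,  # Attention dropout
    max_position_embeddings=514,       # RoBERTa convention for 512-token sequences
)
model = RobertaForMaskedLM(config)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=4e-4,            # peak learning rate
    betas=(0.9, 0.98),  # AdamW beta_1, beta_2
    eps=1e-6,           # AdamW epsilon
    weight_decay=0.01,  # weight decay
)
# Linear decay after 6k warmup steps over 62.5k total steps;
# gradient clipping is disabled (0.0), so no clip_grad_norm_ call.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=6_000, num_training_steps=62_500
)
```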
## Evaluation