Update README.md
README.md CHANGED
@@ -57,9 +57,8 @@ metrics:
---
# RoBERTaLexPT-base

-

-This modelcard aims to be a base template for new models. It has been generated using [this raw template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md?plain=1).

## Model Details

@@ -86,7 +85,8 @@ This modelcard aims to be a base template for new models. It has been generated

### Training Procedure

-

#### Preprocessing [optional]

@@ -95,8 +95,25 @@

#### Training Hyperparameters

-
-

## Evaluation

---
# RoBERTaLexPT-base

+RoBERTaLexPT-base is pretrained from , using [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base), introduced by [Liu et al. (2019)](https://arxiv.org/abs/1907.11692).

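As a quick orientation for readers of the card, here is a minimal sketch of loading a RoBERTa-style checkpoint like this one with Hugging Face Transformers; it is not taken from the model card, and the repository id `user/RoBERTaLexPT-base` is a placeholder, not the model's actual Hub path.

```python
# Hypothetical usage sketch — "user/RoBERTaLexPT-base" is a placeholder repository id.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("user/RoBERTaLexPT-base")
model = AutoModelForMaskedLM.from_pretrained("user/RoBERTaLexPT-base")

# RoBERTa-style masked-language-model inference on a Portuguese sentence.
inputs = tokenizer("O contrato foi assinado pelas <mask>.", return_tensors="pt")
logits = model(**inputs).logits  # one vocabulary distribution per token position
```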

## Model Details


### Training Procedure

+The model was pretrained for 62,500 steps with a batch size of 2048 sequences, each containing a maximum of 512 tokens.
+This computational setup is similar to that of [BERTimbau](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28), exposing the model to approximately 65 billion tokens during training.
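The roughly 65 billion figure follows directly from those numbers; a quick back-of-the-envelope check (an upper bound, since 512 is the maximum, not the average, sequence length):

```python
# Token budget implied by the training setup described above.
steps = 62_500
batch_size = 2_048      # sequences per optimization step
max_seq_len = 512       # tokens per sequence (maximum)

tokens_seen = steps * batch_size * max_seq_len
print(f"{tokens_seen:,}")  # 65,536,000,000 -> roughly 65 billion tokens
```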

#### Preprocessing [optional]


#### Training Hyperparameters

+| **Hyperparameter**      | **RoBERTa-base** |
+|-------------------------|-----------------:|
+| Number of layers        |               12 |
+| Hidden size             |              768 |
+| FFN inner hidden size   |             3072 |
+| Attention heads         |               12 |
+| Attention head size     |               64 |
+| Dropout                 |              0.1 |
+| Attention dropout       |              0.1 |
+| Warmup steps            |               6k |
+| Peak learning rate      |             4e-4 |
+| Batch size              |             2048 |
+| Weight decay            |             0.01 |
+| Maximum training steps  |            62.5k |
+| Learning rate decay     |           Linear |
+| AdamW $$\epsilon$$      |             1e-6 |
+| AdamW $$\beta_1$$       |              0.9 |
+| AdamW $$\beta_2$$       |             0.98 |
+| Gradient clipping       |              0.0 |

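For readers who want to map the table onto code, a hedged sketch of an equivalent Transformers/PyTorch configuration is shown below. It only mirrors the values in the table; it is not the authors' actual training script, and the data pipeline and training loop are omitted.

```python
# Illustrative only: a config/optimizer setup that mirrors the hyperparameter table.
import torch
from transformers import RobertaConfig, RobertaForMaskedLM, get_linear_schedule_with_warmup

config = RobertaConfig(
    num_hidden_layers=12,              # Number of layers
    hidden_size=768,                   # Hidden size (12 heads x 64 per head)
    intermediate_size=3072,            # FFN inner hidden size
    num_attention_heads=12,            # Attention heads
    hidden_dropout_prob=0.1,           # Dropout
    attention_probs_dropout_prob=0.1,  # Attention dropout
)
model = RobertaForMaskedLM(config)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=4e-4,              # peak learning rate
    betas=(0.9, 0.98),    # AdamW beta_1, beta_2
    eps=1e-6,             # AdamW epsilon
    weight_decay=0.01,
)
# Linear decay after 6k warmup steps, over 62.5k total training steps.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=6_000, num_training_steps=62_500
)
# Gradient clipping is listed as 0.0, i.e. effectively disabled.
```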
## Evaluation