eduagarcia/RoBERTaLexPT-base

@@ -57,7 +57,7 @@ metrics:
 ---
 # RoBERTaLexPT-base
-RoBERTaLexPT-base is pretrained from , using [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base), introduced by [Liu et al. (2019)](https://arxiv.org/abs/1907.11692).
 ## Model Details
@@ -78,23 +78,34 @@ RoBERTaLexPT-base is pretrained from , using [RoBERTa-base](https://huggingface.
 ## Training Details
 ### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-The pretraining process involved training the model for 62,500 steps, with a batch size of 2048 sequences, each containing a maximum of 512 tokens.
 This computational setup is similar to the work of [BERTimbau](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28), exposing the model to approximately 65 billion tokens during training.
-#### Preprocessing [optional]
-[More Information Needed]
 #### Training Hyperparameters
 | **Hyperparameter**     | **RoBERTa-base** |
 |------------------------|-----------------:|
 | Number of layers       |               12 |
@@ -123,9 +134,7 @@ This computational setup is similar to the work of [BERTimbau](https://dl.acm.or
 #### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
 #### Metrics
@@ -142,6 +151,5 @@ This computational setup is similar to the work of [BERTimbau](https://dl.acm.or
 ## Citation
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 [More Information Needed]

 ---
 # RoBERTaLexPT-base
+RoBERTaLexPT-base is pretrained on the LegalPT and CrawlPT corpora, using [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base), introduced by [Liu et al. (2019)](https://arxiv.org/abs/1907.11692).
 ## Model Details
 ## Training Details
 ### Training Data
+RoBERTaLexPT-base is pretrained on both of the following corpora:
+- [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT) is a Portuguese legal corpus built by aggregating diverse sources, totaling up to 125 GiB of data.
+- CrawlPT is a deduplicated aggregation of three general Portuguese corpora: [brWaC](https://huggingface.co/datasets/eduagarcia/brwac_dedup), [CC100-PT](https://huggingface.co/datasets/eduagarcia/cc100-pt), and [OSCAR-2301](https://huggingface.co/datasets/eduagarcia/OSCAR-2301-pt_dedup).
+### Training Procedure
+Our pretraining process was executed using the [Fairseq library](https://arxiv.org/abs/1904.01038) on a DGX-A100 cluster, using a total of two NVIDIA A100 80 GB GPUs.
+The complete training of a single configuration takes approximately three days.
 This computational setup is similar to the work of [BERTimbau](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28), exposing the model to approximately 65 billion tokens during training.
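
The token figure can be checked with a short calculation from the numbers quoted in this card (62,500 steps, a batch size of 2048 sequences, at most 512 tokens per sequence). This is only a sketch: it assumes every sequence is filled to the maximum length, so the result is an upper bound rather than an exact count.

```python
# Figures quoted in this model card.
steps = 62_500
batch_size = 2_048      # sequences per step
max_seq_len = 512       # maximum tokens per sequence

# Upper bound: assumes every sequence is filled to max_seq_len.
max_tokens = steps * batch_size * max_seq_len
print(f"{max_tokens:,} tokens (~{max_tokens / 1e9:.1f} billion)")
```

The product is 65,536,000,000, which is consistent with the "approximately 65 billion tokens" stated above.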
+#### Preprocessing
+Following the approach of [Lee et al. (2022)](http://arxiv.org/abs/2107.06499), we deduplicated all subsets of the LegalPT Corpus using the [MinHash algorithm](https://dl.acm.org/doi/abs/10.5555/647819.736184) and [Locality Sensitive Hashing](https://dspace.mit.edu/bitstream/handle/1721.1/134231/v008a014.pdf?sequence=2&isAllowed=y) to find clusters of duplicate documents.
+To ensure that domain models are not constrained by a generic vocabulary, we used the BPE algorithm from the [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) library to train a separate vocabulary for each pretraining corpus.
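
To illustrate the deduplication idea described above, the following is a minimal pure-Python sketch of MinHash signatures combined with banded Locality Sensitive Hashing. It is not the pipeline used for LegalPT: the signature length, band count, shingle size, and all function names are hypothetical choices made for illustration. It also stops at candidate pairs; grouping pairs into clusters (for example with union-find) is left out.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 64   # hypothetical signature length
BANDS = 16      # hypothetical number of LSH bands (4 rows per band)
SHINGLE = 5     # hypothetical word n-gram size


def shingles(text, n=SHINGLE):
    """Return the set of word n-grams of a document."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(seed, token):
    """Seeded 64-bit hash, standing in for one random permutation."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(text):
    """MinHash signature: the minimum hash of the shingle set per seed."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_PERM))


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def candidate_pairs(signatures):
    """Banded LSH: documents sharing any full band become candidate duplicates."""
    rows = NUM_PERM // BANDS
    pairs = set()
    for b in range(BANDS):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[sig[b * rows:(b + 1) * rows]].append(doc_id)
        for ids in buckets.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs
```

In practice, candidate pairs would be filtered by an estimated-similarity threshold before clustering; that threshold is not stated in this card.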
 #### Training Hyperparameters
+The pretraining process involved training the model for 62,500 steps, with a batch size of 2048 sequences, each containing a maximum of 512 tokens.
+We employed the masked language modeling objective, where 15% of the input tokens were randomly masked.
+The optimization was performed using the AdamW optimizer with a linear warmup and a linear decay learning rate schedule.
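
As a rough illustration of a linear warmup followed by a linear decay, the sketch below computes the learning rate at a given step. The peak learning rate and warmup length used here are hypothetical placeholders, not the values used for this model, and decaying all the way to zero is also an assumption; only the 62,500 total steps come from this card.

```python
def linear_warmup_decay(step, peak_lr, warmup_steps, total_steps):
    """Learning rate at `step`: linear ramp up to peak_lr, then linear decay to 0.

    Assumes 0 < warmup_steps < total_steps.
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


TOTAL_STEPS = 62_500    # from this model card
PEAK_LR = 1e-4          # hypothetical value, for illustration only
WARMUP_STEPS = 5_000    # hypothetical value, for illustration only

for step in (0, 2_500, 5_000, 33_750, 62_500):
    lr = linear_warmup_decay(step, PEAK_LR, WARMUP_STEPS, TOTAL_STEPS)
    print(f"step {step:>6}: lr = {lr:.2e}")
```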
+We adopted the standard [RoBERTa hyperparameters](https://arxiv.org/abs/1907.11692):
 | **Hyperparameter**     | **RoBERTa-base** |
 |------------------------|-----------------:|
 | Number of layers       |               12 |
 #### Testing Data
+The model was evaluated on the [PortuLex benchmark](eduagarcia/portuguese_benchmark), a four-task benchmark designed to evaluate the quality and performance of language models in the Portuguese legal domain.
 #### Metrics
 ## Citation
 [More Information Needed]