Update README.md

---

# RoBERTaLexPT-base

RoBERTaLexPT-base is pretrained on the LegalPT and CrawlPT corpora, using [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base), introduced by [Liu et al. (2019)](https://arxiv.org/abs/1907.11692).
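A minimal usage sketch with the Transformers `fill-mask` pipeline follows; the Hub id used below is an assumption (this card does not state the repository id), so substitute the published checkpoint id.

```python
# Minimal usage sketch. "eduagarcia/RoBERTaLexPT-base" is an assumed Hub id,
# not confirmed by this card; replace it with the actual repository id.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="eduagarcia/RoBERTaLexPT-base")
print(fill_mask("O réu foi condenado ao pagamento de uma <mask> de R$ 10.000,00."))
```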

## Model Details

## Training Details

### Training Data

RoBERTaLexPT-base is pretrained on both corpora:

- [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT) is a Portuguese legal corpus built by aggregating diverse sources, totaling up to 125 GiB of data (see the loading sketch below).
- CrawlPT is a deduplicated combination of three Portuguese general corpora: [brWaC](https://huggingface.co/datasets/eduagarcia/brwac_dedup), [CC100-PT](https://huggingface.co/datasets/eduagarcia/cc100-pt), and [OSCAR-2301](https://huggingface.co/datasets/eduagarcia/OSCAR-2301-pt_dedup).
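Both corpora are hosted on the Hugging Face Hub. Below is a hedged sketch of streaming LegalPT with the `datasets` library; the subset (config) and split names are not listed in this card, so the configs are discovered at runtime rather than hard-coded.

```python
# Sketch: stream LegalPT instead of downloading the full ~125 GiB corpus.
# Subset/config names are not given in this card, so query them first.
from datasets import get_dataset_config_names, load_dataset

configs = get_dataset_config_names("eduagarcia/LegalPT")
print(configs)

# split="train" is assumed; check the dataset card for the actual splits.
legalpt = load_dataset("eduagarcia/LegalPT", configs[0], split="train", streaming=True)
print(next(iter(legalpt)))  # first document of the first subset
```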

### Training Procedure

Our pretraining process was executed using the [Fairseq library](https://arxiv.org/abs/1904.01038) on a DGX-A100 cluster, utilizing a total of 2 Nvidia A100 80 GB GPUs.
The complete training of a single configuration took approximately three days.
This computational setup is similar to that used for [BERTimbau](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28), exposing the model to approximately 65 billion tokens during training.

#### Preprocessing

Following the approach of [Lee et al. (2022)](http://arxiv.org/abs/2107.06499), we deduplicated all subsets of the LegalPT corpus using the [MinHash algorithm](https://dl.acm.org/doi/abs/10.5555/647819.736184) and [Locality-Sensitive Hashing](https://dspace.mit.edu/bitstream/handle/1721.1/134231/v008a014.pdf?sequence=2&isAllowed=y) to find clusters of duplicate documents.
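The card does not name a specific implementation for this step; the sketch below illustrates the MinHash + LSH idea with the `datasketch` library, where the permutation count, shingle length, and similarity threshold are assumptions.

```python
# Illustrative MinHash + LSH near-duplicate detection, not the authors' exact pipeline.
# num_perm, the 5-character shingles, and the 0.7 threshold are assumptions.
from datasketch import MinHash, MinHashLSH

def signature(text: str, num_perm: int = 128, shingle: int = 5) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(text) - shingle + 1, 1)):
        m.update(text[i:i + shingle].encode("utf-8"))
    return m

docs = {
    "doc_a": "O tribunal julgou procedente o pedido do autor.",
    "doc_b": "O tribunal julgou procedente o pedido do autor .",  # near-duplicate
    "doc_c": "Texto completamente diferente sobre outro assunto.",
}

lsh = MinHashLSH(threshold=0.7, num_perm=128)
sigs = {doc_id: signature(text) for doc_id, text in docs.items()}
for doc_id, sig in sigs.items():
    lsh.insert(doc_id, sig)

# Keys returned together form a candidate duplicate cluster.
print(lsh.query(sigs["doc_a"]))  # expected to include "doc_a" and "doc_b"
```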

To ensure that the domain models are not constrained by a generic vocabulary, we used the BPE algorithm from [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) to train a vocabulary for each pretraining corpus used.
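For illustration, a sketch of that vocabulary-training step with the `tokenizers` library; the corpus file names, vocabulary size, and special tokens are assumptions rather than values taken from this card.

```python
# Sketch: train a byte-level BPE vocabulary for one pre-training corpus.
# File names, vocab_size, and special tokens are assumptions, not card values.
import os
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["legalpt_shard_00.txt", "legalpt_shard_01.txt"],   # hypothetical text shards
    vocab_size=50_265,                                        # RoBERTa-sized vocab (assumed)
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

os.makedirs("tokenizer-legalpt", exist_ok=True)
tokenizer.save_model("tokenizer-legalpt")  # writes vocab.json and merges.txt
```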

#### Training Hyperparameters

The pretraining process involved training the model for 62,500 steps, with a batch size of 2,048 sequences, each containing a maximum of 512 tokens.
We employed the masked language modeling objective, where 15% of the input tokens were randomly masked.
The optimization was performed using the AdamW optimizer with a linear warmup and a linear decay learning rate schedule.
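For illustration only, these choices map onto standard PyTorch / Transformers components as sketched below; the 15% masking rate and the 62,500 total steps come from this card, while the learning rate and warmup length are assumptions (this is not the authors' Fairseq setup).

```python
# Sketch of the objective and optimizer described above; not the authors' Fairseq setup.
# The learning rate and warmup length are assumptions; the 0.15 masking probability
# and 62,500 steps are taken from the card.
import torch
from transformers import (
    AutoConfig,
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    get_linear_schedule_with_warmup,
)

tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")  # stand-in tokenizer
config = AutoConfig.from_pretrained("FacebookAI/roberta-base")
model = AutoModelForMaskedLM.from_config(config)  # RoBERTa-base architecture, fresh weights

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4, weight_decay=0.01)  # lr assumed
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=6_000,     # assumed; not visible in this excerpt of the card
    num_training_steps=62_500,  # total pretraining steps from the card
)
```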

We adopted the standard [RoBERTa hyperparameters](https://arxiv.org/abs/1907.11692):

| **Hyperparameter**      | **RoBERTa-base** |
|-------------------------|-----------------:|
| Number of layers        |               12 |

#### Testing Data

The model was evaluated on the ["PortuLex" benchmark](eduagarcia/portuguese_benchmark), a four-task benchmark designed to evaluate the quality and performance of language models in the Portuguese legal domain.
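The four task names are not reproduced in this excerpt; a small sketch that lists the benchmark's configurations from the Hub (the repository id is taken from the link above and assumed to be a `datasets` repository):

```python
# List the PortuLex benchmark tasks published under the id linked above.
from datasets import get_dataset_config_names

print(get_dataset_config_names("eduagarcia/portuguese_benchmark"))
```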
#### Metrics
## Citation

[More Information Needed]