Update README.md
README.md
CHANGED
@@ -57,7 +57,7 @@ metrics:
|
|
57 |
---
|
58 |
# RoBERTaLexPT-base
|
59 |
|
60 |
-
RoBERTaLexPT-base is pretrained from , using [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base), introduced by [Liu et al. (2019)](https://arxiv.org/abs/1907.11692).
|
61 |
|
62 |
|
63 |
## Model Details
|
@@ -78,23 +78,34 @@ RoBERTaLexPT-base is pretrained from , using [RoBERTa-base](https://huggingface.
|
|
78 |
## Training Details
|
79 |
|
80 |
### Training Data
|
81 |
|
82 |
-
|
83 |
|
84 |
-
[
|
|
|
85 |
|
86 |
-
### Training Procedure
|
87 |
|
88 |
-
The pretraining process involved training the model for 62,500 steps, with a batch size of 2048 sequences, each containing a maximum of 512 tokens.
|
89 |
This computational setup is similar to the work of [BERTimbau](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28), exposing the model to approximately 65 billion tokens during training.
|
90 |
|
91 |
-
#### Preprocessing
|
92 |
|
93 |
-
[
|
|
|
|
|
94 |
|
95 |
|
96 |
#### Training Hyperparameters
|
97 |
|
98 |
| **Hyperparameter** | **RoBERTa-base** |
|
99 |
|------------------------|-----------------:|
|
100 |
| Number of layers | 12 |
|
@@ -123,9 +134,7 @@ This computational setup is similar to the work of [BERTimbau](https://dl.acm.or
|
|
123 |
|
124 |
#### Testing Data
|
125 |
|
126 |
-
|
127 |
-
|
128 |
-
[More Information Needed]
|
129 |
|
130 |
#### Metrics
|
131 |
|
@@ -142,6 +151,5 @@ This computational setup is similar to the work of [BERTimbau](https://dl.acm.or
|
|
142 |
|
143 |
## Citation
|
144 |
|
145 |
-
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
|
146 |
|
147 |
[More Information Needed]
|
|
|
57 |
---
|
58 |
# RoBERTaLexPT-base
|
59 |
|
60 |
+
RoBERTaLexPT-base is pretrained on the LegalPT and CrawlPT corpora, using [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base), introduced by [Liu et al. (2019)](https://arxiv.org/abs/1907.11692).
|
61 |
|
62 |
|
63 |
## Model Details
|
|
|
78 |
## Training Details
|
79 |
|
80 |
### Training Data
|
81 |
+
RoBERTaLexPT-base is pretrained on two corpora:
|
82 |
+
- [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT) is a Portuguese legal corpus that aggregates diverse sources, totaling up to 125 GiB of data.
|
83 |
+
- CrawlPT is a deduplicated combination of three Portuguese general corpora: [brWaC](https://huggingface.co/datasets/eduagarcia/brwac_dedup), [CC100-PT](https://huggingface.co/datasets/eduagarcia/cc100-pt), and [OSCAR-2301](https://huggingface.co/datasets/eduagarcia/OSCAR-2301-pt_dedup).
|
84 |
|
85 |
+
### Training Procedure
|
86 |
|
87 |
+
Pretraining was run with the [Fairseq library](https://arxiv.org/abs/1904.01038) on a DGX-A100 cluster, using a total of 2 NVIDIA A100 80 GB GPUs.
|
88 |
+
The complete training of a single configuration takes approximately three days.
|
89 |
|
|
|
90 |
|
|
|
91 |
This computational setup is similar to the work of [BERTimbau](https://dl.acm.org/doi/abs/10.1007/978-3-030-61377-8_28), exposing the model to approximately 65 billion tokens during training.
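
The token figure can be checked against the step count, batch size, and sequence length reported in this README. The snippet below is a quick arithmetic sketch; it assumes every sequence is filled to the 512-token maximum, so the result is an upper bound.

```python
# Figures reported in this README's training description.
steps = 62_500
batch_size = 2_048    # sequences per step
max_seq_len = 512     # maximum tokens per sequence

# Upper bound: assumes every sequence is filled to the maximum length.
tokens_seen = steps * batch_size * max_seq_len
print(f"{tokens_seen:,} tokens (~{tokens_seen / 1e9:.1f} billion)")
# prints "65,536,000,000 tokens (~65.5 billion)"
```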
|
92 |
|
93 |
+
#### Preprocessing
|
94 |
|
95 |
+
Following the approach of [Lee et al. (2022)](http://arxiv.org/abs/2107.06499), we deduplicated all subsets of the LegalPT Corpus using the [MinHash algorithm](https://dl.acm.org/doi/abs/10.5555/647819.736184) and [Locality Sensitive Hashing](https://dspace.mit.edu/bitstream/handle/1721.1/134231/v008a014.pdf?sequence=2&isAllowed=y) to find clusters of duplicate documents.
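
As a rough illustration of the idea, the sketch below implements MinHash signatures with banded Locality Sensitive Hashing in plain Python. It is not the pipeline used for LegalPT: the signature length, band count, shingle size, and all function names are hypothetical choices made for this example.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64   # signature length (hypothetical value)
BANDS = 16      # LSH bands; each band covers NUM_PERM // BANDS rows
SHINGLE = 5     # word n-gram size (hypothetical value)


def shingles(text, n=SHINGLE):
    """Return the set of word n-grams of a document."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(seed, token):
    """Seeded 64-bit hash; each seed stands in for one random permutation."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash(text):
    """MinHash signature: the minimum hash of the shingle set for each seed."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(NUM_PERM))


def candidate_pairs(docs):
    """Return pairs of document ids that share at least one LSH band."""
    rows = NUM_PERM // BANDS
    buckets = defaultdict(set)
    for doc_id, text in docs.items():
        sig = minhash(text)
        for b in range(BANDS):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs


docs = {
    "a": "o tribunal decidiu que o recurso deve ser provido nos termos do voto",
    "b": "o tribunal decidiu que o recurso deve ser provido nos termos do voto",
    "c": "a previsão do tempo indica chuva forte para o fim de semana em Brasília",
}
print(candidate_pairs(docs))
```

Candidate pairs are only probable duplicates; a real pipeline would typically verify them with an exact similarity check and then group them into clusters, for example with connected components.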
|
96 |
+
|
97 |
+
To ensure that domain models are not constrained by a generic vocabulary, we used the BPE algorithm from [HuggingFace Tokenizers](https://github.com/huggingface/tokenizers) to train a vocabulary for each pre-training corpus used.
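
To show what training a BPE vocabulary means, here is a toy character-level merge learner in plain Python. It is a didactic sketch only and does not use or reproduce the HuggingFace Tokenizers library; the function name `train_bpe` and the tie-breaking rule are assumptions made for this example.

```python
from collections import Counter


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` BPE merges from an iterable of text lines."""
    # Each word starts as a tuple of single characters, weighted by frequency.
    words = Counter(tuple(w) for line in corpus for w in line.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges


print(train_bpe(["lei lei lei", "leis"], num_merges=10))
# prints [('l', 'e'), ('le', 'i'), ('lei', 's')]
```

Because the merges depend on corpus statistics, a legal corpus and a general web corpus yield different vocabularies, which is the motivation stated above for training one vocabulary per corpus.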
|
98 |
|
99 |
|
100 |
#### Training Hyperparameters
|
101 |
|
102 |
+
The pretraining process involved training the model for 62,500 steps, with a batch size of 2048 sequences, each containing a maximum of 512 tokens.
|
103 |
+
We used the masked language modeling objective, in which 15% of the input tokens were randomly masked.
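
A minimal sketch of the masking step follows, assuming a `<mask>` placeholder token and an exact 15% sample per sequence. Both are choices made for this illustration: the README does not say how positions are selected, and training frameworks often apply further rules (such as leaving some selected tokens unchanged) that are not modeled here.

```python
import random

MASK_TOKEN = "<mask>"   # placeholder token name (assumption)
MASK_PROB = 0.15        # masking rate stated in the README


def mask_tokens(tokens, rng, mask_prob=MASK_PROB):
    """Mask a random subset of tokens; return (inputs, labels).

    Labels hold the original token at masked positions and None elsewhere,
    so a loss would only be computed where a prediction is required.
    """
    if not tokens:
        return [], []
    n_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    inputs = [MASK_TOKEN if i in positions else t for i, t in enumerate(tokens)]
    labels = [t if i in positions else None for i, t in enumerate(tokens)]
    return inputs, labels


tokens = "o recurso foi provido pelo tribunal por unanimidade de votos".split()
inputs, labels = mask_tokens(tokens, random.Random(0))
print(inputs)
print(labels)
```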
|
104 |
+
The optimization was performed using the AdamW optimizer with a linear warmup and a linear decay learning rate schedule.
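
The shape of a linear warmup followed by a linear decay can be sketched as below. Only the total of 62,500 steps comes from this README; the warmup length and peak learning rate are hypothetical placeholders, not the values used for this model.

```python
TOTAL_STEPS = 62_500   # from the README
WARMUP_STEPS = 1_000   # hypothetical value, for illustration only
PEAK_LR = 1e-4         # hypothetical value, for illustration only


def learning_rate(step, peak_lr=PEAK_LR, warmup=WARMUP_STEPS, total=TOTAL_STEPS):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    if step < warmup:
        return peak_lr * step / warmup
    remaining = max(0, total - step)
    return peak_lr * remaining / (total - warmup)


for step in (0, 500, 1_000, 31_750, 62_500):
    print(step, learning_rate(step))
```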
|
105 |
+
|
106 |
+
We adopted the standard [RoBERTa hyperparameters](https://arxiv.org/abs/1907.11692):
|
107 |
+
|
108 |
+
|
109 |
| **Hyperparameter** | **RoBERTa-base** |
|
110 |
|------------------------|-----------------:|
|
111 |
| Number of layers | 12 |
|
|
|
134 |
|
135 |
#### Testing Data
|
136 |
|
137 |
+
The model was evaluated on the ["PortuLex" benchmark](eduagarcia/portuguese_benchmark), a four-task benchmark designed to evaluate the quality and performance of language models in the Portuguese legal domain.
|
|
|
|
|
138 |
|
139 |
#### Metrics
|
140 |
|
|
|
151 |
|
152 |
## Citation
|
153 |
|
|
|
154 |
|
155 |
[More Information Needed]
|