Update README.md
README.md CHANGED
@@ -15,40 +15,77 @@ model-index:
   - task:
       type: token-classification
     dataset:
-      type:
-      name: LeNER
-      config: LeNER-Br
       split: test
     metrics:
     - type: seqeval
-      value:
-      name:
       args:
         scheme: IOB2
   - task:
       type: token-classification
     dataset:
-      type: eduagarcia/
       name: UlyNER-PL Coarse
       config: UlyssesNER-Br-PL-coarse
       split: test
     metrics:
     - type: seqeval
-      value:
-      name:
       args:
         scheme: IOB2
   - task:
       type: token-classification
     dataset:
-      type: eduagarcia/
       name: UlyNER-PL Fine
       config: UlyssesNER-Br-PL-fine
       split: test
     metrics:
     - type: seqeval
-      value:
-      name:
       args:
         scheme: IOB2
 license: cc-by-4.0
@@ -57,7 +94,7 @@ metrics:
 ---
 # RoBERTaLexPT-base

-RoBERTaLexPT-base is a Portuguese Masked Language Model pretrained from scratch from the [LegalPT](https://huggingface.co/datasets/eduagarcia/

 - **Language(s) (NLP):** Brazilian Portuguese (pt-BR)
 - **License:** [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/deed.en)
@@ -66,7 +103,7 @@ RoBERTaLexPT-base is a Portuguese Masked Language Model pretrained from scratch

 ## Evaluation

-The model was evaluated on ["PortuLex" benchmark](eduagarcia/

 Macro F1-Score (\%) for multiple models evaluated on PortuLex benchmark test splits:

@@ -87,16 +124,16 @@ Macro F1-Score (\%) for multiple models evaluated on PortuLex benchmark test spl
 | RoBERTaTimbau-base (Reproduction of BERTimbau) | 89.68 | 87.53/85.74 | 78.82 | 82.03 | 84.29 |
 | RoBERTaLegalPT-base (Trained on LegalPT) | 90.59 | 85.45/84.40 | 79.92 | 82.84 | 84.57 |
 | RoBERTaCrawlPT-base (Trained on CrawlPT) | 89.24 | 88.22/86.58 | 79.88 | 82.80 | 84.83 |
-| RoBERTaLexPT-base (

 In summary, RoBERTaLexPT consistently achieves top legal NLP effectiveness despite its base size.
-With sufficient pre-training data, it can surpass

 ## Training Details

 RoBERTaLexPT-base is pretrained on two corpora:
-- [LegalPT](https://huggingface.co/datasets/eduagarcia/
-- CrawlPT is a

 ### Training Procedure
   - task:
       type: token-classification
     dataset:
+      type: lener_br
+      name: LeNER-Br
       split: test
     metrics:
     - type: seqeval
+      value: 0.9073
+      name: F1
       args:
         scheme: IOB2
   - task:
       type: token-classification
     dataset:
+      type: eduagarcia/PortuLex_benchmark
       name: UlyNER-PL Coarse
       config: UlyssesNER-Br-PL-coarse
       split: test
     metrics:
     - type: seqeval
+      value: 0.8856
+      name: F1
       args:
         scheme: IOB2
   - task:
       type: token-classification
     dataset:
+      type: eduagarcia/PortuLex_benchmark
       name: UlyNER-PL Fine
       config: UlyssesNER-Br-PL-fine
       split: test
     metrics:
     - type: seqeval
+      value: 0.8603
+      name: F1
+      args:
+        scheme: IOB2
+  - task:
+      type: token-classification
+    dataset:
+      type: eduagarcia/PortuLex_benchmark
+      name: FGV-STF
+      config: fgv-coarse
+      split: test
+    metrics:
+    - type: seqeval
+      value: 0.8040
+      name: F1
+      args:
+        scheme: IOB2
+  - task:
+      type: token-classification
+    dataset:
+      type: eduagarcia/PortuLex_benchmark
+      name: RRIP
+      config: rrip
+      split: test
+    metrics:
+    - type: seqeval
+      value: 0.8322
+      name: F1
+      args:
+        scheme: IOB2
+  - task:
+      type: token-classification
+    dataset:
+      type: eduagarcia/PortuLex_benchmark
+      name: PortuLex
+      split: test
+    metrics:
+    - type: seqeval
+      value: 0.8541
+      name: Average F1
       args:
         scheme: IOB2
 license: cc-by-4.0
 ---
 # RoBERTaLexPT-base

+RoBERTaLexPT-base is a Portuguese Masked Language Model pretrained from scratch from the [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT_dedup) and [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) corpora, using the same architecture as [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base), introduced by [Liu et al. (2019)](https://arxiv.org/abs/1907.11692).

 - **Language(s) (NLP):** Brazilian Portuguese (pt-BR)
 - **License:** [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/deed.en)
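This hunk does not include a usage snippet, so a minimal fill-mask sketch follows for orientation. The repository id `eduagarcia/RoBERTaLexPT-base` is an assumption inferred from the model name and the dataset namespace above; it is not stated in this diff.

```python
from transformers import pipeline

# Repo id is assumed, not stated in this diff: eduagarcia/RoBERTaLexPT-base.
# RoBERTa-style checkpoints expect the <mask> token for masked-token prediction.
fill_mask = pipeline("fill-mask", model="eduagarcia/RoBERTaLexPT-base")

# Example legal-domain sentence in Brazilian Portuguese.
for prediction in fill_mask("O réu foi condenado ao pagamento de <mask> por danos morais."):
    print(prediction["token_str"], round(prediction["score"], 4))
```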

 ## Evaluation

+The model was evaluated on the ["PortuLex" benchmark](https://huggingface.co/datasets/eduagarcia/PortuLex_benchmark), a four-task benchmark designed to evaluate the quality and performance of language models in the Portuguese legal domain.

 Macro F1-Score (\%) for multiple models evaluated on PortuLex benchmark test splits:

 | RoBERTaTimbau-base (Reproduction of BERTimbau) | 89.68 | 87.53/85.74 | 78.82 | 82.03 | 84.29 |
 | RoBERTaLegalPT-base (Trained on LegalPT) | 90.59 | 85.45/84.40 | 79.92 | 82.84 | 84.57 |
 | RoBERTaCrawlPT-base (Trained on CrawlPT) | 89.24 | 88.22/86.58 | 79.88 | 82.80 | 84.83 |
+| **RoBERTaLexPT-base** (Trained on CrawlPT + LegalPT) | **90.73** | **88.56**/86.03 | **80.40** | 83.22 | **85.41** |
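The model-index entries above report seqeval F1 with `scheme: IOB2`. As a reference for how that metric is configured, here is a minimal sketch using the `seqeval` library on toy tag sequences; it is not the PortuLex evaluation script, and the label names are invented.

```python
from seqeval.metrics import f1_score
from seqeval.scheme import IOB2

# Toy IOB2-tagged sequences (invented labels), only to illustrate the metric
# configuration declared in the model-index: seqeval, strict matching, IOB2 scheme.
y_true = [["B-ORG", "I-ORG", "O", "B-PESSOA"]]
y_pred = [["B-ORG", "I-ORG", "O", "O"]]

# Macro-averaged F1, matching the "Macro F1-Score (%)" reported in the table.
print(f1_score(y_true, y_pred, mode="strict", scheme=IOB2, average="macro"))
```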

 In summary, RoBERTaLexPT consistently achieves top legal NLP effectiveness despite its base size.
+With sufficient pre-training data, it can surpass larger models. The results highlight the importance of domain-diverse training data over sheer model scale.

 ## Training Details

 RoBERTaLexPT-base is pretrained on two corpora:
+- [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT_dedup) is a Portuguese legal corpus built by aggregating diverse sources, totaling up to 125GiB of data.
+- [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) is a composition of three general Portuguese corpora: [brWaC](https://huggingface.co/datasets/eduagarcia/brwac_dedup), [CC100-PT](https://huggingface.co/datasets/eduagarcia/cc100-pt), and [OSCAR-2301](https://huggingface.co/datasets/eduagarcia/OSCAR-2301-pt_dedup).
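Both corpora are published as Hugging Face datasets; a minimal sketch of streaming a few documents from LegalPT follows. The default configuration and a `text` column are assumptions, not details stated in this diff.

```python
from datasets import load_dataset

# Streaming avoids downloading the full corpus up front (the card cites up to 125GiB).
# The default config and the "text" column name are assumptions, not stated in this diff.
legalpt = load_dataset("eduagarcia/LegalPT_dedup", split="train", streaming=True)

for i, example in enumerate(legalpt):
    print(example.get("text", "")[:200])  # first 200 characters of each document
    if i == 2:
        break
```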

 ### Training Procedure