Update README.md
README.md
CHANGED
@@ -15,40 +15,77 @@ model-index:
   - task:
       type: token-classification
     dataset:
-      type:
-      name: LeNER
-      config: LeNER-Br
       split: test
     metrics:
     - type: seqeval
-      value:
-      name:
       args:
         scheme: IOB2
   - task:
       type: token-classification
     dataset:
-      type: eduagarcia/
       name: UlyNER-PL Coarse
       config: UlyssesNER-Br-PL-coarse
       split: test
     metrics:
     - type: seqeval
-      value:
-      name:
       args:
         scheme: IOB2
   - task:
       type: token-classification
     dataset:
-      type: eduagarcia/
       name: UlyNER-PL Fine
       config: UlyssesNER-Br-PL-fine
       split: test
     metrics:
     - type: seqeval
-      value:
-      name:
       args:
         scheme: IOB2
 license: cc-by-4.0
@@ -57,7 +94,7 @@ metrics:
 ---
 # RoBERTaLexPT-base

-RoBERTaLexPT-base is a Portuguese Masked Language Model pretrained from scratch from the [LegalPT](https://huggingface.co/datasets/eduagarcia/

 - **Language(s) (NLP):** Brazilian Portuguese (pt-BR)
 - **License:** [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/deed.en)
@@ -66,7 +103,7 @@ RoBERTaLexPT-base is a Portuguese Masked Language Model pretrained from scratch

 ## Evaluation

-The model was evaluated on ["PortuLex" benchmark](eduagarcia/

 Macro F1-Score (\%) for multiple models evaluated on PortuLex benchmark test splits:

@@ -87,16 +124,16 @@ Macro F1-Score (\%) for multiple models evaluated on PortuLex benchmark test spl
 | RoBERTaTimbau-base (Reproduction of BERTimbau) | 89.68 | 87.53/85.74 | 78.82 | 82.03 | 84.29 |
 | RoBERTaLegalPT-base (Trained on LegalPT) | 90.59 | 85.45/84.40 | 79.92 | 82.84 | 84.57 |
 | RoBERTaCrawlPT-base (Trained on CrawlPT) | 89.24 | 88.22/86.58 | 79.88 | 82.80 | 84.83 |
-| RoBERTaLexPT-base (

 In summary, RoBERTaLexPT consistently achieves top legal NLP effectiveness despite its base size.
-With sufficient pre-training data, it can surpass

 ## Training Details

 RoBERTaLexPT-base is pretrained from both data:
-- [LegalPT](https://huggingface.co/datasets/eduagarcia/
-- CrawlPT is a

 ### Training Procedure


@@ -15,40 +15,77 @@ model-index:
   - task:
       type: token-classification
     dataset:
+      type: lener_br
+      name: LeNER-Br
       split: test
     metrics:
     - type: seqeval
+      value: 0.9073
+      name: F1
       args:
         scheme: IOB2
   - task:
       type: token-classification
     dataset:
+      type: eduagarcia/PortuLex_benchmark
       name: UlyNER-PL Coarse
       config: UlyssesNER-Br-PL-coarse
       split: test
     metrics:
     - type: seqeval
+      value: 0.8856
+      name: F1
       args:
         scheme: IOB2
   - task:
       type: token-classification
     dataset:
+      type: eduagarcia/PortuLex_benchmark
       name: UlyNER-PL Fine
       config: UlyssesNER-Br-PL-fine
       split: test
     metrics:
     - type: seqeval
+      value: 0.8603
+      name: F1
+      args:
+        scheme: IOB2
+  - task:
+      type: token-classification
+    dataset:
+      type: eduagarcia/PortuLex_benchmark
+      name: FGV-STF
+      config: fgv-coarse
+      split: test
+    metrics:
+    - type: seqeval
+      value: 0.8040
+      name: F1
+      args:
+        scheme: IOB2
+  - task:
+      type: token-classification
+    dataset:
+      type: eduagarcia/PortuLex_benchmark
+      name: RRIP
+      config: rrip
+      split: test
+    metrics:
+    - type: seqeval
+      value: 0.8322
+      name: F1
+      args:
+        scheme: IOB2
+  - task:
+      type: token-classification
+    dataset:
+      type: eduagarcia/PortuLex_benchmark
+      name: PortuLex
+      split: test
+    metrics:
+    - type: seqeval
+      value: 0.8541
+      name: Average F1
       args:
         scheme: IOB2
 license: cc-by-4.0

@@ -57,7 +94,7 @@ metrics:
 ---
 # RoBERTaLexPT-base

+RoBERTaLexPT-base is a Portuguese Masked Language Model pretrained from scratch on the [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT_dedup) and [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) corpora, using the same architecture as [RoBERTa-base](https://huggingface.co/FacebookAI/roberta-base), introduced by [Liu et al. (2019)](https://arxiv.org/abs/1907.11692).

 - **Language(s) (NLP):** Brazilian Portuguese (pt-BR)
 - **License:** [Creative Commons Attribution 4.0 International Public License](https://creativecommons.org/licenses/by/4.0/deed.en)
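
Because the card above describes a RoBERTa-style masked language model, a minimal fill-mask sketch is shown below. The repository id `eduagarcia/RoBERTaLexPT-base` and the example sentence are illustrative assumptions rather than details stated in the card.

```python
# Minimal fill-mask sketch; the repository id below is an assumption.
from transformers import pipeline

# RoBERTa-style checkpoints use "<mask>" as the mask token.
fill_mask = pipeline("fill-mask", model="eduagarcia/RoBERTaLexPT-base")

for pred in fill_mask("O juiz determinou o <mask> do processo."):
    print(f"{pred['token_str']:>15}  score={pred['score']:.3f}")
```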

@@ -66,7 +103,7 @@

 ## Evaluation

+The model was evaluated on the ["PortuLex" benchmark](eduagarcia/PortuLex_benchmark), a four-task benchmark designed to evaluate the quality and performance of language models in the Portuguese legal domain.

 Macro F1-Score (\%) for multiple models evaluated on PortuLex benchmark test splits:

@@ -87,16 +124,16 @@
 | RoBERTaTimbau-base (Reproduction of BERTimbau) | 89.68 | 87.53/85.74 | 78.82 | 82.03 | 84.29 |
 | RoBERTaLegalPT-base (Trained on LegalPT) | 90.59 | 85.45/84.40 | 79.92 | 82.84 | 84.57 |
 | RoBERTaCrawlPT-base (Trained on CrawlPT) | 89.24 | 88.22/86.58 | 79.88 | 82.80 | 84.83 |
+| **RoBERTaLexPT-base** (Trained on CrawlPT + LegalPT) | **90.73** | **88.56**/86.03 | **80.40** | 83.22 | **85.41** |
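
The rightmost column is consistent with a plain macro average over the four tasks, with the UlyNER-PL coarse and fine scores averaged first; for RoBERTaLexPT-base: (90.73 + (88.56 + 86.03)/2 + 80.40 + 83.22) / 4 ≈ 85.41.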

 In summary, RoBERTaLexPT consistently achieves top legal NLP effectiveness despite its base size.
+With sufficient pre-training data, it can surpass larger models. The results highlight the importance of domain-diverse training data over sheer model scale.
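
The scores above, like the F1 values in the card metadata, are entity-level seqeval F1 under the IOB2 scheme. The snippet below is a minimal sketch of how such a score is computed with the `seqeval` package; the tag sequences are made-up toy inputs, not benchmark data.

```python
# Toy sketch of entity-level F1 with seqeval under the IOB2 scheme.
from seqeval.metrics import f1_score
from seqeval.scheme import IOB2

# Made-up gold and predicted tag sequences for two sentences.
y_true = [["B-ORG", "I-ORG", "O", "B-PER"], ["O", "B-LOC", "I-LOC", "O"]]
y_pred = [["B-ORG", "I-ORG", "O", "B-PER"], ["O", "B-LOC", "O", "O"]]

# mode="strict" counts an entity as correct only when its full span
# and type match the gold annotation exactly.
print(f1_score(y_true, y_pred, mode="strict", scheme=IOB2, average="macro"))
```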

 ## Training Details

 RoBERTaLexPT-base is pretrained on both corpora (a loading sketch follows the list):
+- [LegalPT](https://huggingface.co/datasets/eduagarcia/LegalPT_dedup) is a Portuguese legal corpus built by aggregating diverse sources, totaling up to 125 GiB of data.
+- [CrawlPT](https://huggingface.co/datasets/eduagarcia/CrawlPT_dedup) is a composition of three general Portuguese corpora: [brWaC](https://huggingface.co/datasets/eduagarcia/brwac_dedup), [CC100-PT](https://huggingface.co/datasets/eduagarcia/cc100-pt), [OSCAR-2301](https://huggingface.co/datasets/eduagarcia/OSCAR-2301-pt_dedup).
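
Both corpora are hosted as Hugging Face datasets, so they can be streamed with the `datasets` library as in the sketch below; the `train` split name and the use of default configurations are assumptions, not details stated in the card.

```python
# Sketch of streaming the two pretraining corpora linked above.
# The "train" split and default configurations are assumptions.
from datasets import load_dataset

legalpt = load_dataset("eduagarcia/LegalPT_dedup", split="train", streaming=True)
crawlpt = load_dataset("eduagarcia/CrawlPT_dedup", split="train", streaming=True)

# Peek at one document from each corpus without downloading everything.
print(next(iter(legalpt)))
print(next(iter(crawlpt)))
```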

 ### Training Procedure
