jarodrigues committed
Commit 79aedc8 · verified · 1 Parent(s): efc19b8

Update README.md

Files changed (1)
  1. README.md +28 -19

README.md CHANGED
@@ -102,31 +102,40 @@ And from SuperGLUE, we included these other four tasks:
 Instruction templates have been manually crafted for each task.
 These take the various fields in the dataset and arrange them into a prompt.
- For instance, appending ``Frase 1:'' (Eng.~``Sentence 1:'') before the first sentence of an example in the RTE dataset.
 These templates are listed in full detail in TODO.

- ## Preprocessing

- We filtered the PT-BR corpora using the [BLOOM pre-processing](https://github.com/bigscience-workshop/data-preparation) pipeline.
- We skipped the default filtering of stopwords since it would disrupt the syntactic structure, and also the filtering for language identification given the corpus was pre-selected as Portuguese.

- # Evaluation
-
- The base model version was evaluated on downstream tasks, namely the translations into PT-PT of the English data sets used for a few of the tasks in the widely-used [GLUE benchmark](https://huggingface.co/datasets/glue).
-
- ## GLUE tasks translated

- We resorted to [GLUE-PT](https://huggingface.co/datasets/PORTULAN/glue-ptpt), a **PT-PT version of the GLUE** benchmark.
- We automatically translated the same four tasks from GLUE using [DeepL Translate](https://www.deepl.com/), which specifically provides translation from English to PT-PT as an option.
-
- | Model                    | RTE (Accuracy) | WNLI (Accuracy)| MRPC (F1) | STS-B (Pearson) |
- |--------------------------|----------------|----------------|-----------|-----------------|
- | **Albertina-PT-PT**      | **0.8339**     | 0.4225         | **0.9171**| **0.8801**      |
- | **Albertina-PT-PT base** | 0.6787         | **0.4507**     | 0.8829    | 0.8581          |

 <br>

 # How to use
@@ -135,9 +144,9 @@ You can use this model directly with a pipeline for causal language modeling (CL
135
 
136
  ```python3
137
  >>> from transformers import pipeline
138
- >>> generator = pipeline(model='PORTULAN/gervasio-ptbr-base')
139
- >>> generator("A música brasileira é", max_new_tokens=10)
140
- [{'generated_text': 'A música brasileira é uma das mais ricas do mundo. Ao'}]
141
 
142
 
143
 
 
 Instruction templates have been manually crafted for each task.
 These take the various fields in the dataset and arrange them into a prompt.
 These templates are listed in full detail in TODO.

+ # Training Details

+ We applied supervised fine-tuning with a causal language modeling (CLM) training objective, using a zero-out technique during the fine-tuning process.
+ Specifically, while the entire prompt received attention during fine-tuning, only the response tokens were subjected to back-propagation.
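
As an illustration of this zero-out scheme, the sketch below masks the prompt tokens with the label value -100, which Hugging Face causal-LM models ignore in the cross-entropy loss; the helper function and its names are ours, not taken from the actual training code.

```python3
# Minimal sketch of the zero-out technique (illustrative, not the authors'
# training script): prompt tokens remain in the input and are attended to,
# but their labels are set to -100 so that only the response tokens are
# back-propagated.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PORTULAN/gervasio-ptpt-decoder")

def build_example(prompt: str, response: str, max_length: int = 512):
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    input_ids = (prompt_ids + response_ids)[:max_length]
    # Zero out the prompt: -100 labels contribute nothing to the loss.
    labels = ([-100] * len(prompt_ids) + response_ids)[:max_length]
    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.ones(len(input_ids), dtype=torch.long),
        "labels": torch.tensor(labels),
    }
```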

+ In terms of hyper-parameters, both models were trained with a learning rate of 2 * 10^-5, a weight decay of 0.1, and a two-epoch training regime without warm-up. To ensure the same number of tokens was back-propagated per step, we employed an input sequence of 512 tokens with a batch size of 16 and 16 accumulation steps.
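
For reference, these hyper-parameters map onto transformers.TrainingArguments roughly as sketched below; the output directory name and the reading of the batch of 16 as a per-device setting are our assumptions, and the actual training script is not part of this repository.

```python3
# Rough mapping of the reported hyper-parameters (a sketch, not the
# authors' script); the 512-token input length is enforced at
# tokenization time rather than here.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gervasio-finetune",   # hypothetical output directory
    learning_rate=2e-5,               # 2 * 10^-5
    weight_decay=0.1,
    num_train_epochs=2,
    warmup_steps=0,                   # no warm-up
    per_device_train_batch_size=16,   # assumption: 16 per device
    gradient_accumulation_steps=16,
)
```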

+ Due to hardware limitations that imposed a shorter sequence length (512) than the base model's (4096), instead of the typical practice of concatenating all training examples and then dividing them into batches of the same input sequence length, we kept each example separate.
+ In other words, each example occupies the full input sequence length.

+ To achieve this, we adapted the tokenizer of the base model to accept padding, which allows grouping examples of different sizes into batches while preserving the original input sequence length.
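
One common way to obtain this padding behaviour is sketched below; reusing the end-of-sequence token as the padding token is our assumption, not necessarily the exact adaptation applied to the base model's tokenizer.

```python3
# Sketch: give the tokenizer a padding token and pad every example to the
# fixed 512-token input sequence length (illustrative only).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PORTULAN/gervasio-ptpt-decoder")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse EOS as padding

batch = tokenizer(
    ["A música portuguesa é", "A música portuguesa é uma das mais ricas do mundo"],
    padding="max_length",
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # torch.Size([2, 512])
```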

+ For the model training process, we resorted to an a2-megagpu-16gb Google Cloud A2 VM, equipped with 16 GPUs, 96 vCPUs, and 1,360 GB of RAM.
+ The training of each model took approximately two hours.

+ # Evaluation

+ For testing, we reserved the translated datasets MRPC (similarity) and RTE (inference) from GLUE, and COPA (reasoning/QA) from SuperGLUE, which were taken as representative of three major types of tasks and were not seen during training.
+ We also employed data augmentation techniques to enhance the size and diversity of our dataset.
+ This involved repurposing the tasks in various ways, such as answer generation from MultiRC, question generation from BoolQ, and other relevant modifications.
+
+ +++ PT-BR only +++
+ For further testing of our decoder, in addition to the testing data described above, we also reused some of the datasets that had been resorted to for American Portuguese
+ to test the state-of-the-art Sabiá model and that were originally developed with materials in Portuguese:
+ ASSIN 2 RTE (entailment), ASSIN 2 STS (similarity), BLUEX (question answering), ENEM 2022 (question answering) and FaQuAD (extractive question answering).
+ +++
+
+ | Model                    | MRPC (F1)      | RTE (F1)       | COPA (F1) |
+ |--------------------------|----------------|----------------|-----------|
+ | **Gervásio 7B PT-PT**    | **0.7273**     | **0.8291**     | **0.5459**|
+ | **LLaMA 2**              | 0.0328         | 0.0482         | 0.3844    |
+ | **LLaMA 2 Chat**         | 0.5703         | 0.4697         | 0.4737    |
  <br>

 # How to use

 ```python3
 >>> from transformers import pipeline
+ >>> generator = pipeline(model='PORTULAN/gervasio-ptpt-decoder')
+ >>> generator("A música portuguesa é", max_new_tokens=10)
+ [{'generated_text': 'A música portuguesa é uma das mais ricas do mundo'}]