jarodrigues committed
Commit bff6b01 · verified · 1 Parent(s): 3766ff6

Update README.md

Files changed (1)
  1. README.md +11 -10
README.md CHANGED
@@ -107,22 +107,23 @@ These datasets were machine translated into Portuguese and from the [extraGLUE](
  Furthermore, instruction templates have been manually crafted for each task.
  These take the various fields in the dataset and arrange them into prompts, which were collected into the [extraGLUE-instruct](https://huggingface.co/datasets/PORTULAN/extraglue-instruct) dataset.

+ We also employed data augmentation techniques to enhance the size and diversity of our dataset.
+ This involved repurposing the tasks in various ways, such as generation of answers from MultiRC, question generation from BoolQ, and other relevant modifications.
+
+
  # Training Details

- We applied supervised fine-tuning with a causal language modeling (CLM) training objective following a zero-out technique during the fine-tuning process.
+ We applied supervised fine-tuning with a causal language modeling training objective following a zero-out technique during the fine-tuning process.
  Specifically, while the entire prompt received attention during fine-tuning, only the response tokens were subjected to back-propagation.

- In terms of hyper-parameters, both models were trained with a learning rate of 2 * 10^-5, a weight decay of 0.1, a two-epoch training regime without warm-up, and to ensure the same number of tokens back-propagated per step, we employed an input sequence of 512 tokens with a batch size of 16 and 16 accumulation steps.
+ In terms of hyper-parameters, the model was trained with a learning rate of 2 * 10^-5, a weight decay of 0.1, and a two-epoch training regime without warm-up. To ensure the same number of tokens back-propagated per step, we employed an input sequence of 512 tokens with a batch size of 16 and 16 accumulation steps.

- Due to hardware limitations that imposed a shorter sequence length (512) compared to the base model (4096), instead of the typical practice of concatenating all training examples and then dividing them into batches with the same input sequence length, we separate each example individually.
+ Due to hardware limitations that imposed a shorter sequence length (512) compared to the base model (4096), instead of the typical practice of concatenating all training examples and then dividing them into batches with the same input sequence length, we separated each example individually.
  In other words, each example occupies the full input sequence length.

- To achieve this, we adapted the tokenizer of the base model to accept padding to allow grouping examples with different size into batches while preserving the original input sequence length.

- For the model training process, we resorted to an a2-megagpu-16gb Google Cloud A2 VM, equipped with 16 GPUs, 96 vCPUs, and 1.360 GB of RAM.
- The training of each model took approximately two hours.

- # Evaluation
+ # Performance

  For testing, we reserved the translated datasets MRPC (similarity) and RTE (inference), from GLUE, and COPA (reasoning/qa), from SuperGLUE, which were taken as representatives of three major types of tasks and were not seen during training.
  We also employ data augmentation techniques to enhance the size and diversity of our dataset.
@@ -131,9 +132,9 @@ This involves repurposing the tasks in various ways, such as generation of answe

  | Model | MRPC (F1) | RTE (F1) | COPA (F1) |
  |--------------------------|----------------|----------------|-----------|
- | **Gervásio 7B PT-PT** | **0.7273** | **0.8291** | **0.5459**|
- | **LLaMA-2** | 0.0328 | 0.0482 | 0.3844 |
- | **LLaMA-2 Chat** | 0.5703 | 0.4697 | 0.4737 |
+ | **Gervásio 7B PTPT** | **0.7273** | **0.8291** | **0.5459** |
+ | **LLaMA-2 (English)** | 0.0328 | 0.0482 | 0.3844 |
+ | **LLaMA-2 Chat (English)** | 0.5703 | 0.4697 | 0.4737 |
  <br>

  # How to use
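To make the "zero-out" objective and the per-example padding described under Training Details above concrete, here is a minimal sketch, assuming a LLaMA-2 7B base checkpoint and an EOS-as-padding workaround (neither is stated in this commit): the whole prompt stays visible through the attention mask, but prompt and padding positions are labelled `-100`, which the standard cross-entropy loss ignores, so only response tokens are back-propagated.

```python
# Hedged sketch of the "zero-out" objective: prompt tokens are attended to but
# excluded from the loss by labelling them -100 (ignored by cross-entropy).
# The base checkpoint name and the pad-token choice are assumptions for illustration.
import torch
from transformers import AutoTokenizer

BASE_MODEL = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint
MAX_LEN = 512                            # input sequence length used above

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:
    # LLaMA-2's tokenizer ships without a padding token; reusing EOS is one
    # common way to adapt it to accept padding, as the card describes.
    tokenizer.pad_token = tokenizer.eos_token

def build_example(prompt: str, response: str) -> dict:
    """Tokenize one prompt/response pair so it fills the full 512-token input."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]

    input_ids = prompt_ids + response_ids + [tokenizer.eos_token_id]
    labels = [-100] * len(prompt_ids) + response_ids + [tokenizer.eos_token_id]

    # Each example occupies the whole sequence length: truncate, then right-pad to 512.
    input_ids, labels = input_ids[:MAX_LEN], labels[:MAX_LEN]
    pad = MAX_LEN - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * pad
    input_ids = input_ids + [tokenizer.pad_token_id] * pad
    labels = labels + [-100] * pad  # padding positions are also zeroed out of the loss

    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.tensor(attention_mask),
        "labels": torch.tensor(labels),
    }
```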
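The stated hyper-parameters map onto a standard Hugging Face `TrainingArguments` configuration roughly as follows. This is only a sketch of how the reported numbers fit together; the commit does not say which training framework was actually used, and the output path is a placeholder.

```python
# Rough mapping of the reported hyper-parameters onto Hugging Face TrainingArguments.
# Illustrative sketch only, not the authors' actual configuration.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="gervasio-sft",         # placeholder path
    learning_rate=2e-5,                # 2 * 10^-5
    weight_decay=0.1,
    num_train_epochs=2,                # two-epoch regime
    warmup_steps=0,                    # no warm-up
    per_device_train_batch_size=16,    # batch size of 16
    gradient_accumulation_steps=16,    # 16 accumulation steps
)
# With every example fixed at 512 tokens, each optimisation step then sees the
# same number of input tokens per device: 512 * 16 * 16 = 131,072.
```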
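For the held-out tasks, the table reports F1. Once predictions have been collected for a test split such as the translated MRPC, a score of that kind can be computed along these lines; the gold and predicted labels below are placeholders, and the exact prompting and answer-extraction protocol is not specified in this excerpt.

```python
# Hedged sketch: an F1 score of the kind reported in the table, computed from
# placeholder gold labels and placeholder labels parsed from model outputs.
from sklearn.metrics import f1_score

gold = [1, 0, 1, 1, 0, 1]   # placeholder gold labels (e.g. paraphrase / not, for MRPC)
pred = [1, 0, 0, 1, 0, 1]   # placeholder predictions

print(f"F1 = {f1_score(gold, pred):.4f}")
```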