jarodrigues committed
Commit bff6b01 · verified · 1 Parent(s): 3766ff6

Update README.md

Files changed (1)
  1. README.md +11 -10
README.md CHANGED
@@ -107,22 +107,23 @@ These datasets were machine translated into Portuguese and from the [extraGLUE](
  Furthermore, instruction templates have been manually crafted for each task.
  These take the various fields in the dataset and arrange them into prompts, which were collected into the [extraGLUE-instruct](https://huggingface.co/datasets/PORTULAN/extraglue-instruct) dataset.

+ We also employed data augmentation techniques to enhance the size and diversity of our dataset.
+ This involved repurposing the tasks in various ways, such as generation of answers from MultiRC, question generation from BoolQ, and other relevant modifications.
+
+
  # Training Details

- We applied supervised fine-tuning with a causal language modeling (CLM) training objective following a zero-out technique during the fine-tuning process.
+ We applied supervised fine-tuning with a causal language modeling training objective following a zero-out technique during the fine-tuning process.
  Specifically, while the entire prompt received attention during fine-tuning, only the response tokens were subjected to back-propagation.

- In terms of hyper-parameters, both models were trained with a learning rate of 2 * 10^-5, a weight decay of 0.1, a two-epoch training regime without warm-up, and to ensure the same number of tokens back-propagated per step, we employed an input sequence of 512 tokens with a batch size of 16 and 16 accumulation steps.
+ In terms of hyper-parameters, the model was trained with a learning rate of 2 * 10^-5, a weight decay of 0.1, and a two-epoch training regime without warm-up. To ensure the same number of tokens back-propagated per step, we employed an input sequence of 512 tokens with a batch size of 16 and 16 accumulation steps.

- Due to hardware limitations that imposed a shorter sequence length (512) compared to the base model (4096), instead of the typical practice of concatenating all training examples and then dividing them into batches with the same input sequence length, we separate each example individually.
+ Due to hardware limitations that imposed a shorter sequence length (512) compared to the base model (4096), instead of the typical practice of concatenating all training examples and then dividing them into batches with the same input sequence length, we separated each example individually.
  In other words, each example occupies the full input sequence length.

- To achieve this, we adapted the tokenizer of the base model to accept padding to allow grouping examples with different size into batches while preserving the original input sequence length.

- For the model training process, we resorted to an a2-megagpu-16gb Google Cloud A2 VM, equipped with 16 GPUs, 96 vCPUs, and 1.360 GB of RAM.
- The training of each model took approximately two hours.

- # Evaluation
+ # Performance

  For testing, we reserved the translated datasets MRPC (similarity) and RTE (inference), from GLUE, and COPA (reasoning/qa), from SuperGLUE, which were taken as representatives of three major types of tasks and were not seen during training.
  We also employ data augmentation techniques to enhance the size and diversity of our dataset.
@@ -131,9 +132,9 @@ This involves repurposing the tasks in various ways, such as generation of answe

  | Model | MRPC (F1) | RTE (F1) | COPA (F1) |
  |--------------------------|----------------|----------------|-----------|
- | **Gervásio 7B PT-PT** | **0.7273** | **0.8291** | **0.5459**|
- | **LLaMA-2** | 0.0328 | 0.0482 | 0.3844 |
- | **LLaMA-2 Chat** | 0.5703 | 0.4697 | 0.4737 |
+ | **Gervásio 7B PTPT** | **0.7273** | **0.8291** | **0.5459** |
+ | **LLaMA-2 (English)** | 0.0328 | 0.0482 | 0.3844 |
+ | **LLaMA-2 Chat (English)** | 0.5703 | 0.4697 | 0.4737 |
  <br>

  # How to use
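To make the "zero-out" objective and the per-example padding described under Training Details above concrete, here is a minimal sketch, assuming a LLaMA-2 7B base checkpoint and an EOS-as-padding workaround (neither is stated in this commit): the whole prompt stays visible through the attention mask, but prompt and padding positions are labelled `-100`, which the standard cross-entropy loss ignores, so only response tokens are back-propagated.

```python
# Hedged sketch of the "zero-out" objective: prompt tokens are attended to but
# excluded from the loss by labelling them -100 (ignored by cross-entropy).
# The base checkpoint name and the pad-token choice are assumptions for illustration.
import torch
from transformers import AutoTokenizer

BASE_MODEL = "meta-llama/Llama-2-7b-hf"  # assumed base checkpoint
MAX_LEN = 512                            # input sequence length used above

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:
    # LLaMA-2's tokenizer ships without a padding token; reusing EOS is one
    # common way to adapt it to accept padding, as the card describes.
    tokenizer.pad_token = tokenizer.eos_token

def build_example(prompt: str, response: str) -> dict:
    """Tokenize one prompt/response pair so it fills the full 512-token input."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]

    input_ids = prompt_ids + response_ids + [tokenizer.eos_token_id]
    labels = [-100] * len(prompt_ids) + response_ids + [tokenizer.eos_token_id]

    # Each example occupies the whole sequence length: truncate, then right-pad to 512.
    input_ids, labels = input_ids[:MAX_LEN], labels[:MAX_LEN]
    pad = MAX_LEN - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * pad
    input_ids = input_ids + [tokenizer.pad_token_id] * pad
    labels = labels + [-100] * pad  # padding positions are also zeroed out of the loss

    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.tensor(attention_mask),
        "labels": torch.tensor(labels),
    }
```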
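The stated hyper-parameters map onto a standard Hugging Face `TrainingArguments` configuration roughly as follows. This is only a sketch of how the reported numbers fit together; the commit does not say which training framework was actually used, and the output path is a placeholder.

```python
# Rough mapping of the reported hyper-parameters onto Hugging Face TrainingArguments.
# Illustrative sketch only, not the authors' actual configuration.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="gervasio-sft",         # placeholder path
    learning_rate=2e-5,                # 2 * 10^-5
    weight_decay=0.1,
    num_train_epochs=2,                # two-epoch regime
    warmup_steps=0,                    # no warm-up
    per_device_train_batch_size=16,    # batch size of 16
    gradient_accumulation_steps=16,    # 16 accumulation steps
)
# With every example fixed at 512 tokens, each optimisation step then sees the
# same number of input tokens per device: 512 * 16 * 16 = 131,072.
```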
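For the held-out tasks, the table reports F1. Once predictions have been collected for a test split such as the translated MRPC, a score of that kind can be computed along these lines; the gold and predicted labels below are placeholders, and the exact prompting and answer-extraction protocol is not specified in this excerpt.

```python
# Hedged sketch: an F1 score of the kind reported in the table, computed from
# placeholder gold labels and placeholder labels parsed from model outputs.
from sklearn.metrics import f1_score

gold = [1, 0, 1, 1, 0, 1]   # placeholder gold labels (e.g. paraphrase / not, for MRPC)
pred = [1, 0, 0, 1, 0, 1]   # placeholder predictions

print(f"F1 = {f1_score(gold, pred):.4f}")
```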