jarodrigues committed
Commit 79aedc8 · verified · 1 Parent(s): efc19b8

Update README.md

Files changed (1)
  1. README.md +28 -19

README.md CHANGED
@@ -102,31 +102,40 @@ And from SuperGLUE, we included these other four tasks:
 Instruction templates have been manually crafted for each task.
 These take the various fields in the dataset and arrange them into a prompt.
- For instance, appending ``Frase 1:'' (Eng.~``Sentence 1:'') before the first sentence of an example in the RTE dataset.
 These templates are listed in full detail in TODO.

- ## Preprocessing

- We filtered the PT-BR corpora using the [BLOOM pre-processing](https://github.com/bigscience-workshop/data-preparation) pipeline.
- We skipped the default filtering of stopwords since it would disrupt the syntactic structure, and also the filtering for language identification given the corpus was pre-selected as Portuguese.

- # Evaluation
-
- The base model version was evaluated on downstream tasks, namely the translations into PT-PT of the English data sets used for a few of the tasks in the widely-used [GLUE benchmark](https://huggingface.co/datasets/glue).
-
- ## GLUE tasks translated

- We resorted to [GLUE-PT](https://huggingface.co/datasets/PORTULAN/glue-ptpt), a **PT-PT version of the GLUE** benchmark.
- We automatically translated the same four tasks from GLUE using [DeepL Translate](https://www.deepl.com/), which specifically provides translation from English to PT-PT as an option.
-
- | Model                    | RTE (Accuracy) | WNLI (Accuracy)| MRPC (F1) | STS-B (Pearson) |
- |--------------------------|----------------|----------------|-----------|-----------------|
- | **Albertina-PT-PT**      | **0.8339**     | 0.4225         | **0.9171**| **0.8801**      |
- | **Albertina-PT-PT base** | 0.6787         | **0.4507**     | 0.8829    | 0.8581          |

 <br>

 # How to use
@@ -135,9 +144,9 @@ You can use this model directly with a pipeline for causal language modeling (CL
135
 
136
  ```python3
137
  >>> from transformers import pipeline
138
- >>> generator = pipeline(model='PORTULAN/gervasio-ptbr-base')
139
- >>> generator("A música brasileira é", max_new_tokens=10)
140
- [{'generated_text': 'A música brasileira é uma das mais ricas do mundo. Ao'}]
141
 
142
 
143
 
 
 Instruction templates have been manually crafted for each task.
 These take the various fields in the dataset and arrange them into a prompt.
 These templates are listed in full detail in TODO.

+ # Training Details

+ We applied supervised fine-tuning with a causal language modeling (CLM) training objective, using a zero-out technique during the fine-tuning process.
+ Specifically, while the entire prompt received attention during fine-tuning, only the response tokens were subjected to back-propagation.
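
As an illustration of this zero-out scheme, the sketch below masks the prompt tokens with the label value -100, which Hugging Face causal-LM models ignore in the cross-entropy loss; the helper function and its names are ours, not taken from the actual training code.

```python3
# Minimal sketch of the zero-out technique (illustrative, not the authors'
# training script): prompt tokens remain in the input and are attended to,
# but their labels are set to -100 so that only the response tokens are
# back-propagated.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PORTULAN/gervasio-ptpt-decoder")

def build_example(prompt: str, response: str, max_length: int = 512):
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    input_ids = (prompt_ids + response_ids)[:max_length]
    # Zero out the prompt: -100 labels contribute nothing to the loss.
    labels = ([-100] * len(prompt_ids) + response_ids)[:max_length]
    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.ones(len(input_ids), dtype=torch.long),
        "labels": torch.tensor(labels),
    }
```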

+ In terms of hyper-parameters, both models were trained with a learning rate of 2 * 10^-5, a weight decay of 0.1, and a two-epoch training regime without warm-up. To ensure the same number of tokens was back-propagated per step, we employed an input sequence of 512 tokens with a batch size of 16 and 16 accumulation steps.
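
For reference, these hyper-parameters map onto transformers.TrainingArguments roughly as sketched below; the output directory name and the reading of the batch of 16 as a per-device setting are our assumptions, and the actual training script is not part of this repository.

```python3
# Rough mapping of the reported hyper-parameters (a sketch, not the
# authors' script); the 512-token input length is enforced at
# tokenization time rather than here.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gervasio-finetune",   # hypothetical output directory
    learning_rate=2e-5,               # 2 * 10^-5
    weight_decay=0.1,
    num_train_epochs=2,
    warmup_steps=0,                   # no warm-up
    per_device_train_batch_size=16,   # assumption: 16 per device
    gradient_accumulation_steps=16,
)
```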

+ Due to hardware limitations that imposed a shorter sequence length (512) than the base model's (4096), instead of the typical practice of concatenating all training examples and then dividing them into batches of the same input sequence length, we kept each example separate.
+ In other words, each example occupies the full input sequence length.

+ To achieve this, we adapted the tokenizer of the base model to accept padding, which allows grouping examples of different sizes into batches while preserving the original input sequence length.
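
One common way to obtain this padding behaviour is sketched below; reusing the end-of-sequence token as the padding token is our assumption, not necessarily the exact adaptation applied to the base model's tokenizer.

```python3
# Sketch: give the tokenizer a padding token and pad every example to the
# fixed 512-token input sequence length (illustrative only).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("PORTULAN/gervasio-ptpt-decoder")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # assumption: reuse EOS as padding

batch = tokenizer(
    ["A música portuguesa é", "A música portuguesa é uma das mais ricas do mundo"],
    padding="max_length",
    truncation=True,
    max_length=512,
    return_tensors="pt",
)
print(batch["input_ids"].shape)  # torch.Size([2, 512])
```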

+ For the model training process, we resorted to an a2-megagpu-16gb Google Cloud A2 VM, equipped with 16 GPUs, 96 vCPUs, and 1,360 GB of RAM.
+ The training of each model took approximately two hours.

+ # Evaluation

+ For testing, we reserved the translated datasets MRPC (similarity) and RTE (inference) from GLUE, and COPA (reasoning/QA) from SuperGLUE, which were taken as representative of three major types of tasks and were not seen during training.
+ We also employed data augmentation techniques to enhance the size and diversity of our dataset.
+ This involved repurposing the tasks in various ways, such as answer generation from MultiRC, question generation from BoolQ, and other relevant modifications.
+
+ +++ PT-BR only +++
+ For further testing of our decoder, in addition to the testing data described above, we also reused some of the datasets that had been resorted to for American Portuguese
+ to test the state-of-the-art Sabiá model and that were originally developed with materials in Portuguese:
+ ASSIN 2 RTE (entailment), ASSIN 2 STS (similarity), BLUEX (question answering), ENEM 2022 (question answering) and FaQuAD (extractive question answering).
+ +++
+
+ | Model                    | MRPC (F1)      | RTE (F1)       | COPA (F1) |
+ |--------------------------|----------------|----------------|-----------|
+ | **Gervásio 7B PT-PT**    | **0.7273**     | **0.8291**     | **0.5459**|
+ | **LLaMA 2**              | 0.0328         | 0.0482         | 0.3844    |
+ | **LLaMA 2 Chat**         | 0.5703         | 0.4697         | 0.4737    |
  <br>

 # How to use

 ```python3
 >>> from transformers import pipeline
+ >>> generator = pipeline(model='PORTULAN/gervasio-ptpt-decoder')
+ >>> generator("A música portuguesa é", max_new_tokens=10)
+ [{'generated_text': 'A música portuguesa é uma das mais ricas do mundo'}]