---
license: apache-2.0
language:
- ca
- va
tags:
- FLOR
- Bloom
- Aitana
- Catalan
- Valencian
pipeline_tag: text-generation
---

# AITANA-6.3B

<img src="https://cdn-uploads.huggingface.co/production/uploads/639873bb315923c0d5b4c883/6EPbzDJbYtyX_oS15K6jF.png" width="50%" height="50%"/>

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Model description](#model-description)
- [Intended uses and limitations](#intended-uses-and-limitations)
- [Demo](#demo)
- [How to use](#how-to-use)
- [Limitations and bias](#limitations-and-bias)
- [Training](#training)
- [Evaluation](#evaluation)
- [Additional information](#additional-information)

</details>

## Model description

**AITANA-6.3B** is a text generation model for causal language modeling with a decoder-only architecture. 
It was obtained by continual pre-training of [FLOR-6.3B](https://huggingface.co/projecte-aina/FLOR-6.3B), with emphasis on data (listed below) 
in **Valencian** (similar to Catalan). This first version of the model was trained on a total of 1,304 million tokens per epoch, with two epochs over the data. The **political and administrative domains** are highly represented in this version of the model.


This model uses FLOR-6.3B as its starting point for training and shares its tokenizer.

## Intended uses and limitations

Like **FLOR-6.3B**, **AITANA-6.3B** is a base model for causal language modeling. It can be used as is for text generation, 
although **fine-tuning or instruction-tuning on specific tasks is recommended for its final use**.

This language model has been trained with data in a formal register, mostly from the 
administrative and political domains, so text generated with it is expected 
to follow this same register.

## Demo 

The following link gives access to an interactive demo for testing text generation with the model:

[Demo link](https://llm-aitana.gplsi.es/)

In the demo, you can adjust the number of words generated as well as the decoding technique to be used by 
the model (top p, top k) and other parameters such as temperature.
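
To make the decoding options above more concrete, here is a minimal sketch of how top k, top p and temperature reshape a next-token distribution before sampling. The probabilities and token strings are made up for illustration, and the helper names (`top_k_filter`, `top_p_filter`, `apply_temperature`) are hypothetical; this is not the demo's actual implementation.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise their probabilities."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


def apply_temperature(probs, temperature):
    """Sharpen (T < 1) or flatten (T > 1) a distribution: p_i^(1/T), renormalised."""
    scaled = {tok: prob ** (1.0 / temperature) for tok, prob in probs.items()}
    total = sum(scaled.values())
    return {tok: s / total for tok, s in scaled.items()}


# Toy next-token distribution (illustrative numbers, not model output).
next_token_probs = {"aprovar": 0.5, "rebutjar": 0.3, "ajornar": 0.15, "celebrar": 0.05}

# Nucleus (top p) sampling: drop the low-probability tail, then sample.
candidates = top_p_filter(next_token_probs, p=0.9)
token = random.choices(list(candidates), weights=list(candidates.values()))[0]
print(token)
```

Lower values of `k`, `p` or the temperature make the output more conservative; higher values make it more varied.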

## How to use
```python
import torch
from transformers import AutoTokenizer, pipeline

# Prompt in Valencian that the model will continue.
input_text = "Les corts valencianes han pres la decisió de"

model_id = "gplsi/Aitana-6.3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build a text-generation pipeline; bfloat16 halves memory use compared with
# float32, and device_map="auto" places the weights on the available device(s).
generator = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# Sample each new token from the 10 most likely candidates.
generation = generator(
    input_text,
    do_sample=True,
    top_k=10,
    eos_token_id=tokenizer.eos_token_id,
)

print(f"Result: {generation[0]['generated_text']}")
```

## Training

### Training data

The training corpus was obtained by web scraping public data from several sources: the 
[Official Gazette of the University of Alicante (BOUA)](https://www.boua.ua.es/ca), [the Official Gazette of the Generalitat Valenciana (DOGV)](https://dogv.gva.es/va) and accurate data provided by 
[the Valencian Courts (DSCV and DSCCV)](https://www.cortsvalencianes.es/ca-va/). This gives a total of 1,304 million tokens, broken down in the following table.

| Dataset | Language | Words (per epoch) | Epochs | Total tokens |
|---------|----------|-------------------|--------|--------------|
| DSCV    | va       | 31.98M            | 2      | 57.05M       |
| DSCCV   | va       | 45.59M            | 2      | 80.91M       |
| BOUA    | va       | 11.65M            | 2      | 29.02M       |
| DOGV    | va       | 301.59M           | 2      | 982.33M      |
| DOGCV   | va       | 54.92M            | 2      | 154.32M      |
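
As a quick check of the corpus size, the token counts in the table can be added up. The snippet below is only an arithmetic sketch; the variable names are illustrative and the figures are copied from the table.

```python
# Token counts in millions, copied from the last column of the table above.
total_tokens_m = {
    "DSCV": 57.05,
    "DSCCV": 80.91,
    "BOUA": 29.02,
    "DOGV": 982.33,
    "DOGCV": 154.32,
}

corpus_total_m = sum(total_tokens_m.values())
print(f"{corpus_total_m:.2f}M tokens")  # 1303.63M tokens, i.e. about 1,304 million

# Share of the corpus contributed by each source, in percent.
shares = {name: 100 * n / corpus_total_m for name, n in total_tokens_m.items()}
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {share:.1f}%")
```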

Several of these sources were already used to train FLOR-6.3B, so the data collection date of that 
model was taken into account and the web pages were scraped only from that date onwards.

Information on the datasets used for training is shown below:

- BOUA: Official Bulletin of the University of Alicante. It contains documents issued by the University of Alicante in Valencian: grants, calls issued by the university, regulations, resolutions of laws that affect the university environment, and corrections of errors in previously issued documents.

- DOGV: Official Journal of the Generalitat Valenciana. This dataset contains official communiqués of different kinds issued by the Generalitat Valenciana, with data entirely in Valencian. It mainly covers measures taken in the legal field, approval of laws, and public sector communiqués. It comprises 18 documents covering communiqués from 1998 to 2018 and three more recent documents with data from 2019 to 2023.

- DOGCV: also the Official Journal of the Generalitat Valenciana, but only the historical documents from 1980 to 1997. 

- DSCV: Journal of the Valencian Parliament. This dataset contains transcriptions of the interventions made by the different participants during the plenary sessions of the Valencian Parliament. It covers data from 2001 to 1999 up to 2022; each transcript is a single .html file.

- DSCCV: a dataset from the Valencian Parliament journal that contains transcriptions of the different commissions held. As in the previous case, there is one file per transcription.


### Training parameters

The model was meant to work with a large context window when generating text, so an input size of 2048 
tokens was used, with a minimum context window of 512 tokens when input sequences are truncated. 80% of the data was used for training, 
while the remaining 20% was used for evaluation. The following table summarizes the training parameters:

| Parameter                       | Value |
|---------------------------------|-------|
| Epochs                          | 1     |
| Learning Rate                   | 2e-5  |
| Warmup Steps                    | 0     |
| Precision                       | bf16  |
| Weight decay                    | 1e-1  |
| Training Fraction               | 0.8   |
| Evaluation Fraction             | 0.2   |
| Input size (tokens)             | 2048  |
| Minimum context window (tokens) | 512   |
| Training time (hours/epoch)     | 40    |
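
The exact preprocessing code is not part of this card. The sketch below shows one possible reading of the input size, minimum context window and 80/20 split: documents are cut into windows of at most 2048 tokens, a trailing window shorter than 512 tokens is dropped, and the resulting samples are shuffled and split. The function names, the drop rule and the fixed seed are assumptions made for illustration.

```python
import random

INPUT_SIZE = 2048      # tokens per training sequence (from the table above)
MIN_CONTEXT = 512      # shortest window that is still kept (from the table above)
TRAIN_FRACTION = 0.8   # fraction of samples used for training


def chunk_token_ids(token_ids, input_size=INPUT_SIZE, min_context=MIN_CONTEXT):
    """Cut a tokenized document into windows, dropping a too-short remainder.

    Assumption: windows shorter than min_context are discarded.
    """
    chunks = []
    for start in range(0, len(token_ids), input_size):
        chunk = token_ids[start:start + input_size]
        if len(chunk) >= min_context:
            chunks.append(chunk)
    return chunks


def train_eval_split(samples, train_fraction=TRAIN_FRACTION, seed=0):
    """Shuffle the samples reproducibly and split them into train and eval sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in for a tokenized document of 5000 tokens.
document = list(range(5000))
chunks = chunk_token_ids(document)
print([len(c) for c in chunks])  # [2048, 2048, 904]

train, evaluation = train_eval_split(list(range(10)))
print(len(train), len(evaluation))  # 8 2
```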
 
### Devices

A total of 4 A100 GPUs with 40 GB of memory each were used to train the model, which resulted in a training time of approximately 
40 hours per epoch. A mini-batch size of 2 and a batch size of 32 were used to calculate backpropagation.
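
The relation between the mini-batch size and the batch size can be sketched with a little arithmetic. This assumes that the mini-batch size of 2 is per GPU and that the batch size of 32 is reached through gradient accumulation; neither detail is stated explicitly here, so treat the numbers as an illustration.

```python
NUM_GPUS = 4             # A100 cards used for training
MINI_BATCH_PER_GPU = 2   # assumption: the mini-batch size is per GPU
BATCH_SIZE = 32          # sequences contributing to each weight update
INPUT_SIZE = 2048        # maximum tokens per sequence

# Sequences processed in one forward/backward pass across all GPUs.
sequences_per_pass = NUM_GPUS * MINI_BATCH_PER_GPU

# Passes to accumulate before each optimizer step (assumed mechanism).
accumulation_steps = BATCH_SIZE // sequences_per_pass

# Upper bound on tokens per update, reached when every sequence is full length.
max_tokens_per_update = BATCH_SIZE * INPUT_SIZE

print(sequences_per_pass, accumulation_steps, max_tokens_per_update)  # 8 4 65536
```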

### Distributed Training Strategy

The model was trained with a distributed strategy called Fully Sharded Data Parallel ([FSDP](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html)).
With it, the model was sharded across the 4 A100s available for training, with a mini-batch size of 2 as 
previously discussed.

### Languages

In addition to the data already used for the training of FLOR-6.3B, data completely in **Valencian** from the sources mentioned in 
the previous section has been used.


## Evaluation
The model was evaluated using the loss and perplexity during the training stage, and the same metrics were 
computed during the evaluation stage. Due to the small amount of data, the evaluation was run 
at the end of each epoch.

| Epoch        | Mode       |  Loss    |  Perplexity |
|--------------|------------|----------|-------------|
|   1          | Training   |  0.6944  |    2.111    |
|   1          | Evaluation |  0.247   |    1.28     |
|   2          | Training   |  0.5335  |    1.705    |
|   2          | Evaluation |  0.4004  |    1.007    |
|   3          | Training   |  0.4768  |    1.611    |
|   3          | Evaluation |  0.9141  |    1.007    |
|   4          | Training   |  0.4586  |    1.582    |
|   4          | Evaluation |  0.125   |    1.007    |
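
For reference, perplexity is conventionally defined as the exponential of the mean cross-entropy loss. The sketch below assumes the loss is measured in nats (natural logarithm); how the values in the table were computed or averaged is not documented here, so the helper is only an illustration of the standard definition.

```python
import math


def perplexity(mean_cross_entropy_loss):
    """Perplexity from a mean cross-entropy loss expressed in nats."""
    return math.exp(mean_cross_entropy_loss)


# Example with the epoch 2 training loss from the table above.
print(round(perplexity(0.5335), 3))  # 1.705
```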

### Results

The following table shows the results obtained on different benchmarks, compared with 
the model used as the starting point for continual pre-training. The results were obtained with the pre-trained model; 
no instruction tuning or fine-tuning of any kind has been performed.

| Dataset                      |  Lang. |          Task             | Metric  | Aitana-6.3B |  Flor-6.3B  |
|------------------------------|--------|---------------------------|---------|-------------|-------------|
| Belebele Cat_latn            |   ca   | Reading Comprehension     |     acc |   **24.33** |  21.89      |
| CATCOLA                      |   ca   | Linguistic Acceptability  |     mcc |   -0.04     | **0.04**    |
| COPA                         |   ca   | Commonsense Reasoning     |     acc |   75.6      | **76.8**    |
| XStoryCloze                  |   ca   | Commonsense Reasoning     |     f1  |  **72.14**  |   70.88     |
| OpenBookQA                   |   ca   | Question Answering        |     acc |   **33.4**  | **33.4**    |
| Parafraseja                  |   ca   | Paraphrasing              |     acc |   61.7      | **62.38**   |
| PAWS-X                       |   ca   | Paraphrasing              |     acc |   58.55     | **60.75**   |
| PiQA                         |   ca   | Question Answering        |     acc |   69.8      | **70.51**   |
| SiQA                         |   ca   | Question Answering        |     acc |   45.91     | **47.34**   |
| ARC Easy                     |   ca   | Question Answering        |     acc |   **63.93** | 59.68       |
| ARC Challenge                |   ca   | Question Answering        |     acc |      33.45  | **33.53**   |
| XQuAD                        |   ca   | Question Answering        |     f1  |    59.36    | **59.74**   |
| COQCAT                       |   ca   | Question Answering        |     f1  |    63.42    |  **66.2**   |
| CatalanQA                    |   ca   | Question Answering        |     f1  |   71.42     |  **73.24**  |
| XNLI                         |   ca   | Natural Language Inference|     acc |   48.8      | **50.24**   |
| Teca                         |   ca   | Natural Language Inference|     acc |   46.62     | **49.79**   |
| WNLI                         |   ca   | Natural Language Inference|     acc |  **57.75**  | 54.93       |
| caBreu Extractive            |   ca   | Summarization             |  rouge1 |   **50.94** | 36.21       |
| caBreu Abstractive           |   ca   | Summarization             |  bleu   |   5.27      | **7.11**    |
| caBreu Extreme               |   ca   | Summarization             |  bleu   |   1.72      | **4.4**     |
| Mgsm direct                  |   ca   | Math                      |exact match |   **0.03**  | 0        |
| VeritasQA Gen                |   ca   | Truthfulness              |  bleu      |   4.18      | **21.56**|
| VeritasQA MC1                |   ca   | Truthfulness              |  acc       |   **23.18** | 22.35    | 
| VeritasQA MC2                |   ca   | Truthfulness              |  acc       |   34.95     | **35.19**|
| Phrases ca-va                |   ca/va| Translation - Adaptation  |  bleu      |   89.12     | **90.3** |
| Phrases va-ca                |   ca/va| Translation - Adaptation  |  bleu      |   **93.23** | 92.99    |
| Belebele Cat_latn            |   es   | Reading Comprehension     |  acc       |   **25.56** |  22.33   |
| PAWS                         |   es   | Paraphrasing              |     acc    |   56.5      | **57.5** |
| Escola                       |   es   | Paraphrasing              |     acc    |   **0.02**  |   0      |
| XStoryCloze                  |   es   | Commonsense Reasoning     |     f1  |  68.46  |   **69.76**     |
| XQuAD                        |   es   | Question Answering        |     f1  |    58.85    | **63.59**   |
| XLSum                        |   es   | Summarization             |  bleu   |   0.88      | **1.79**    |
| MGSM Direct                  |   es   | Math                      |exact match |   **0.02**  | 0        |
| VeritasQA Gen                |   es   | Truthfulness              |  bleu      |   13.57      | **22.11**|
| VeritasQA MC1                |   es   | Truthfulness              |  acc       |   **23.46**  | 21.51    | 
| VeritasQA MC2                |   es   | Truthfulness              |  acc       |   **37.52**  | 34.74|
| XNLI                         |   es   | Natural Language Inference|     acc    |   46.67     | **47.87**|
| WNLI                         |   es   | Natural Language Inference|     acc    |  53.52  | **56.34**    |
| Phrases es-va                |   es/va| Translation               |  bleu      |   70.28 | **70.52**|
| Phrases va-es                |   va/es| Translation               |  bleu      |   79.63 | **79.87**|


## Additional information


### Author

Language and Information System Group [GPLSI](https://gplsi.dlsi.ua.es/)

### Contact

For further information, please send an email to [GPLSI](https://gplsi.dlsi.ua.es/)


### Copyright

Copyright (c) 2024 by [GPLSI](https://gplsi.dlsi.ua.es/).

### License

[Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0)

### Funding

This work was funded by the [ILENIA](https://proyectoilenia.es/)-[VIVES](https://vives.gplsi.es/) project (2022/TL22/00215334).

### Disclaimer

The model published in this repository is intended for a generalist purpose and is available to third parties under a permissive Apache License, Version 2.0.

Be aware that the model may have biases and/or any other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using this model (or any system based on it) or become users of the model, they should note that it is their responsibility to mitigate the risks arising from its use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the model (GPLSI) be liable for any results arising from the use made by third parties.