File size: 6,926 Bytes
8da1560 f503b95 f72c6f1 f503b95 608ede5 c2fc322 f503b95 51ca017 f503b95 f72c6f1 f503b95 26a395b f503b95 8c1ce9a f503b95 3cefbdb f503b95 3cefbdb f503b95 8da1560 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 |
---
license: llama2
datasets:
- snow_simplified_japanese_corpus
- khalidalt/tydiqa-goldp
- csebuetnlp/xlsum
language:
- ja
---
# About
This model is Lightblue's QLoRA finetune of OpenOrca's [Open-Orca/OpenOrcaxOpenChat-Preview2-13B](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B) model on Japanese fine-tuning datasets.
This model specialises on answering **Closed Question Answering** in Japanese. Input a piece of reference text, ask a question, and see the model answer based on the reference text.
We trained on equal samples of the following three datasets:
* [SNOW](https://huggingface.co/datasets/snow_simplified_japanese_corpus)
* [TyDiQA (Ja)](https://huggingface.co/datasets/khalidalt/tydiqa-goldp)
* [XLSUM (Ja)](https://huggingface.co/datasets/csebuetnlp/xlsum)
which resulted in a dataset of 13,167 samples total.
These three datasets were chosen as they represent three distinct fine-tuning tasks (Text simplification, question answering, and text summarization, respectively) which we hypothesize can help to improve the language models suitability for dealing with Japanese data.
These three datasets make up the model name: STX.
With these datasets, we achieve the following scores on the JGLUE benchmark:
| Model Name | Open-Orca/OpenOrcaxOpenChat-Preview2-13B | lightblue/openorca_stx |
|------------------------|------------------------------------------|------------------------|
| jsquad-1.1-0.3 | 0.692 | 0.836 |
| jcommonsenseqa-1.1-0.3 | 0.831 | 0.782 |
| jnli-1.1-0.3 | 0.504 | 0.48 |
| marc_ja-1.1-0.3 | 0.936 | 0.959 |
Our model achieves much better results on the question answering benchmark (JSQuAD) than the base checkpoint without monstrous degradation of performance on multi-choice question benchmarks (JCommonSense, JNLI, MARC-Ja) purely through QLoRA training.
This shows the potential for applying strong language models such as [Open-Orca/OpenOrcaxOpenChat-Preview2-13B](https://huggingface.co/Open-Orca/OpenOrcaxOpenChat-Preview2-13B) to minimal QLoRA fine-tuning using Japanese fine-tuning datasets to achieve better results at narrow NLP tasks.
# How to use
```python
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
model_dir, torch_dtype=torch.bfloat16, device_map='auto',
)
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
def do_closed_qa(context, question):
return context + "\n\n" + question
test_article = """ใใขใใใใฎใฌใใผใใชใผใซใใชใผใใปใใคใฑใซ้ธๆใใใใใฌใคใถใผใฉใขใณRGใใใๆฌไบบๅ
ฌ่ชใฎใขใใใใงใใใใฉใฐใใผใใกใณใฎๅๅฟใซๅฐใ้ฉใใใใใงใใ
ใใชใผใใปใใคใฑใซ้ธๆใฎใขใใใใฏใไฝใใใฃใใใงใใใ
ใ2015ๅนดใฎใฏใผใซใใซใใ๏ผWๆฏ๏ผใคใณใฐใฉใณใๅคงไผใงๆฅๆฌใๅใขใใชใซใๅใใๆฌกใฎๆฅใใไบฌ้ฝใงใฎ็ช็ตใญใฑใงใใใๅฝๆใฏใใขใใใซใฎๅ
ฑๅๅตๆฅญ่
ในใใฃใผใใปใธใงใใบใฎใขใใใใฐใใใงใใใใไธ็ทใซใญใฑใใใฆใใใธใฃใณใฐใซใใฑใใใใใใชใผใใปใใคใฑใซใซไผผใฆใพใใใใธใงใใบใฎใพใพใใใใใใใใชใใงใใ๏ผใใจ่จใใใใฎใๅงใพใใงใใ
ใใใ ใใฟใใช็ฅ่ญใใชใใใฉใฐใใผใทใงใใใๆขใใๆฅๆฌไปฃ่กจใฎใฆใใใผใ ใๅฃฒใๅใใ ใฃใใฎใงใ่ตคใฃใฝใใฆใใใผใ ใจใใใใใฎ็ญใใณใใฏใใฆใใจใใใใSNSใงใใชใผใใปใใคใฑใซใงใใใฃใฆใใฃใฑใๅ็ใ่ผใใพใใใ
ใใใใจใใใใ่ฆใใชใผใใใๆฌไบบใใDM๏ผใใคใฌใฏใใกใใปใผใธ๏ผใๅฑใใพใใใใใขใใใใใใใจใใใใใพใใใใใขใใใใใใใชใใๅใฎใฆใใใผใ ใ้ใใพใใฎใง็ใฆใใ ใใใใจใWๆฏๅพใซใฆใใใผใ 2็ใจใใณใใใฝใใฏในใชใฉใใปใใพใซ้ใฃใฆใใฆใใใพใใใไป็ใฆใใใฎใใใใงใใ
ใใใพใงใๆฐใ
ใฎ่ๅไบบใใขใใใใใฆใใใใพใใใใชใผใ้ธๆใฎใใฟใฎๅ้ฟใฏใใใใงใใใใ
ใใๅใฏใฉใฐใใผ็ต้จใใชใใงใใใใฉใฐใใผใๅ
จ็ถ็ฅใใชใใฃใใใฉใใใฃใฑใๆฌไบบใใใฆใใใผใ ใ้ ใใฆใใฃใฆใใโๅฐ็ฑ ๏ผใใใใ๏ผโใฟใใใชใฎใใใฃใฆใใใใใคใฏใชใผใใใๆฌไบบใซ่ชใใใใฆใใใจใไธ็ฎ็ฝฎใใใฆใใใฎใใชใจๆใใพใใ
ใใใใฃใฆใใใใจใฏใ่ฆใ็ฎใๆฌไบบใซๅฏใใฆใฏใณใใผใ ใฃใฆ่จใใ ใใชใใงใใใฉใญใใใใงใใใใใใชใผใใใใ ใใจ่จใฃใฆใใใใพใใ
ใใใชใผใใใใจๅฎ้ใซไผใใใจใชใใฆใ็ฐกๅใซใฏใงใใชใใใใชใใงใใใใงใใใชใผใใใใฎใพใญใใใฆใใRGใซใฏไผใใใใใฟใใใช๏ผ็ฌ๏ผใไฝใ ใใใชใๆๅใช็ฅ็คพใฎๆฏ็คพใฎใใใชๅญๅจใงใใใญใใใใใใใใใใจใใๆๅณใงใฏไปใฎใขใใใใจใฏใใใ้ใใพใใญใ
"""
test_question = "ใใชใผใใปใใคใฑใซใฏไฝใ้ใฃใฆใใพใใใ๏ผ"
pipe(do_closed_qa(test_article, question), max_new_tokens=128, temperature=0)[0]["generated_text"]
# "ใฆใใใผใ 2็ใจใใณใใใฝใใฏในใชใฉ"
```
# Training details
We trained using the following three minimalistic prompt templates for the three tasks in STX:
* SNOW
```
f"""ๅ
ใฎๆฅๆฌ่ช๏ผ
{original_ja}
ใทใณใใซใชๆฅๆฌ่ช๏ผ"""
```
* TyDiQA
```
f"""{passage_text}
{question_text}"""
```
```
* XLSum
```
f"""่จไบ๏ผ
{original_ja}
่ฆ็ด๏ผ"""
```
This model was trained for 1000 steps (1.2 epochs) with the model being evaluated every 50 steps. We then chose the best model from these evaluations based on validation loss.
We used the [qlora](https://github.com/artidoro/qlora) package from artidoro.
We trained with the following hyperparameters:
```
Per device evaluation batch size: 16
Per device train batch size: 8
LoRA (lora_r): 64
LoRA alpha (lora_alpha): 16
LoRA modules: all
Double quantization: Enabled
Quantization type: nf4
BF16: Enabled
Bits: 4
Warmup ratio: 0.03
Learning rate scheduler type: Constant
Gradient checkpointing: Enabled
Gradient accumulation steps: 2
Learning rate: 0.0002
Adam beta2: 0.999
Maximum gradient norm: 0.3
LoRA dropout: 0.05
Weight decay: 0.0
```

 |