---
license: cc-by-nc-4.0
language:
- ko
base_model:
- TwinDoc/RedWhale-tv-10.8B-v1.0
pipeline_tag: text-generation
library_name: transformers
---

# Model Card for RedWhale-tv-10.8B-ipt-v0.1

<img src="https://huggingface.co/TwinDoc/RedWhale-tv-10.8B-v1.0/resolve/main/company_agilesoda__icon_RWTV.png" width="648">

## Model Description

**RedWhale-tv-10.8B-ipt-v0.1** is an **Instruction Pre-Trained (IPT)** version of **RedWhale-tv-10.8B-v1.0**.

To use the model, please request access to the repository.

## About the Model

- **Name:** TwinDoc/RedWhale-tv-10.8B-ipt-v0.1
- **Foundation Model:** RedWhale-tv-10.8B-v1.0
- **Train Corpus:** being updated
- **Developed by:** 애자일소다 (AGILESODA)
- **Model type:** mistral
- **Language(s) (NLP):** Korean
- **License:** cc-by-nc-sa-4.0
- **Paper:** [RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining](https://arxiv.org/abs/2408.11294)

## Load the Model

```
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hugging Face token with read access to the repository
YOUR_HF_TOKEN_READ = "hf_..."

model_name_or_path = "TwinDoc/RedWhale-tv-10.8B-ipt-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, token=YOUR_HF_TOKEN_READ)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, token=YOUR_HF_TOKEN_READ)
```

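Before loading, it can help to estimate how much memory the weights alone will occupy. The sketch below is plain arithmetic and not part of the model's tooling: it assumes roughly 10.8 billion parameters (read off the "10.8B" in the model name) and the standard byte widths of common dtypes. The helper name `weight_memory_gb` is hypothetical.

```python
# Bytes used per parameter for common floating-point dtypes.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}

# Assumption: parameter count inferred from "10.8B" in the model name.
N_PARAMS = 10.8e9


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Approximate size of the weights in GB (10^9 bytes), excluding activations and KV cache."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: ~{weight_memory_gb(N_PARAMS, dtype):.1f} GB")
```

Under these assumptions, full-precision weights need about twice the memory of half-precision weights, so on constrained hardware it may be worth loading the model in a half-precision dtype (for example via the `torch_dtype` argument that many `transformers` releases accept in `from_pretrained`).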
## Generate Text

```
messages = [
    # System: "You are a multilingual AI model trained to provide Korean instructions for a variety of tasks."
    {'content': '당신은 다양한 작업에 대한 한국어 지침을 제공하도록 훈련된 다국어 AI 모델입니다.', 'role': 'system'},
    # User: "What are Korea's traditional foods?"
    {'content': '한국의 전통 음식은 무엇인가요?', 'role': 'user'}
]

# Render the chat as a prompt string (tokenize=False returns text, not tensors)
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# text = '<s> [INST] 당신은 다양한 작업에 대한 한국어 지침을 제공하도록 훈련된 다국어 AI 모델입니다.\n\n한국의 전통 음식은 무엇인가요? [/INST]'

encodings = tokenizer(text, return_tensors='pt')
terminators = [tokenizer.eos_token_id]
max_new_tokens = 64

outputs = model.generate(**encodings, eos_token_id=terminators, max_new_tokens=max_new_tokens)
generated_text = tokenizer.batch_decode(outputs)[0]
# generated_text = '<s> [INST] 당신은 다양한 작업에 대한 한국어 지침을 제공하도록 훈련된 다국어 AI 모델입니다.\n\n한국의 전통 음식은 무엇인가요? [/INST] 한국의 전통 음식은 다양한 지역과 계절에 따라 다양한 종류가 있습니다. 대표적인 전통 음식은 다음과 같습니다.\n\n1. **비빔밥**: 비빔밥은 다양한 재료를 섞어 만든 밥 위에 양념을 뿌려 먹는 음식입니다.\n2. **김치**: 김치는 한국의 대표적인 발효 식품'
```

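The commented values above show that the rendered prompt wraps the system and user messages in `[INST] ... [/INST]`, and that the decoded output repeats the prompt before the answer. The sketch below mirrors only that layout for a single system-plus-user exchange and then strips the prompt from the decoded text. Both helpers (`build_prompt`, `extract_reply`) are hypothetical illustrations; the tokenizer's own chat template remains the authoritative formatter.

```python
from typing import Dict, List


def build_prompt(messages: List[Dict[str, str]]) -> str:
    """Mirror the '<s> [INST] system\\n\\nuser [/INST]' layout shown in the example comments."""
    system = "\n\n".join(m["content"] for m in messages if m["role"] == "system")
    user_turns = [m["content"] for m in messages if m["role"] == "user"]
    if len(user_turns) != 1:
        raise ValueError("this sketch handles exactly one user turn")
    body = f"{system}\n\n{user_turns[0]}" if system else user_turns[0]
    return f"<s> [INST] {body} [/INST]"


def extract_reply(generated_text: str) -> str:
    """Return the text after the last '[/INST]' marker, without a trailing '</s>'."""
    _, _, reply = generated_text.rpartition("[/INST]")
    return reply.replace("</s>", "").strip()


if __name__ == "__main__":
    messages = [
        {"role": "system", "content": "당신은 다양한 작업에 대한 한국어 지침을 제공하도록 훈련된 다국어 AI 모델입니다."},
        {"role": "user", "content": "한국의 전통 음식은 무엇인가요?"},
    ]
    print(build_prompt(messages))
```

Calling `extract_reply(generated_text)` on the decoded output keeps only the model's answer, which is usually what an application wants to display.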
## Generate Streaming Text

```
from transformers import TextStreamer

# Prints decoded tokens to stdout as they are generated
text_streamer = TextStreamer(tokenizer)

messages = [
    # System: "You are a multilingual AI model trained to provide Korean instructions for a variety of tasks."
    {'content': '당신은 다양한 작업에 대한 한국어 지침을 제공하도록 훈련된 다국어 AI 모델입니다.', 'role': 'system'},
    # User: "What are Korea's traditional foods?"
    {'content': '한국의 전통 음식은 무엇인가요?', 'role': 'user'}
]

# Render the chat as a prompt string (tokenize=False returns text, not tensors)
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# text = '<s> [INST] 당신은 다양한 작업에 대한 한국어 지침을 제공하도록 훈련된 다국어 AI 모델입니다.\n\n한국의 전통 음식은 무엇인가요? [/INST]'

encodings = tokenizer(text, return_tensors='pt')
terminators = [tokenizer.eos_token_id]
max_new_tokens = 64

# A single generate call; the streamer prints the text while the token IDs are returned
outputs = model.generate(**encodings, streamer=text_streamer, eos_token_id=terminators, max_new_tokens=max_new_tokens)
# Streamed to stdout:
# '<s> [INST] 당신은 다양한 작업에 대한 한국어 지침을 제공하도록 훈련된 다국어 AI 모델입니다.\n\n한국의 전통 음식은 무엇인가요? [/INST] 한국의 전통 음식은 다양한 지역과 계절에 따라 다양한 종류가 있습니다. 대표적인 전통 음식은 다음과 같습니다.\n\n1. **비빔밥**: 비빔밥은 다양한 재료를 섞어 만든 밥 위에 양념을 뿌려 먹는 음식입니다.\n2. **김치**: 김치는 한국의 대표적인 발효 식품'
```

## License

<img src="https://huggingface.co/TwinDoc/RedWhale-tv-10.8B-v1.0/resolve/main/license__icon.png" width="324">

The content of this project, created by AGILESODA, is licensed under the [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.

## Citation

```
@misc{vo2024redwhaleadaptedkoreanllm,
      title={RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining},
      author={Anh-Dung Vo and Minseong Jung and Wonbeen Lee and Daewoo Choi},
      year={2024},
      eprint={2408.11294},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.11294},
}
```

**Built with:**

<a href="http://www.agilesoda.com/sub/twin_doc.php">
  <img src="https://huggingface.co/TwinDoc/RedWhale-tv-10.8B-v1.0/resolve/main/company_agilesoda_twindoc__icon.png" alt="AgileSoda TwinDoc Icon">
</a>