---
license: cc-by-nc-4.0
language:
- ko
base_model:
- TwinDoc/RedWhale-tv-10.8B-v1.0
pipeline_tag: text-generation
library_name: transformers
---

# Model Card for RedWhale-tv-10.8B-ipt-v0.1

<img src="https://huggingface.co/TwinDoc/RedWhale-tv-10.8B-v1.0/resolve/main/company_agilesoda__icon_RWTV.png" width="648">

## Model Description

**RedWhale-tv-10.8B-ipt-v0.1** is an **Instruction Pre-Trained (IPT)** version of **RedWhale-tv-10.8B-v1.0**.

To use the model, please request access to this repository.

## About the Model

- **Name:** TwinDoc/RedWhale-tv-10.8B-ipt-v0.1
- **Foundation Model:** RedWhale-tv-10.8B-v1.0
- **Train Corpus:** being updated
- **Developed by:** AGILESODA (애자일소다)
- **Model type:** mistral
- **Language(s) (NLP):** Korean
- **License:** cc-by-nc-sa-4.0
- **Paper:** [RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining](https://arxiv.org/abs/2408.11294)

## Load the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

YOUR_HF_TOKEN_READ = "hf_..."  # a read-access token is required because the repo is gated

model_name_or_path = "TwinDoc/RedWhale-tv-10.8B-ipt-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, token=YOUR_HF_TOKEN_READ)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, token=YOUR_HF_TOKEN_READ)
```

## Generate Text

```python
messages = [
    {'content': '당신은 다양한 작업에 대한 한국어 지침을 제공하도록 훈련된 다국어 AI 모델입니다.', 'role': 'system'},
    {'content': '한국의 전통 음식은 무엇인가요?', 'role': 'user'}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# text = '<s> [INST] 당신은 다양한 작업에 대한 한국어 지침을 제공하도록 훈련된 다국어 AI 모델입니다.\n\n한국의 전통 음식은 무엇인가요? [/INST]'

# add_special_tokens=False: the chat template already prepends <s>, so avoid a double BOS
encodings = tokenizer(text, return_tensors='pt', add_special_tokens=False)
terminators = [tokenizer.eos_token_id]
max_new_tokens = 64

outputs = model.generate(**encodings, eos_token_id=terminators, max_new_tokens=max_new_tokens)
generated_text = tokenizer.batch_decode(outputs)[0]
# generated_text = '<s> [INST] 당신은 다양한 작업에 대한 한국어 지침을 제공하도록 훈련된 다국어 AI 모델입니다.\n\n한국의 전통 음식은 무엇인가요? [/INST] 한국의 전통 음식은 다양한 지역과 계절에 따라 다양한 종류가 있습니다. 대표적인 전통 음식은 다음과 같습니다.\n\n1. **비빔밥**: 비빔밥은 다양한 재료를 섞어 만든 밥 위에 양념을 뿌려 먹는 음식입니다.\n2. **김치**: 김치는 한국의 대표적인 발효 식품'
```
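
Because the decoded output repeats the prompt, a small post-processing step can isolate the model's reply. This is a sketch, not part of the card's API: `extract_reply` is a hypothetical helper that simply splits on the Mistral-style `[/INST]` marker shown in the comments above.

```python
def extract_reply(generated_text: str) -> str:
    # Keep only what follows the final [/INST] marker,
    # then drop the end-of-sequence tag and surrounding whitespace.
    reply = generated_text.rsplit("[/INST]", 1)[-1]
    return reply.replace("</s>", "").strip()

sample = "<s> [INST] 한국의 전통 음식은 무엇인가요? [/INST] 비빔밥과 김치가 대표적입니다.</s>"
print(extract_reply(sample))  # → 비빔밥과 김치가 대표적입니다.
```

If the marker is absent, the helper returns the input unchanged, which keeps it safe to apply to arbitrary decoded strings.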

## Generate Streaming Text

```python
from transformers import TextStreamer

text_streamer = TextStreamer(tokenizer)

messages = [
    {'content': '당신은 다양한 작업에 대한 한국어 지침을 제공하도록 훈련된 다국어 AI 모델입니다.', 'role': 'system'},
    {'content': '한국의 전통 음식은 무엇인가요?', 'role': 'user'}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

encodings = tokenizer(text, return_tensors='pt', add_special_tokens=False)
terminators = [tokenizer.eos_token_id]
max_new_tokens = 64

# The streamer prints the text to stdout token by token as it is generated;
# `generate` still returns the full sequence of token ids.
outputs = model.generate(**encodings, streamer=text_streamer, eos_token_id=terminators, max_new_tokens=max_new_tokens)
# streamed output: '<s> [INST] 당신은 다양한 작업에 대한 한국어 지침을 제공하도록 훈련된 다국어 AI 모델입니다.\n\n한국의 전통 음식은 무엇인가요? [/INST] 한국의 전통 음식은 다양한 지역과 계절에 따라 다양한 종류가 있습니다. 대표적인 전통 음식은 다음과 같습니다.\n\n1. **비빔밥**: 비빔밥은 다양한 재료를 섞어 만든 밥 위에 양념을 뿌려 먹는 음식입니다.\n2. **김치**: 김치는 한국의 대표적인 발효 식품'
```
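
For a quick sanity check of the prompt format without loading the tokenizer, the Mistral-style template shown in the comments above can be reproduced with plain string formatting. `build_prompt` is a hypothetical helper based on that commented output, not an official API:

```python
def build_prompt(messages):
    # Mirrors the rendered template from the card's comments:
    # '<s> [INST] {system}\n\n{user} [/INST]'
    system = next(m["content"] for m in messages if m["role"] == "system")
    user = next(m["content"] for m in messages if m["role"] == "user")
    return f"<s> [INST] {system}\n\n{user} [/INST]"

messages = [
    {"content": "당신은 다양한 작업에 대한 한국어 지침을 제공하도록 훈련된 다국어 AI 모델입니다.", "role": "system"},
    {"content": "한국의 전통 음식은 무엇인가요?", "role": "user"},
]
print(build_prompt(messages))
```

For real use, prefer `tokenizer.apply_chat_template`, which always matches the template shipped with the model; this sketch is only for eyeballing what that template produces.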

## License

<img src="https://huggingface.co/TwinDoc/RedWhale-tv-10.8B-v1.0/resolve/main/license__icon.png" width="324">

The content of this project, created by AGILESODA, is licensed under the [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.

## Citation

```bibtex
@misc{vo2024redwhaleadaptedkoreanllm,
      title={RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining},
      author={Anh-Dung Vo and Minseong Jung and Wonbeen Lee and Daewoo Choi},
      year={2024},
      eprint={2408.11294},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.11294},
}
```

**Built with:**

<a href="http://www.agilesoda.com/sub/twin_doc.php">
  <img src="https://huggingface.co/TwinDoc/RedWhale-tv-10.8B-v1.0/resolve/main/company_agilesoda_twindoc__icon.png" alt="AgileSoda TwinDoc Icon">
</a>