---
license: cc-by-nc-sa-4.0
language:
- ko
base_model:
- TwinDoc/RedWhale-tv-10.8B-v1.0
pipeline_tag: text-generation
library_name: transformers
---


# Model Card for RedWhale-tv-10.8B-ipt-v0.1

<img src="https://huggingface.co/TwinDoc/RedWhale-tv-10.8B-v1.0/resolve/main/company_agilesoda__icon_RWTV.png"  width="648">


## Model Description

**RedWhale-tv-10.8B-ipt-v0.1** is an **Instruction Pre-Trained (IPT)** version of **RedWhale-tv-10.8B-v1.0**.

To use the model, please request access to the repo.

## About the Model

- **Name:** TwinDoc/RedWhale-tv-10.8B-ipt-v0.1
- **Foundation Model:** RedWhale-tv-10.8B-v1.0
- **Train Corpus:** being updated
- **Developed by:** 애자일소다 (AGILESODA)
- **Model type:** mistral
- **Language(s) (NLP):** Korean
- **License:** cc-by-nc-sa-4.0
- **Paper:** [RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining](https://arxiv.org/abs/2408.11294)

## Load the Model

```
from transformers import AutoModelForCausalLM, AutoTokenizer

# Read-access token for the repo (request repo access first).
YOUR_HF_TOKEN_READ = "hf_..."

model_name_or_path = "TwinDoc/RedWhale-tv-10.8B-ipt-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, token=YOUR_HF_TOKEN_READ)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, token=YOUR_HF_TOKEN_READ)
```
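
Hard-coding a token in a script makes it easy to leak. As a minimal sketch, assuming the read token has been exported in an environment variable, it could be read at runtime instead. The variable name `HF_TOKEN` and the helper `read_hf_token` are assumptions for illustration and are not part of this repository.

```python
import os


def read_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the access token stored in `env_var`, or raise if it is missing.

    Both the helper name and the default variable name are illustrative
    assumptions, not something this model card defines.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set {env_var} to a Hugging Face read token first.")
    return token
```

The result can then be passed as `token=read_hf_token()` to the two `from_pretrained` calls above.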

## Generate Text

```
messages = [
  # System prompt: "You are a multilingual AI model trained to provide Korean instructions for a variety of tasks."
  {'content': '당신은 다양한 작업에 대한 한국어 지침을 제공하도록 훈련된 다국어 AI 모델입니다.', 'role': 'system'},
  # User prompt: "What are traditional Korean foods?"
  {'content': '한국의 전통 음식은 무엇인가요?', 'role': 'user'}
]

# Render the chat template to a prompt string (tokenize=False returns text, not tensors).
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# text = '<s> [INST] 당신은 다양한 작업에 대한 한국어 지침을 제공하도록 훈련된 다국어 AI 모델입니다.\n\n한국의 전통 음식은 무엇인가요? [/INST]'

encodings = tokenizer(text, return_tensors='pt')
terminators = [tokenizer.eos_token_id]
max_new_tokens = 64

outputs = model.generate(**encodings, eos_token_id=terminators, max_new_tokens=max_new_tokens)
# Decode the full sequence; the result contains the prompt followed by the completion.
generated_text = tokenizer.batch_decode(outputs)[0]
# generated_text = '<s>  [INST] 당신은 다양한 작업에 대한 한국어 지침을 제공하도록 훈련된 다국어 AI 모델입니다.\n\n한국의 전통 음식은 무엇인가요? [/INST] 한국의 전통 음식은 다양한 지역과 계절에 따라 다양한 종류가 있습니다. 대표적인 전통 음식은 다음과 같습니다.\n\n1. **비빔밥**: 비빔밥은 다양한 재료를 섞어 만든 밥 위에 양념을 뿌려 먹는 음식입니다.\n2. **김치**: 김치는 한국의 대표적인 발효 식품'
```
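
As the commented output shows, `generated_text` repeats the prompt before the model's answer. A minimal sketch of one way to keep only the answer, assuming the prompt always ends with `[/INST]` and that the end-of-sequence token decodes to `</s>` (the helper `extract_response` and the `</s>` default are assumptions for illustration):

```python
def extract_response(generated_text: str,
                     end_of_prompt: str = "[/INST]",
                     eos: str = "</s>") -> str:
    """Return the text after the last end-of-prompt marker, minus a trailing EOS.

    If the marker is absent, the whole string is returned (stripped).
    """
    # rpartition splits on the LAST occurrence, so earlier turns are skipped.
    _, sep, tail = generated_text.rpartition(end_of_prompt)
    response = (tail if sep else generated_text).strip()
    if response.endswith(eos):
        response = response[: -len(eos)].rstrip()
    return response
```

Calling `extract_response(generated_text)` on the example above would return the answer text only, starting from the sentence after `[/INST]`.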


## Generate Streaming Text

```
from transformers import TextStreamer

# TextStreamer prints decoded tokens to stdout as they are generated.
text_streamer = TextStreamer(tokenizer)

messages = [
  # System prompt: "You are a multilingual AI model trained to provide Korean instructions for a variety of tasks."
  {'content': '당신은 다양한 작업에 대한 한국어 지침을 제공하도록 훈련된 다국어 AI 모델입니다.', 'role': 'system'},
  # User prompt: "What are traditional Korean foods?"
  {'content': '한국의 전통 음식은 무엇인가요?', 'role': 'user'}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# text = '<s> [INST] 당신은 다양한 작업에 대한 한국어 지침을 제공하도록 훈련된 다국어 AI 모델입니다.\n\n한국의 전통 음식은 무엇인가요? [/INST]'

encodings = tokenizer(text, return_tensors='pt')
terminators = [tokenizer.eos_token_id]
max_new_tokens = 64

# Call generate once with the streamer; the return value holds token IDs, not text.
outputs = model.generate(**encodings, streamer=text_streamer, eos_token_id=terminators, max_new_tokens=max_new_tokens)
# streamed output = '<s>  [INST] 당신은 다양한 작업에 대한 한국어 지침을 제공하도록 훈련된 다국어 AI 모델입니다.\n\n한국의 전통 음식은 무엇인가요? [/INST] 한국의 전통 음식은 다양한 지역과 계절에 따라 다양한 종류가 있습니다. 대표적인 전통 음식은 다음과 같습니다.\n\n1. **비빔밥**: 비빔밥은 다양한 재료를 섞어 만든 밥 위에 양념을 뿌려 먹는 음식입니다.\n2. **김치**: 김치는 한국의 대표적인 발효 식품'
```
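
For debugging, it can help to reproduce the prompt string shown in the `# text = ...` comments above without loading the tokenizer. The sketch below is inferred only from that commented string: it assumes the system message and the user message are joined by a blank line inside a single `[INST] ... [/INST]` pair. The helper `build_prompt` is hypothetical, and the tokenizer's own chat template remains authoritative.

```python
def build_prompt(messages: list[dict[str, str]]) -> str:
    """Approximate the single-turn prompt string shown in the examples above.

    Assumed layout: '<s> [INST] ', then the system message, a blank line,
    the user message, and ' [/INST]'. Multi-turn chats are not handled.
    """
    system = [m["content"] for m in messages if m["role"] == "system"]
    user = [m["content"] for m in messages if m["role"] == "user"]
    if len(user) != 1 or len(system) > 1:
        raise ValueError("This sketch handles one user turn and at most one system message.")
    body = "\n\n".join(system + user)
    return f"<s> [INST] {body} [/INST]"
```

Comparing `build_prompt(messages)` with the output of `tokenizer.apply_chat_template(...)` is a quick way to check whether the assumed layout matches the template actually shipped with the model.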



## License

<img src="https://huggingface.co/TwinDoc/RedWhale-tv-10.8B-v1.0/resolve/main/license__icon.png"  width="324">

The content of this project, created by AGILESODA, is licensed under the [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/).

## Citation

```
@misc{vo2024redwhaleadaptedkoreanllm,
      title={RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining}, 
      author={Anh-Dung Vo and Minseong Jung and Wonbeen Lee and Daewoo Choi},
      year={2024},
      eprint={2408.11294},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2408.11294}, 
}
```


**Built with:**

<a href="http://www.agilesoda.com/sub/twin_doc.php">
    <img src="https://huggingface.co/TwinDoc/RedWhale-tv-10.8B-v1.0/resolve/main/company_agilesoda_twindoc__icon.png" alt="AgileSoda TwinDoc Icon">
</a>