---
license: cc-by-nc-4.0
language:
- ko
base_model:
- TwinDoc/RedWhale-tv-10.8B-v1.0
pipeline_tag: text-generation
library_name: transformers
---

# Model Card for RedWhale-tv-10.8B-ipt-v0.1

<img src="https://huggingface.co/TwinDoc/RedWhale-tv-10.8B-v1.0/resolve/main/company_agilesoda__icon_RWTV.png" width="648">

## Model Description

**RedWhale-tv-10.8B-ipt-v0.1** is an **Instruction Pre-Trained (IPT)** version of **RedWhale-tv-10.8B-v1.0**.

To use the model, please request access to the repository.

## About the Model

- **Name:** TwinDoc/RedWhale-tv-10.8B-ipt-v0.1
- **Foundation Model:** RedWhale-tv-10.8B-v1.0
- **Train Corpus:** being updated
- **Developed by:** AGILESODA (μ• μžμΌμ†Œλ‹€)
- **Model type:** mistral
- **Language(s) (NLP):** Korean
- **License:** cc-by-nc-sa-4.0
- **Paper:** [RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining](https://arxiv.org/abs/2408.11294)

## Load the Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

YOUR_HF_TOKEN_READ = "hf_..."  # a read-access token for the gated repository

model_name_or_path = "TwinDoc/RedWhale-tv-10.8B-ipt-v0.1"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, token=YOUR_HF_TOKEN_READ)
model = AutoModelForCausalLM.from_pretrained(model_name_or_path, token=YOUR_HF_TOKEN_READ)
```

## Generate Text

```python
messages = [
    {'content': '당신은 λ‹€μ–‘ν•œ μž‘μ—…μ— λŒ€ν•œ ν•œκ΅­μ–΄ 지침을 μ œκ³΅ν•˜λ„λ‘ ν›ˆλ ¨λœ λ‹€κ΅­μ–΄ AI λͺ¨λΈμž…λ‹ˆλ‹€.', 'role': 'system'},
    {'content': 'ν•œκ΅­μ˜ 전톡 μŒμ‹μ€ λ¬΄μ—‡μΈκ°€μš”?', 'role': 'user'}
]

# With tokenize=False this returns a plain string, so return_tensors is not needed.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# text = '<s> [INST] 당신은 λ‹€μ–‘ν•œ μž‘μ—…μ— λŒ€ν•œ ν•œκ΅­μ–΄ 지침을 μ œκ³΅ν•˜λ„λ‘ ν›ˆλ ¨λœ λ‹€κ΅­μ–΄ AI λͺ¨λΈμž…λ‹ˆλ‹€.\n\nν•œκ΅­μ˜ 전톡 μŒμ‹μ€ λ¬΄μ—‡μΈκ°€μš”? [/INST]'

encodings = tokenizer(text, return_tensors='pt')
terminators = [tokenizer.eos_token_id]
max_new_tokens = 64

outputs = model.generate(**encodings, eos_token_id=terminators, max_new_tokens=max_new_tokens)
generated_text = tokenizer.batch_decode(outputs)[0]
# generated_text = '<s> [INST] 당신은 λ‹€μ–‘ν•œ μž‘μ—…μ— λŒ€ν•œ ν•œκ΅­μ–΄ 지침을 μ œκ³΅ν•˜λ„λ‘ ν›ˆλ ¨λœ λ‹€κ΅­μ–΄ AI λͺ¨λΈμž…λ‹ˆλ‹€.\n\nν•œκ΅­μ˜ 전톡 μŒμ‹μ€ λ¬΄μ—‡μΈκ°€μš”? [/INST] ν•œκ΅­μ˜ 전톡 μŒμ‹μ€ λ‹€μ–‘ν•œ 지역과 κ³„μ ˆμ— 따라 λ‹€μ–‘ν•œ μ’…λ₯˜κ°€ μžˆμŠ΅λ‹ˆλ‹€. λŒ€ν‘œμ μΈ 전톡 μŒμ‹μ€ λ‹€μŒκ³Ό κ°™μŠ΅λ‹ˆλ‹€.\n\n1. **λΉ„λΉ”λ°₯**: λΉ„λΉ”λ°₯은 λ‹€μ–‘ν•œ 재료λ₯Ό μ„žμ–΄ λ§Œλ“  λ°₯ μœ„μ— 양념을 뿌렀 λ¨ΉλŠ” μŒμ‹μž…λ‹ˆλ‹€.\n2. **κΉ€μΉ˜**: κΉ€μΉ˜λŠ” ν•œκ΅­μ˜ λŒ€ν‘œμ μΈ 발효 μ‹ν’ˆ'
```
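
For reference, the Mistral-style prompt string that `tokenizer.apply_chat_template` produces can be assembled by hand. The sketch below is illustrative only; the exact token and spacing layout is an assumption based on the commented template output above, and real code should always use `apply_chat_template` instead of this hypothetical `build_prompt` helper.

```python
def build_prompt(system: str, user: str) -> str:
    """Assemble a Mistral-style [INST] prompt by hand.

    Illustrative sketch only: mirrors the commented apply_chat_template
    output above; prefer tokenizer.apply_chat_template in real code.
    """
    return f"<s> [INST] {system}\n\n{user} [/INST]"

prompt = build_prompt(
    "당신은 λ‹€μ–‘ν•œ μž‘μ—…μ— λŒ€ν•œ ν•œκ΅­μ–΄ 지침을 μ œκ³΅ν•˜λ„λ‘ ν›ˆλ ¨λœ λ‹€κ΅­μ–΄ AI λͺ¨λΈμž…λ‹ˆλ‹€.",
    "ν•œκ΅­μ˜ 전톡 μŒμ‹μ€ λ¬΄μ—‡μΈκ°€μš”?",
)
print(prompt)
```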


## Generate Streaming Text

```python
from transformers import TextStreamer

text_streamer = TextStreamer(tokenizer)

messages = [
    {'content': '당신은 λ‹€μ–‘ν•œ μž‘μ—…μ— λŒ€ν•œ ν•œκ΅­μ–΄ 지침을 μ œκ³΅ν•˜λ„λ‘ ν›ˆλ ¨λœ λ‹€κ΅­μ–΄ AI λͺ¨λΈμž…λ‹ˆλ‹€.', 'role': 'system'},
    {'content': 'ν•œκ΅­μ˜ 전톡 μŒμ‹μ€ λ¬΄μ—‡μΈκ°€μš”?', 'role': 'user'}
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# text = '<s> [INST] 당신은 λ‹€μ–‘ν•œ μž‘μ—…μ— λŒ€ν•œ ν•œκ΅­μ–΄ 지침을 μ œκ³΅ν•˜λ„λ‘ ν›ˆλ ¨λœ λ‹€κ΅­μ–΄ AI λͺ¨λΈμž…λ‹ˆλ‹€.\n\nν•œκ΅­μ˜ 전톡 μŒμ‹μ€ λ¬΄μ—‡μΈκ°€μš”? [/INST]'

encodings = tokenizer(text, return_tensors='pt')
terminators = [tokenizer.eos_token_id]
max_new_tokens = 64

# The streamer prints tokens to stdout as they are generated.
outputs = model.generate(**encodings, streamer=text_streamer, eos_token_id=terminators, max_new_tokens=max_new_tokens)
generated_text = tokenizer.batch_decode(outputs)[0]
# generated_text = '<s> [INST] 당신은 λ‹€μ–‘ν•œ μž‘μ—…μ— λŒ€ν•œ ν•œκ΅­μ–΄ 지침을 μ œκ³΅ν•˜λ„λ‘ ν›ˆλ ¨λœ λ‹€κ΅­μ–΄ AI λͺ¨λΈμž…λ‹ˆλ‹€.\n\nν•œκ΅­μ˜ 전톡 μŒμ‹μ€ λ¬΄μ—‡μΈκ°€μš”? [/INST] ν•œκ΅­μ˜ 전톡 μŒμ‹μ€ λ‹€μ–‘ν•œ 지역과 κ³„μ ˆμ— 따라 λ‹€μ–‘ν•œ μ’…λ₯˜κ°€ μžˆμŠ΅λ‹ˆλ‹€. λŒ€ν‘œμ μΈ 전톡 μŒμ‹μ€ λ‹€μŒκ³Ό κ°™μŠ΅λ‹ˆλ‹€.\n\n1. **λΉ„λΉ”λ°₯**: λΉ„λΉ”λ°₯은 λ‹€μ–‘ν•œ 재료λ₯Ό μ„žμ–΄ λ§Œλ“  λ°₯ μœ„μ— 양념을 뿌렀 λ¨ΉλŠ” μŒμ‹μž…λ‹ˆλ‹€.\n2. **κΉ€μΉ˜**: κΉ€μΉ˜λŠ” ν•œκ΅­μ˜ λŒ€ν‘œμ μΈ 발효 μ‹ν’ˆ'
```
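
Conceptually, `TextStreamer` just emits tokens to stdout as `generate` produces them, stopping at EOS or the `max_new_tokens` budget. The toy loop below illustrates that behavior with a stand-in token list; it is a conceptual sketch only and does not use the model, the tokenizer, or the real `TextStreamer` class.

```python
def stream_tokens(tokens, eos_token, max_new_tokens):
    """Toy illustration of streamed decoding: emit tokens one at a time,
    stopping at the EOS token or the max_new_tokens budget, whichever
    comes first. Stand-in for what TextStreamer does during generate."""
    emitted = []
    for i, tok in enumerate(tokens):
        if i >= max_new_tokens or tok == eos_token:
            break
        emitted.append(tok)
        print(tok, end=" ", flush=True)  # printed incrementally, like TextStreamer
    return emitted

stream_tokens(["ν•œκ΅­μ˜", "전톡", "μŒμ‹μ€", "</s>"], "</s>", 64)
```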


## License

<img src="https://huggingface.co/TwinDoc/RedWhale-tv-10.8B-v1.0/resolve/main/license__icon.png" width="324">

The content of this project, created by AGILESODA, is licensed under the [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.
100
+
101
+ ## Citation
102
+
103
+ ```
104
+ @misc{vo2024redwhaleadaptedkoreanllm,
105
+ title={RedWhale: An Adapted Korean LLM Through Efficient Continual Pretraining},
106
+ author={Anh-Dung Vo and Minseong Jung and Wonbeen Lee and Daewoo Choi},
107
+ year={2024},
108
+ eprint={2408.11294},
109
+ archivePrefix={arXiv},
110
+ primaryClass={cs.CL},
111
+ url={https://arxiv.org/abs/2408.11294},
112
+ }
113
+ ```

**Built with:**

<a href="http://www.agilesoda.com/sub/twin_doc.php">
  <img src="https://huggingface.co/TwinDoc/RedWhale-tv-10.8B-v1.0/resolve/main/company_agilesoda_twindoc__icon.png" alt="AgileSoda TwinDoc Icon">
</a>