---
license: apache-2.0
language:
- ko
base_model:
- openchat/openchat_3.5
pipeline_tag: text-generation
---
### β›± ktdsbaseLM v0.11 was developed on top of openchat3.5 as its foundation model so that it can be applied to Korean and to Korea's diverse cultural contexts.
### Built with self-produced Korean data covering 53 domains, it is a model that understands Korean social values and culture. ✌

# ❢ Model Description
- Model name and key features:
  KTDSbaseLM v0.11 is a Mistral 7B / openchat3.5-based model, fine-tuned from OpenChat 3.5 with supervised fine-tuning (SFT).
  It is designed to understand Korean and Korea's diverse cultural contexts ✨✨ and reflects Korean social values and culture through self-produced Korean data covering 135 domains.
  Its key capabilities include text generation, conversational inference, document summarization, question answering, sentiment analysis, and a wide range of other NLP tasks,
  and it can be applied in fields such as law, finance, science, education, business, and cultural research.
- Model architecture: KTDSBaseLM v0.11 is a high-performance language model built on Mistral 7B, with 7 billion parameters.
  It uses OpenChat 3.5 as its foundation model and was trained with supervised fine-tuning (SFT) to specialize in the Korean language and Korean culture (a minimal SFT sketch follows below).
  Mistral 7B's lightweight architecture provides fast inference and memory efficiency and is well suited to a variety of NLP tasks.
  It performs strongly on tasks such as text generation, question answering, document summarization, and sentiment analysis.

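The exact fine-tuning pipeline is not published in this card, so the following is only a minimal sketch of what a LoRA-based SFT run on the OpenChat 3.5 base could look like. The dataset path, LoRA settings, and training hyperparameters are illustrative assumptions, not the values actually used.

<pre><code>
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from peft import LoraConfig, get_peft_model

base_model_id = "openchat/openchat_3.5"  # base model named in this card
tokenizer = AutoTokenizer.from_pretrained(base_model_id)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for batch padding
model = AutoModelForCausalLM.from_pretrained(
    base_model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Illustrative LoRA configuration; the adapter settings actually used are not documented here.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# Hypothetical JSONL file in the {"prompt": ..., "completion": ...} format described in section ❷ below.
dataset = load_dataset("json", data_files="train.jsonl", split="train")

def tokenize(example):
    # Concatenate prompt and completion into a single causal-LM training sequence.
    text = example["prompt"] + example["completion"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-out", per_device_train_batch_size=2,
                           num_train_epochs=1, learning_rate=2e-5, bf16=True),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
</code></pre>
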
# ❷ Training Data
- ktdsbaseLM v0.11 was trained on a self-developed dataset of 3.6GB in total, covering 2.33 million examples of Q&A, summarization, classification, and other tasks.
  Of these, 1.33 million are multiple-choice questions spanning 53 domains, including Korean history, social studies, finance, law, tax, mathematics, biology, physics, and chemistry,
  and were trained with Chain-of-Thought reasoning. A further 1.3 million open-ended questions cover 38 domains, including Korean history, finance, law, tax, and mathematics.
  The training data also includes examples that teach the model to understand Korean social values and human emotions and to respond according to the given instructions.
- Training instruction dataset format (a small serialization sketch follows below):
<pre><code>{"prompt": "prompt text", "completion": "ideal generated text"}</code></pre>

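To make the format concrete, here is a purely illustrative sketch of how records in this prompt/completion format could be written out as a JSONL training file; the example contents are hypothetical and not taken from the actual training set.

<pre><code>
import json

# Hypothetical records in the documented {"prompt", "completion"} format.
records = [
    {"prompt": "λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ” μ–΄λ””μΈκ°€μš”?", "completion": "λŒ€ν•œλ―Όκ΅­μ˜ μˆ˜λ„λŠ” μ„œμšΈμž…λ‹ˆλ‹€."},
    {"prompt": "λ‹€μŒ 글을 ν•œ λ¬Έμž₯으둜 μš”μ•½ν•˜μ„Έμš”: ...", "completion": "..."},
]

# One JSON object per line (JSON Lines), which datasets.load_dataset("json", ...) can read directly.
with open("train.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
</code></pre>
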
# ❸ Use Cases
ktdsbaseLM v0.11 can be applied in a wide range of fields. For example:
- Education: question answering and explanation generation for study material in subjects such as history, mathematics, and science.
- Business: answering legal, financial, and tax-related queries and summarizing documents.
- Research and culture: NLP tasks tailored to Korean society and culture, sentiment analysis, document generation, and translation.
- Customer service: generating conversations with users and providing personalized responses.
- The model is highly versatile across a wide variety of NLP tasks.

# ❹ Limitations β›ˆβ›ˆ
- ktdsBaseLM v0.11 is specialized for the Korean language and Korean culture.
  However, due to a lack of data in certain areas (e.g., the latest international material or highly specialized fields),
  the accuracy of its responses about other languages or cultures may be lower.
  It may also show limited reasoning ability on problems that require complex logical thinking,
  and if biased data is included in training, there is a possibility of biased responses being generated.

# ❺ How to Use
<pre><code>
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = ""  #@param {type: "string"}  # set to this model's repository ID on the Hugging Face Hub
device = "auto"  #@param {type: "string"}

# Optional 4-bit quantized loading; drop quantization_config to load in full precision.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device,
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

input_text = "μ•ˆλ…•ν•˜μ„Έμš”."
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_length=1024)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
</code></pre>

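For interactive use, the same model can stream its answer token by token and be steered with sampling parameters. This is a small sketch assuming the model and tokenizer loaded above; the question and sampling values are illustrative.

<pre><code>
import torch
from transformers import TextStreamer

# Stream decoded tokens to stdout as they are generated.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

question = "ν•œκ΅­μ‚¬μ—μ„œ ν›ˆλ―Όμ •μŒμ΄ λ°˜ν¬λœ ν•΄λŠ” μ–Έμ œμΈκ°€μš”?"
inputs = tokenizer(question, return_tensors="pt").to(model.device)

with torch.no_grad():
    model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
        streamer=streamer,
    )
</code></pre>
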
## βœ… In addition to OpenChat, KTDS plans to release LLMs fine-tuned on Korean culture and knowledge across many domains for other leading LLMs such as LLaMA, Polyglot, and EEVE, tailored to better understand and generate content specific to Korean contexts.