Update README.md
---
license: apache-2.0
language:
- ko
base_model:
- openchat/openchat_3.5
pipeline_tag: text-generation
---

### ⛱ ktdsbaseLM v0.11 is a model built on openchat3.5 as its foundation model, developed so it can be applied to Korean and to Korea's diverse cultural contexts; it uses self-built Korean data covering 53 domains to understand Korean social values and culture. ✌

# ❶ Model Description

- Model name and key features:
KTDSbaseLM v0.11 is a Mistral 7B / openchat3.5-based model, fine-tuned from OpenChat 3.5 using SFT.
It is designed to understand Korean and Korea's diverse cultural contexts, and it draws on self-built Korean data covering 135 domains to reflect the values and culture of Korean society.
Its main capabilities include text generation, conversational inference, document summarization, question answering, sentiment analysis, and a variety of other NLP tasks,
and it can be applied across fields such as law, finance, science, education, business, and cultural research.

- Model architecture: KTDSBaseLM v0.11 is a high-performance language model built on Mistral 7B, with 7 billion parameters.
It uses OpenChat 3.5 as its foundation model and was trained with SFT (supervised fine-tuning) to deliver performance specialized for the Korean language and culture.
Mistral 7B's lightweight structure ensures fast inference speed and memory efficiency, and it is optimized for a wide range of natural language processing tasks.
This architecture performs well across tasks such as text generation, question answering, document summarization, and sentiment analysis.

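The architecture details above can be sanity-checked against the base model's published configuration. A minimal sketch, assuming only the `transformers` library and the public `openchat/openchat_3.5` repo (this card does not state its own repo id):

<pre><code>
from transformers import AutoConfig

# openchat/openchat_3.5 is Mistral-7B-based, so its config should report the
# Mistral architecture (hidden size 4096, 32 transformer layers).
config = AutoConfig.from_pretrained("openchat/openchat_3.5")
print(config.model_type)                             # mistral
print(config.hidden_size, config.num_hidden_layers)  # 4096 32
</code></pre>
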
# ❷ Training Data

- ktdsbaseLM v0.11 was trained on 3.6GB of self-developed data, 2.33 million examples in total spanning Q&A, summarization, classification, and other tasks.
Of these, 1.33 million are multiple-choice questions across 53 domains, including Korean history, society, finance, law, tax, mathematics, biology, physics, and chemistry,
trained with a Chain-of-Thought approach. A further 1.3 million short-answer questions cover 38 domains, including Korean history, finance, law, tax, and mathematics.
The training data also includes examples that teach the model to understand Korean social values and human emotions and to respond according to the instructions it is given.
- Training Instruction Datasets Format:
<pre><code>{"prompt": "prompt text", "completion": "ideal generated text"}</code></pre>

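For illustration, records in this format are commonly stored as JSON Lines and loaded with the `datasets` library; the file name below is a hypothetical placeholder, not part of this card:

<pre><code>
from datasets import load_dataset

# Hypothetical file: one {"prompt": ..., "completion": ...} JSON object per line.
dataset = load_dataset("json", data_files="train.jsonl", split="train")
print(dataset[0]["prompt"])
print(dataset[0]["completion"])
</code></pre>
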
# ❸ Use Cases

ktdsbaseLM v0.11 can be used in a wide range of applications. For example:

- Education: question answering and explanation generation for study material in history, mathematics, science, and other subjects.
- Business: answers to legal, financial, and tax-related queries, plus document summarization.
- Research and culture: NLP tasks tailored to Korean society and culture, sentiment analysis, document generation, and translation.
- Customer service: conversation generation and personalized responses for users.

This model is highly versatile across a variety of natural language processing tasks.

# ❹ Limitations

- ktdsBaseLM v0.11 is specialized for the Korean language and Korean culture.
However, due to a lack of data in certain areas (e.g., the latest international material or specialized fields),
its responses about other languages or cultures may be less accurate.
It may also show limited reasoning ability on problems that require complex logical thinking,
and if biased data is included in training, there is a possibility of biased responses being generated.

# ❺ Usage Instructions

<pre><code>
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# bnb_config was referenced but never defined in the original snippet;
# this is a minimal 4-bit quantization setup as one plausible reading.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = ""  # set to this model's Hugging Face repo id or a local path

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

input_text = "안녕하세요."  # "Hello."
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_length=1024)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
</code></pre>
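
Because the base model is openchat/openchat_3.5, conversational prompts are best built with the tokenizer's chat template rather than raw text. A minimal sketch, reusing `model` and `tokenizer` from the snippet above and assuming the tokenizer ships a chat template (openchat_3.5's does):

<pre><code>
messages = [{"role": "user", "content": "안녕하세요."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(input_ids, max_new_tokens=256)

# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
</code></pre>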

## ✅ Beyond openchat, ktds plans to release LLMs fine-tuned on Korea's culture and knowledge across diverse domains, based on other leading LLMs such as LLaMA, Polyglot, and EEVE.