---
language:
- en
license: mit
library_name: transformers
tags:
- code
base_model:
- google/gemma-1.1-2b-it
datasets:
- kreimben/leetcode_with_youtube_captions
- kreimben/leetcode_user_submissions
widget:
- text: explain about two sum problem. from brute force approach to the most advanced
    algorithms.
  example_title: two sum example
- text: explain about leetcode 72 edit distance. i don't get even the approach.
  example_title: edit distance example
- text: explain about leetcode 139 Word Break. please give me the approach.
  example_title: word break example
inference:
  parameters:
    max_new_tokens: 250
    temperature: 0.3
pipeline_tag: text-generation
---

# CodeMind

## Introduction
CodeMind is a language model that helps with solving coding-test problems and supports learning. It was fine-tuned on captions from LeetCode walkthrough videos and on users' posts so that it can give answers that are somewhat more specialized for coding tests.

## Model details
- **Model name**: CodeMind
- **Base model**: google/gemma-1.1-2b-it
- **Training language**: English
- **Model size**: 2.51B parameters

## Team
- NLP: 3 members
- SRE: 2 members

## Key features
- Explains problem types and approaches
- Generates solution code

## Training data
- [**LeetCode user submissions**](https://huggingface.co/datasets/kreimben/leetcode_user_submissions): Python solutions to a variety of algorithm problems
- [**YouTube captions**](https://huggingface.co/datasets/kreimben/leetcode_with_youtube_captions): explanations and step-by-step guides for LeetCode problems

## Libraries used
- [transformers](https://github.com/huggingface/transformers): library for natural language processing models
- [datasets](https://github.com/huggingface/datasets): library for dataset processing and management
- [bitsandbytes](https://github.com/TimDettmers/bitsandbytes): library for optimized operations
- [peft](https://github.com/huggingface/peft): library for fine-tuning
- [trl](https://github.com/huggingface/trl): library for language model tuning
- [pandas](https://github.com/pandas-dev/pandas): library for data manipulation

## File structure
- **dataset/**: contains the dataset files.
- **eval/**: contains the evaluation scripts.
- **fine-tuning/**: contains the fine-tuning notebooks and scripts.
  - `gemma-1.1-2b-it peft qlora.ipynb`: notebook with the details of the fine-tuning process.
- **demo.ipynb**: demo notebook with model usage examples.
- **requirements.txt**: list of project dependencies.
- **utils.py**: utility functions.

## Usage
The model is available through the Hugging Face model hub and can be integrated into applications via the API. Given a coding problem or a programming question, it generates a relevant explanation, code snippet, or guide.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("kreimben/CodeMind-gemma-2b")
model = AutoModelForCausalLM.from_pretrained("kreimben/CodeMind-gemma-2b")

# Tokenize the question and return PyTorch tensors.
inputs = tokenizer("Enter your coding problem or question here", return_tensors="pt")
# Without max_new_tokens, generate() falls back to a short default length;
# 250 mirrors the widget setting in this card's metadata.
outputs = model.generate(**inputs, max_new_tokens=250)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Training process

### Loading the model and tokenizer
```python
import os

import torch  # needed for torch.bfloat16 below
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with bfloat16 compute (QLoRA-style loading).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_id = 'google/gemma-1.1-2b-it'
token = os.getenv('HF_READ')  # Hugging Face read token taken from the environment

model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"": 0}, token=token)
model.config.use_cache = False  # the KV cache is not used together with gradient checkpointing
model.gradient_checkpointing_enable()

tokenizer = AutoTokenizer.from_pretrained(model_id, token=token)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
```

### LoRA configuration and model preparation
```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import bitsandbytes as bnb

model = prepare_model_for_kbit_training(model)

def find_all_linear_names(model):
    """Collect the names of all 4-bit linear layers to use as LoRA targets."""
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    # The output head is excluded from the LoRA targets.
    if 'lm_head' in lora_module_names:
        lora_module_names.remove('lm_head')
    return list(lora_module_names)

modules = find_all_linear_names(model)
lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    target_modules=modules,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
```
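
For intuition about the `r=64` and `lora_alpha=32` settings above, the sketch below works out the LoRA scaling factor and the number of trainable parameters that one adapter adds to a single linear layer. It assumes the standard LoRA formulation (the update is scaled by `lora_alpha / r`, with two low-rank matrices of shapes `r x in` and `out x r`). The function names and the 2048 x 2048 layer shape are hypothetical, chosen for illustration only, and are not taken from the repository or the model configuration.

```python
def lora_scaling(r: int, lora_alpha: int) -> float:
    """Scaling applied to the low-rank update in standard LoRA: alpha / r."""
    return lora_alpha / r


def lora_param_count(in_features: int, out_features: int, r: int) -> int:
    """Trainable parameters added by one adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r


scaling = lora_scaling(r=64, lora_alpha=32)

# Hypothetical layer shape, used only to make the arithmetic concrete.
added = lora_param_count(in_features=2048, out_features=2048, r=64)
full = 2048 * 2048

print(f"scaling factor: {scaling}")
print(f"adapter parameters: {added} ({added / full:.2%} of the full weight matrix)")
```

With these values the adapter update is scaled by 0.5, and for the illustrative layer the adapter trains about 6% as many parameters as the frozen weight matrix holds.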

### Data preparation
```python
import pandas as pd
from datasets import Dataset, load_dataset

# User submissions (Python only): keep the columns used to build the prompt.
submission_dataset = load_dataset('kreimben/leetcode_user_submissions_only_python', split='train').to_pandas()
submission_dataset = submission_dataset[['title', 'question_hints', 'question_content', 'content']]
# YouTube captions: same columns, with the caption text renamed to `content`.
captions_dataset = load_dataset('kreimben/leetcode_with_youtube_captions', split='train').to_pandas()
captions_dataset = captions_dataset[['title', 'question_hints', 'question_content', 'cc_content']]
captions_dataset = captions_dataset.rename(columns={'cc_content': 'content'})

dataset = pd.concat([submission_dataset, captions_dataset])
del submission_dataset, captions_dataset

dataset = Dataset.from_pandas(dataset)
GEMMA_2B_IT_MODEL_PREFIX_TEXT = "Below is an coding test problem. Solve the question."

def generate_prompt(data_point):
    # Gemma chat format: one user turn followed by one model turn.
    # The prompt wording (typos included) is the text used for training.
    return (
        f"<bos><start_of_turn>user {GEMMA_2B_IT_MODEL_PREFIX_TEXT}\n\n"
        f"I don't know {data_point['title']} problem. give me the insight or appoach.\n\n"
        f"this is problem's hint.\n{data_point['question_hints']}\n\n"
        f"here are some content of question.\n{data_point['question_content']}<end_of_turn>\n"
        f"<start_of_turn>model {data_point['content']}<end_of_turn><eos>"
    )

text_column = [generate_prompt(data_point) for data_point in dataset]
dataset = dataset.add_column("prompt", text_column)
```
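
At inference time it may help to phrase a request in the same shape as the training prompt above while leaving the model turn open. The helper below is a minimal sketch under that assumption: `build_inference_prompt` and `PREFIX` are hypothetical names that are not part of the repository, the sample problem text is made up, and whether matching the training format improves answers has not been measured here.

```python
# Copied verbatim from the training prompt prefix (typos included).
PREFIX = "Below is an coding test problem. Solve the question."


def build_inference_prompt(title: str, hints: str, content: str) -> str:
    """Mirror the user turn of the training prompt and open the model turn."""
    return (
        f"<bos><start_of_turn>user {PREFIX}\n\n"
        f"I don't know {title} problem. give me the insight or appoach.\n\n"
        f"this is problem's hint.\n{hints}\n\n"
        f"here are some content of question.\n{content}<end_of_turn>\n"
        f"<start_of_turn>model "
    )


prompt = build_inference_prompt(
    title="Two Sum",
    hints="Try a hash map.",
    content="Given an array nums and a target, return the indices of two numbers that add up to target.",
)
print(prompt)
```

Because the string already starts with `<bos>`, it is probably safest to tokenize it with `add_special_tokens=False` so that the tokenizer does not prepend a second one.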

### Training
```python
from trl import SFTTrainer
import transformers
import torch

tokenizer.pad_token = tokenizer.eos_token
torch.cuda.empty_cache()

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="prompt",  # column built by generate_prompt
    peft_config=lora_config,
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=transformers.TrainingArguments(
        output_dir='out',
        bf16=True,
        max_steps=100,
        warmup_steps=50,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
        optim="paged_adamw_8bit",
        logging_steps=20,
        report_to='wandb',
    ),
)

trainer.train()
```
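
With `max_steps=100` and `warmup_steps=50`, half of the run is spent warming up. The sketch below illustrates how the learning-rate multiplier would evolve, assuming the linear-warmup, linear-decay schedule that `TrainingArguments` uses when `lr_scheduler_type` is left at its default. It is a stand-alone re-derivation rather than a call into the library, `linear_schedule_multiplier` is a hypothetical name, and a single GPU is assumed for the sequence count.

```python
def linear_schedule_multiplier(step: int, warmup_steps: int = 50, max_steps: int = 100) -> float:
    """Learning-rate multiplier: ramps from 0 to 1 during warmup, then decays linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (max_steps - step) / max(1, max_steps - warmup_steps))


# Sequences per optimizer step = per-device batch size x gradient accumulation steps.
sequences_per_step = 1 * 1
total_sequences = sequences_per_step * 100  # max_steps

for step in (0, 25, 50, 75, 100):
    print(step, linear_schedule_multiplier(step))
print("training sequences seen:", total_sequences)
```

Under these assumptions the peak learning rate is reached only at step 50 and starts decaying immediately afterwards, and the whole run covers 100 training sequences.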

## Evaluation
The model's performance was evaluated as follows:

| Metric       | Value  |
|--------------|--------|
| Average      | 41.62  |
| ARC          | 41.81  |
| HellaSwag    | 59.03  |
| MMLU         | 37.26  |
| TruthfulQA   | 43.45  |
| Winogrande   | 59.91  |
| GSM8K        | 8.26   |
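
The Average row appears to be the arithmetic mean of the six benchmark scores. A quick check using only the values from the table:

```python
# Benchmark scores copied from the evaluation table.
scores = {
    "ARC": 41.81,
    "HellaSwag": 59.03,
    "MMLU": 37.26,
    "TruthfulQA": 43.45,
    "Winogrande": 59.91,
    "GSM8K": 8.26,
}

average = sum(scores.values()) / len(scores)
print(round(average, 2))  # matches the reported Average of 41.62
```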

## Limitations and ethical considerations
- The model's output is based on its training data and may not always be accurate.
- Verify the model's output before relying on it for important decisions or real-world problem solving.