2023 Sungkyunkwan University Summer Intensive Industry-Academia Collaboration Project: VAIV
## A GPT-based natural (Friendly) and ethical (Harmless) everyday conversational chatbot model
### Github : https://github.com/VAIV-2023/RLHF-Korean-Friendly-LLM
# Research Background and Objectives
Implement a natural and ethical Korean everyday-conversation chatbot model based on GPT-NeoX (Polyglot-Ko).
![image](https://github.com/VAIV-2023/RLHF-Korean-Friendly-LLM/assets/79634774/18bb1ab4-8924-4b43-b538-1e6529297217)
# Development Overview
- Self-Instruct: data augmentation using GPT-4
- RLHF (Reinforcement Learning from Human Feedback): reinforcement learning that reflects human preferences
- DeepSpeed: a new memory-optimization technology for large-scale distributed deep learning
- Task 1: Build datasets for each stage of reinforcement learning
- Task 2: Instruction-tune the SFT model
- Task 3: Implement Reward model ver1, 2, 3
- Task 4: Implement the final model with RLHF and DeepSpeedChat (https://huggingface.co/Trofish/KULLM-RLHF)
# Task1. Building Datasets for Each Stage of Reinforcement Learning
![image](https://github.com/VAIV-2023/RLHF-Korean-Friendly-LLM/assets/79634774/4bb56e36-0c49-4d15-a2c6-2824867419a8)
![Screenshot 2024-06-18 at 11 05 55 AM](https://github.com/VAIV-2023/RLHF-Korean-Friendly-LLM/assets/79634774/2f637065-fa25-4402-b319-113ff4c6e1a9)
![Screenshot 2024-06-18 at 11 06 08 AM](https://github.com/VAIV-2023/RLHF-Korean-Friendly-LLM/assets/79634774/2a6c2e9b-1292-43b9-b5e7-5ced3643988d)
# Task2. SFT Model Fine-tuning
## Baseline Model
- Uses [**KULLM**](https://github.com/nlpai-lab/KULLM), a Korean LLM developed by the Korea University NLP & AI Lab and the HIAI Research Institute
## Datasets
![image](https://github.com/VAIV-2023/VAIV2023/assets/79634774/085610db-3714-43c3-855b-58baad2f4e8b)
## SFT Model Finetuning
![image](https://github.com/VAIV-2023/VAIV2023/assets/79634774/0f5e36fa-20a8-43f9-bd03-5f8224d5e9d0)
* Training used the A100 40GB GPU provided by Google Colab
## SFT Model Evaluation
![image](https://github.com/VAIV-2023/VAIV2023/assets/79634774/9fe9e5aa-6dc7-4c7b-8529-45e0a75db9c6)
![image](https://github.com/VAIV-2023/VAIV2023/assets/79634774/a994a960-db7c-4e75-a11a-d7755d372722)
* G-Eval: https://arxiv.org/abs/2303.16634
# Task3-1. Reward Model ver1 Implementation
## Baseline Model
- Uses **Polyglot-Ko**, a large-scale Korean language model developed by EleutherAI
- The 1.3b and 5.8b models were each tested
## Datasets
![image](https://github.com/VAIV-2023/RLHF-Korean-Friendly-LLM/assets/79634774/0082da9b-b0b8-4089-8647-cffa5ce724fb)
- InstructGPT's dataset construction method
  - The Reward model training set uses prompts already used for SFT training (1,500 prompts; everyday conversation : hate speech = 2:1) plus new prompts (1,000 prompts from the translated DeepSpeedChat dataset)
  - The SFT model generates K responses per prompt, and the responses are labeled with a ranking
- Dataset labeling
  - InstructGPT relied on human labelers, but GPT-4 and G-Eval were used here for consistent evaluation and to save time
  - Of the two responses generated by the SFT model, the one with the higher total G-Eval score is selected as the Chosen response
  - The G-Eval evaluation prompt differs by dataset type
  - ![image](https://github.com/VAIV-2023/RLHF-Korean-Friendly-LLM/assets/79634774/7d7117d0-02e9-42dd-8ce3-5244cf726bf8)
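
The labeling rule above (the response with the higher summed G-Eval score becomes Chosen) can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function name `label_pair`, the criterion names, the scores, and the tie handling are all hypothetical assumptions.

```python
from typing import Dict, Tuple


def label_pair(
    response_a: str,
    scores_a: Dict[str, float],
    response_b: str,
    scores_b: Dict[str, float],
) -> Tuple[str, str]:
    """Return (chosen, rejected) by comparing summed G-Eval scores."""
    total_a = sum(scores_a.values())
    total_b = sum(scores_b.values())
    # Tie handling is an assumption: on a tie, response A stays as chosen.
    if total_a >= total_b:
        return response_a, response_b
    return response_b, response_a


# Hypothetical criteria and scores, for illustration only.
chosen, rejected = label_pair(
    "response A", {"naturalness": 4, "harmlessness": 5},
    "response B", {"naturalness": 3, "harmlessness": 4},
)
```

In practice a tied pair might be dropped instead of kept, since it carries little preference signal; the source does not say how ties were handled.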
## Reward v1 Model Finetuning
![image](https://github.com/VAIV-2023/RLHF-Korean-Friendly-LLM/assets/79634774/da4d9b15-ec91-44bb-84d9-f28aeffd16ad)
- According to the InstructGPT paper, Reward model performance degrades sharply when the model overfits, so the number of epochs was set to 1
- The paper also reports that other hyper-parameters, such as batch size and learning rate, have little effect on performance
- Total training time was 4 minutes on a Colab A100 40GB
## Reward v1 Model Evaluation
![image](https://github.com/VAIV-2023/RLHF-Korean-Friendly-LLM/assets/79634774/c21be612-b26d-4a1c-a1e2-6a99442660da)
- Reward Model Template
  - "아래는 작업을 설명하는 명령어입니다. 요청을 적절히 완료하는 응답을 작성하세요. \n\n ### 명령어:\n{prompt}\n\n ### 응답:\n"
  - English gloss: "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: {prompt} ### Response:"
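
As a rough illustration of how the template above could be filled in, the sketch below formats a prompt and appends a response. The names `REWARD_TEMPLATE` and `build_reward_input` are hypothetical, and appending the response directly after the template is an assumption about how the scored text is assembled.

```python
# Template string as given above; {prompt} is its only placeholder.
REWARD_TEMPLATE = (
    "아래는 작업을 설명하는 명령어입니다. 요청을 적절히 완료하는 응답을 작성하세요. "
    "\n\n ### 명령어:\n{prompt}\n\n ### 응답:\n"
)


def build_reward_input(prompt: str, response: str) -> str:
    """Fill the template with a prompt, then append the response to score.

    Appending the response right after the template is an assumption.
    """
    return REWARD_TEMPLATE.format(prompt=prompt) + response


# Example prompt and response are made up for illustration.
text = build_reward_input("안녕하세요?", "안녕하세요! 무엇을 도와드릴까요?")
```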
# Task3-2. Reward Model ver2 Implementation
## Reward Model ver1 Issues
- The implemented Reward Model performed poorly (Accuracy 0.65)
- When Step3 training used Reward Model ver1, the model classified inputs that were not hate speech as hate speech and answered accordingly
## Issue Resolution
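
The source does not define how the accuracy figure is computed. A common definition for reward models, assumed here, is the fraction of (chosen, rejected) pairs in which the chosen response receives the higher score. The function name and the example scores below are hypothetical.

```python
from typing import Sequence, Tuple


def pairwise_accuracy(scored_pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of (chosen_score, rejected_score) pairs ranked correctly."""
    if not scored_pairs:
        raise ValueError("need at least one scored pair")
    # A pair counts as correct only when chosen strictly outscores rejected.
    correct = sum(1 for chosen, rejected in scored_pairs if chosen > rejected)
    return correct / len(scored_pairs)


# Hypothetical reward scores, for illustration only.
example = [(1.2, 0.3), (0.1, 0.4), (2.0, -1.0), (0.5, 0.5)]
acc = pairwise_accuracy(example)
```

Under this definition a model that scores pairs at random lands near 0.5, which gives some context for the 0.65 reported above.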
![image](https://github.com/VAIV-2023/RLHF-Korean-Friendly-LLM/assets/79634774/6f4f0665-a8c7-4903-a626-f37018b7e4c9)
- When both responses were generated by the SFT model (Ver1), the Chosen and Rejected responses differed too little for the Reward model to learn. To prevent this, Ver2 generates responses with two models **(ChatGPT, SFT)**
- Evol-Instruct data was added to improve evaluation performance on General Task responses
- All training datasets were filtered to remove entries that are 15 tokens or shorter or that have a cosine similarity of 0.5 or higher
- Training on hate-speech data (Ver1) led to abnormal responses after Step3 reinforcement learning, so Ver2 was trained with the hate-speech data removed
- In RM-ver1, GPT-4 performed the Chosen/Rejected labeling; here, because of resource constraints, humans labeled only part of the data
  - Everyday conversation dataset
    - Neither ChatGPT nor SFT consistently produced high-quality responses, so humans labeled these pairs directly
  - RLHF Korean translation and Evol-Instruct datasets
    - ChatGPT consistently produced high-quality responses, so the ChatGPT response was labeled Chosen and the SFT response Rejected
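
The filtering step described above can be sketched as follows. Several details are assumptions, because the source does not specify them: tokens are approximated by whitespace splitting in place of the model tokenizer, cosine similarity is computed on bag-of-words counts, similarity is measured between the Chosen and Rejected responses of a pair, and either condition alone removes the pair. All names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

MIN_TOKENS = 15        # responses with this many tokens or fewer are dropped
MAX_SIMILARITY = 0.5   # pairs at or above this similarity are dropped


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity over whitespace tokens (assumption)."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[token] * vb[token] for token in va)
    norm_a = math.sqrt(sum(count * count for count in va.values()))
    norm_b = math.sqrt(sum(count * count for count in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def keep_pair(pair: Dict[str, str]) -> bool:
    """Keep a pair only if both responses are long enough and dissimilar."""
    chosen, rejected = pair["chosen"], pair["rejected"]
    if len(chosen.split()) <= MIN_TOKENS or len(rejected.split()) <= MIN_TOKENS:
        return False
    return cosine_similarity(chosen, rejected) < MAX_SIMILARITY


def filter_pairs(pairs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Apply the length and similarity filters to a list of pairs."""
    return [pair for pair in pairs if keep_pair(pair)]
```

Dropping near-duplicate pairs matches the motivation stated above: pairs whose two responses barely differ give the Reward model little to learn from.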
## Reward Model ver2 Evaluation
![image](https://github.com/VAIV-2023/RLHF-Korean-Friendly-LLM/assets/79634774/834cb645-7909-464b-b072-635aaac8eeff)
# Task4. Final Model Implementation with RLHF and DeepSpeedChat
- Uses DeepSpeedChat, which applies DeepSpeed (Microsoft's new memory-optimization technology for large-scale distributed deep learning) to the RLHF process
- A Reward model trained on human preferences is combined with reinforcement learning so that the SFT model reflects those preferences, producing a natural (FRIENDLY) and ethical (HARMLESS) chatbot
## Baseline Models
- Actor Model: KULLM-SFT-V2
- Reward Model: Polyglot-Ko-Reward-V3
## Training Options
![image](https://github.com/VAIV-2023/VAIV2023/assets/79634774/ae2cdfe5-7552-4009-a99a-244e79d945dc)
## RLHF Training
![image](https://github.com/VAIV-2023/VAIV2023/assets/79634774/3d4dbf68-5222-4f6a-a6d0-87ea176c5211)
- Training results show that the Reward, which measures the quality of the SFT model's responses, increases (the model generates responses that humans prefer more)
## RLHF Model Evaluation
![image](https://github.com/VAIV-2023/VAIV2023/assets/79634774/2b58ed3a-7ed5-4e60-ba4b-c9b291b1fdff)
![image](https://github.com/VAIV-2023/VAIV2023/assets/79634774/75b2a1ee-d7c0-4ba9-ab2f-727abab644e9)
## Final RLHF Model
- https://huggingface.co/Trofish/KULLM-RLHF
# Contributors 🙌
- 박성완 (Sungkyunkwan University, Department of Software, entering class of 2020, [email protected])
- 송현빈 (Sungkyunkwan University, Department of Software, entering class of 2020, [email protected])
- 허유민 (Sungkyunkwan University, Department of Software, entering class of 2021, [email protected])
- 홍여원 (Sungkyunkwan University, Department of Software, entering class of 2020, [email protected])