devngho/llama-ablation-large-korean-corpus_edu

A model pretrained with the Llama architecture. It was trained on about 20.7B tokens (about 34.5 epochs) using MaxText.

A checkpoint is provided every 500 steps.

This research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC). ⚡

Examples

The bold part is the input.

max_new_tokens: 500

Example 1 <s> 인공지능은 2015년에 전 세계에서 가장 빠른 속도로 발전하고 있다. 2015년에 100억 개가 넘는 인공지능 로봇이 개발되고 2020년에는 100억 개가 넘는 인공지능 로봇이 개발될 것으로 예상된다. 2020년에는 100억 개가 넘는 인공지능 로봇이 개발될 것으로 전망된다. 2020년에는 100억 개가 넘는 인공지능 로봇이 개발될 것으로 예상된다.</s>

Example 2 <s> 한글의 특징은 한글이 만들어지기 이전의 문자였다는 것이다. 한글은 1443년에 창제된 훈민정음의 창제 이후부터 1907년에 이르기까지 약 250년 동안에 걸쳐서 만들어졌다. 한글은 1443년에 창제된 훈민정음의 반포와 함께 1446년에 반포된 훈민정음, 곧 한글의 기원을의 소릿값으로 보고 있다. 한글은 1443년에 창제된 훈민정음의 창제 이후부터 1517년에 반포된 훈민정음, 곧 한글의 기원을의 소릿값으로 보고 있다.</s>

Example 3 <s> 커피는 17세기경부터 유럽 각국에서 커피를 마셨고, 18세기 말에는 영국과 프랑스에서 커피를 마시게 되었다. 19세기 초에는 영국에서 커피가 대량으로 수입되었다. 19세기 초에는 영국에서 커피가 대량으로 수입되었다. 19세기 초에는 영국에서 커피가 대량으로 수입되었다.</s>

The outputs show considerable hallucination, awkward phrasing, and repetition.
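One rough way to put a number on the repetition is to count how many sentences in a generation exactly repeat an earlier sentence. The sketch below is only an illustration, not an evaluation method used for this model: `repeated_sentence_ratio` is a hypothetical helper, and splitting on periods is a simplifying assumption.

```python
def repeated_sentence_ratio(text: str) -> float:
    """Fraction of sentences that exactly repeat an earlier sentence."""
    # Drop the <s> and </s> markers shown in the examples, then split on periods.
    cleaned = text.replace("<s>", "").replace("</s>", "")
    sentences = [s.strip() for s in cleaned.split(".") if s.strip()]
    if not sentences:
        return 0.0
    seen = set()
    repeats = 0
    for sentence in sentences:
        if sentence in seen:
            repeats += 1
        seen.add(sentence)
    return repeats / len(sentences)


# Example 3 from above, copied verbatim.
example_3 = (
    "<s> 커피는 17세기경부터 유럽 각국에서 커피를 마셨고, "
    "18세기 말에는 영국과 프랑스에서 커피를 마시게 되었다. "
    "19세기 초에는 영국에서 커피가 대량으로 수입되었다. "
    "19세기 초에는 영국에서 커피가 대량으로 수입되었다. "
    "19세기 초에는 영국에서 커피가 대량으로 수입되었다.</s>"
)
print(repeated_sentence_ratio(example_3))  # 2 of the 4 sentences are repeats
```

Exact matching misses near-duplicates such as the sentences in Example 1 that differ by a single word, so this figure is a lower bound on the repetition a reader would notice.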

Details

Made by: devngho
Language: ko
License: mit

Training details

learning_rate: 6e-4 (cosine, initial/end 6e-5)
warmup_ratio: 0.05
batch_size: 1024(fsdp 16 * per device 8 * ga 8)
optimizer: adamw(b1=0.9, b2=0.95, eps=1e-5, weight_decay=0.01)
duration: about 27h 50m
steps: 10000
The full configuration and results are available on wandb.
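To make the schedule in the list above concrete, here is a minimal sketch of a warmup-plus-cosine learning rate curve. It assumes that "initial/end 6e-5" means the rate starts and finishes at 6e-5, that warmup is linear, and that warmup covers 5% of the 10,000 steps (500 steps). The exact schedule implemented in the MaxText fork may differ.

```python
import math

PEAK_LR = 6e-4
INIT_LR = 6e-5   # assumed starting value ("initial")
END_LR = 6e-5    # assumed final value ("end")
TOTAL_STEPS = 10_000
WARMUP_STEPS = round(0.05 * TOTAL_STEPS)  # warmup_ratio 0.05 -> 500 steps


def learning_rate(step: int) -> float:
    """Linear warmup (an assumption) followed by cosine decay to END_LR."""
    if step < WARMUP_STEPS:
        return INIT_LR + (PEAK_LR - INIT_LR) * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return END_LR + (PEAK_LR - END_LR) * cosine


for step in (0, 250, 500, 5_250, 10_000):
    print(step, f"{learning_rate(step):.2e}")
```

Under these assumptions the rate peaks at step 500 and returns to 6e-5 at the final step.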

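Training devices

TPU v4-32

Training datasets

Data from AI Hub and Modu Corpus (모두의말뭉치) was deduplicated and length-filtered, then scored with devngho/ko_edu_classifier_v2_nlpai-lab_KoE5; rows scoring 3 points or higher were used (about 8%, 1,354,234 rows).

The dataset cannot be released because of the AI Hub and Modu Corpus terms, but if you prepare the raw data, you can preprocess it in the same way with devngho/dataset-preprocess. Classifier filtering must be run separately.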

Software

jax==0.4.35

devngho/MaxText, a fork of MaxText

Training results

learning/loss: 1.6112642288208008
eval/avg_loss: 2.0766192864296023

Benchmark results are provided below.

devngho/llama-ablation-large-korean-corpus_edu

A model pretrained with the Llama architecture. It was trained on about 20.7B tokens (approximately 34.5 epochs) using MaxText.

A checkpoint is available every 500 steps.

This research was supported with Cloud TPUs from Google's TPU Research Cloud (TRC). ⚡

Details

Made by: devngho
Language: ko
License: mit

Training details

learning_rate: 6e-4 (cosine, initial/end 6e-5)
warmup_ratio: 0.05
batch_size: 1024(fsdp 16 * per device 8 * ga 8)
optimizer: adamw(b1=0.9, b2=0.95, eps=1e-5, weight_decay=0.01)
duration: about 27h 50m
steps: 10000
The full configuration and training results are available on wandb.
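The batch size breakdown and the token budget can be cross-checked with a little arithmetic. The sketch below uses only figures stated in this card and reads "ga" as gradient accumulation, which is an assumption. The sequence length is not stated here, so the tokens-per-sequence value is an estimate derived from rounded numbers, not a configured setting.

```python
FSDP_DEVICES = 16        # "fsdp 16"
PER_DEVICE_BATCH = 8     # "per device 8"
GRAD_ACCUM_STEPS = 8     # "ga 8", read here as gradient accumulation (assumption)
STEPS = 10_000
TOTAL_TOKENS = 20.7e9    # about 20.7B tokens
EPOCHS = 34.5            # approximately 34.5 epochs

global_batch = FSDP_DEVICES * PER_DEVICE_BATCH * GRAD_ACCUM_STEPS
sequences_seen = global_batch * STEPS
tokens_per_sequence = TOTAL_TOKENS / sequences_seen  # rough estimate only
tokens_per_epoch = TOTAL_TOKENS / EPOCHS             # rough estimate only

print(global_batch)                  # 1024, matching batch_size
print(sequences_seen)                # 10240000
print(round(tokens_per_sequence))    # roughly 2021 tokens per sequence
print(f"{tokens_per_epoch:.2e}")     # roughly 6.00e+08 tokens per epoch
```

The product 16 * 8 * 8 reproduces the stated batch size of 1024, and the token figures imply roughly 0.6B tokens of unique training data per epoch.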

Training devices

TPU v4-32

Training datasets

I applied deduplication and length filtering to a corpus from AI Hub and Modu Corpus, then kept only the rows (about 8%, 1,354,234 rows) that scored 3 points or higher when evaluated with devngho/ko_edu_classifier_v2_nlpai-lab_KoE5.

I could not make the training dataset public because of the terms of AI Hub and Modu Corpus. If you obtain the raw data, you can reproduce the preprocessing used for this model with devngho/dataset-preprocess. Filtering with the edu classifier is a separate step that you must apply yourself.
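The order of operations described above (deduplicate, filter by length, keep rows scoring 3 or higher) can be sketched as follows. This is not the code in devngho/dataset-preprocess: the exact-match dedup, the length bounds, and the `toy_score` stand-in are all assumptions made for illustration, and the real scores would come from the classifier.

```python
from typing import Callable, Iterable, Iterator

SCORE_THRESHOLD = 3.0  # keep rows scoring 3 points or higher


def filter_corpus(
    rows: Iterable[str],
    score_fn: Callable[[str], float],
    min_chars: int = 20,       # hypothetical bound, not from the model card
    max_chars: int = 100_000,  # hypothetical bound, not from the model card
) -> Iterator[str]:
    """Yield rows that survive dedup, length filtering, and score filtering."""
    seen = set()
    for text in rows:
        if text in seen:  # exact-match dedup (an assumption)
            continue
        seen.add(text)
        if not (min_chars <= len(text) <= max_chars):
            continue
        if score_fn(text) >= SCORE_THRESHOLD:
            yield text


def toy_score(text: str) -> float:
    """Stand-in scorer for the demo; real scores come from the edu classifier."""
    return 4.0 if "education" in text else 1.0


rows = [
    "A long passage about education and learning methods.",
    "A long passage about education and learning methods.",
    "Too short.",
    "A long passage of unrelated advertising copy, repeated.",
]
kept = list(filter_corpus(rows, toy_score))
print(kept)
```

Running the cheap dedup and length checks before the scorer keeps the number of classifier calls down, which matters when only about 8% of rows pass the final threshold.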

Software

jax==0.4.35

devngho/MaxText, a fork of MaxText

Training results

learning/loss: 1.6112642288208008
eval/avg_loss: 2.0766192864296023
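For readers who prefer perplexity, the losses above can be converted with a single exponential. This sketch assumes both values are mean per-token cross-entropy in nats, which the card does not state explicitly.

```python
import math

train_loss = 1.6112642288208008  # learning/loss
eval_loss = 2.0766192864296023   # eval/avg_loss

# Perplexity = exp(mean cross-entropy), assuming the loss is in nats per token.
train_ppl = math.exp(train_loss)
eval_ppl = math.exp(eval_loss)

print(f"train perplexity: {train_ppl:.2f}")
print(f"eval perplexity:  {eval_ppl:.2f}")
```

Under that assumption the training perplexity is about 5.0 and the evaluation perplexity about 8.0. A gap like this is what one would expect after roughly 34.5 passes over the same data.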
