---
language:
- ko
- en
library_name: transformers
tags:
- trl
- sft
widget:
- text: 안녕
---
# Yokhal (욕쟁이 할머니)
<!-- Provide a quick summary of what the model is/does. -->
A Korean chatbot based on Google's Gemma
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
- **Fine-tuned by:** Seonglae Cho
- **Model type:** Gemma
- **Language(s) (NLP):** Korean, English
- **Finetuned from model:** [Gemma-2b-it](https://huggingface.co/google/gemma-2b-it)
### Model Sources
<!-- Provide the basic links for the model. -->
- **Repository:** https://github.com/seonglae/yokhal
- **Demo:** https://huggingface.co/spaces/seonglae/yokhal
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
### Direct Use
A Korean chatbot that reflects Korean internet culture
### Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
## How to Get Started with the Model
Use the code below to get started with the model.
```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "seonglae/yokhal-md"
device = None  # set to e.g. "cuda:0" to pin a single device

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto" if device is None else device,
    attn_implementation="flash_attention_2",  # remove if flash-attn is not installed
)

sys_prompt = '한국어로 대답해'  # "Answer in Korean"
texts = ['안녕', '서울은 오늘 어때']  # "Hi", "How is Seoul today?"
# Gemma's template has no system role, so prepend the system prompt to each user turn
chats = [[{'role': 'user', 'content': f'{sys_prompt}\n{t}'}] for t in texts]
prompts = [tokenizer.apply_chat_template(c, tokenize=False, add_generation_prompt=True) for c in chats]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to("cuda" if device is None else device)
outputs = model.generate(**inputs, max_new_tokens=100, repetition_penalty=1.05)
for output in outputs:
    print(tokenizer.decode(output, skip_special_tokens=True), end='\n\n')
```
## Training Details
Trained on 2 × RTX 3090 GPUs.
[More information in the GitHub source code](https://github.com/seonglae/yokhal/blob/master/train.py)
### Training Data
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
[More Information Needed]
### Training Procedure
1. Weights initialized from a checkpoint trained on an internet-comments dataset
2. Continued pretraining on the Korean Namuwiki dataset up to step 80,000; the main branch holds the 30,000-step checkpoint, since a repetition issue appears beyond that point (configuration sketched below)
   - `seq_length` 1024 with dataset packing
   - `batch` 3 per device
   - `lr` 1e-5
   - `optim` adafactor
3. Instruction tuning on a Korean instruction dataset with QLoRA, not on the main branch (also sketched below)
   - `seq_length` 2048
   - `lr` 2e-4
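Below is a minimal sketch of how these two stages could be reproduced with TRL's `SFTTrainer`, assuming a TRL version that accepts `max_seq_length` and `packing` directly (newer releases move these into `SFTConfig`). The dataset files, output paths, text column name, and LoRA rank/alpha are placeholders for illustration, not the actual setup; see `train.py` in the repository for the real configuration.
```py
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from trl import SFTTrainer

# Stage 2: continued pretraining on raw Namuwiki text
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it", torch_dtype=torch.bfloat16)
corpus = load_dataset("text", data_files="namuwiki.txt")["train"]  # placeholder corpus
trainer = SFTTrainer(
    model=model,
    args=TrainingArguments(
        output_dir="out-namuwiki",  # assumed path
        per_device_train_batch_size=3,
        learning_rate=1e-5,
        optim="adafactor",
        max_steps=80_000,
    ),
    train_dataset=corpus,
    dataset_text_field="text",  # assumed column name
    max_seq_length=1024,
    packing=True,  # dataset packing, as listed above
)
trainer.train()

# Stage 3: QLoRA instruction tuning (4-bit base + LoRA adapters);
# rank and alpha here are illustrative, not taken from the model card
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("out-namuwiki", quantization_config=bnb, device_map="auto")
trainer = SFTTrainer(
    model=model,
    args=TrainingArguments(output_dir="out-instruct", learning_rate=2e-4),
    train_dataset=corpus,  # swap in a Korean instruction dataset here
    dataset_text_field="text",
    max_seq_length=2048,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),
)
trainer.train()
```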
#### Preprocessing
Gemma's chat template does not support an explicit system role, so I trained by prepending the system prompt to the first user message, as below:
```py
# Fold the system prompt into the first user turn, since Gemma has no system role
if chat[0]['role'] == 'system':
    chat[1]['content'] = f"{chat[0]['content']}\n{chat[1]['content']}"
    chat = chat[1:]
prompt = tokenizer.apply_chat_template(chat, tokenize=False)
```
[Source Code](https://github.com/seonglae/yokhal/blob/master/yokhal/adapt.py)
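For example, a chat that starts with a system turn collapses into a single user turn (a hypothetical snippet reusing the prompts from the quick-start example above):
```py
chat = [
    {'role': 'system', 'content': '한국어로 대답해'},  # "Answer in Korean"
    {'role': 'user', 'content': '안녕'},  # "Hi"
]
# After the folding step above, the chat becomes:
# [{'role': 'user', 'content': '한국어로 대답해\n안녕'}]
```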
#### Training Hyperparameters
- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
#### Speeds, Sizes, Times [optional]
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
[More Information Needed]
## Evaluation
<!-- This section describes the evaluation protocols and provides the results. -->
### Testing Data, Factors & Metrics
#### Testing Data
<!-- This should link to a Dataset Card if possible. -->
[More Information Needed]
#### Factors
<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
[More Information Needed]
#### Metrics
<!-- These are the evaluation metrics being used, ideally with a description of why. -->
[More Information Needed]
### Results
[More Information Needed]
#### Summary