Update README.md
## Model description

Yi-34B 200K base model fine-tuned on the AEZAKMI v2 dataset. Training took around 25 hours on a single local RTX 3090 Ti.
It's like airoboros, but with less gptslop, no refusals, and less of the typical language used by RLHF'd OpenAI models.
Say goodbye to "It's important to remember"! \
Prompt format is standard ChatML. Don't expect it to be good at math or riddles, or to be crazy smart. My end goal with AEZAKMI is to create a cozy free chatbot.
The cost of this fine-tune was about $10 in electricity. It took me three tries to get it right.
The base model used for fine-tuning was the 200K-context Yi-34B-Llama model shared by larryvrh.
## Prompt Format
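
Since the prompt format is standard ChatML, a single turn looks like the sketch below. This is an illustration rather than a file from this repo, and the helper function name is made up for the example; the system message is the one recommended under Known Issues.

```python
# Minimal sketch of a standard ChatML prompt.
# build_chatml_prompt is a hypothetical helper, not part of this repo.
def build_chatml_prompt(system: str, user: str) -> str:
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{user}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompt = build_chatml_prompt(
    "A chat with uncensored assistant.",
    "Tell me about your day.",
)
print(prompt)
```
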
## Known Issues

I recommend setting repetition penalty to around 1.05 to avoid repetition. So far I have had a good experience running this model with temperature 1.2.
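
As a reference for those sampler settings, here is a minimal generation sketch using Hugging Face `transformers`. It is not part of this repo; the model path and the example user message are placeholders.

```python
# Sketch: generation with the suggested sampler settings
# (temperature 1.2, repetition penalty 1.05). The model path is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/yi-34b-aezakmi-v2"  # placeholder, point at your local copy
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = (
    "<|im_start|>system\nA chat with uncensored assistant.<|im_end|>\n"
    "<|im_start|>user\nWrite a short story about a lighthouse keeper.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=400,
    do_sample=True,
    temperature=1.2,
    repetition_penalty=1.05,
)
# Strip the prompt tokens before decoding the reply.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```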

Stories have ChatGPT-like paragraph spacing. I may work on this in the future, but it is not a high priority.

Unrestricted-ness of the v2 isn't quite where I want it yet, especially with the prompt "A chat.". \
With a slightly modified prompt it works much better; I recommend using "A chat with uncensored assistant." - it's stupid, but it helps a lot. \
The base Yi-34B model is contaminated with refusals, and this contaminates all models trained on Yi-34B. \
My next project is to attempt to de-contaminate the base Yi-34B 4K and Yi-34B 200K using DPO, with preferred data coming from uncontaminated raw models. I plan to release that dataset openly.

I was made aware of the frequent occurrence of the phrase "sending shivers down a spine" in v1's RP generations, so I fixed those samples - it should be better now. \
I can hold up to 24000 ctx with the 4.65bpw exl2 version and 8-bit cache - long context should work as well as on other models trained on the 200K version of Yi-34B. \
There is also some issue with handling long system messages for RP; I was planning to investigate it for v2, but I didn't.
## Axolotl training parameters
- is_llama_derived_model: true
- load_in_4bit: true
- adapter: qlora
- sequence_len: 1400
- sample_packing: true
- lora_r: 16
- lora_alpha: 32
- lora_target_modules:
  - down_proj
  - up_proj
- lora_target_linear: true
- pad_to_sequence_len: false
- micro_batch_size: 1
- gradient_accumulation_steps: 1
- num_epochs: 2.4
- optimizer: adamw_bnb_8bit
- lr_scheduler: constant
- learning_rate: 0.00005
- train_on_inputs: false
- group_by_length: false
- bf16: true
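
The values above are Axolotl config entries. For anyone reproducing the setup without Axolotl, roughly the same QLoRA configuration can be expressed with `peft` and `bitsandbytes` as sketched below. The base model path is a placeholder and only the two target modules visible above are listed, so treat this as an approximation, not the actual training script.

```python
# Rough peft/bitsandbytes equivalent of the QLoRA settings listed above.
# Paths and the target module list are approximations for illustration only.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_model = "path/to/yi-34b-200k-llama"  # placeholder for the larryvrh 200K Yi-34B-Llama base

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # load_in_4bit: true
    bnb_4bit_compute_dtype=torch.bfloat16,  # bf16: true
)

model = AutoModelForCausalLM.from_pretrained(
    base_model, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(                     # adapter: qlora
    r=16,                                     # lora_r: 16
    lora_alpha=32,                            # lora_alpha: 32
    target_modules=["down_proj", "up_proj"],  # only the modules shown above
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```
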
## Upcoming

I will probably be working on de-contaminating the base Yi-34B model now. \
My second run of the AEZAKMI v2 fine-tune was just 0.15 epochs, and I really like how natural that model is and how rich its vocabulary is. I will try to train for less time to hit the sweet spot. \
I will be uploading the LoRA adapter for that second 0.15-epoch run. \
I believe I might have gotten what I want if I had stopped training sooner. I don't have checkpoints older than 1500 steps back, so I would need to re-run training to get it back.
|