Blackroot committed
Commit a0e69de · verified · 1 Parent(s): c6d5726

Update README.md

Files changed (1): README.md (+14 -1)
README.md CHANGED
@@ -54,16 +54,29 @@ Alright. We're still doing the evolutionary thing, take the previous generation,
  # Base Folding
  We have competing objectives for the model. We want model diversity for interesting storytelling. That's really great. However, increasing model diversity also increases EOS diversity, and that's a problem, because we also want the model to be able to shut up when it wants to. The trouble is that model stock uses geometric interpolation: with lots of different EOS tokens, it's going to act a bit like averaging them. That's a huge problem, because the models can't disagree about which EOS to use. Hence, we're going to fold the bases together first, since they all share the same EOS token. Specifically, we're going to do a model stock merge like this:

+ Model Stock:
  - Bllossom/llama-3-Korean-Bllossom-70B
  - rinna/llama-3-youko-70B
  - hitachi-nlp/Llama-3.1-70B-FLDx2
- - (MergedHistLlama -- a custom model stock merge itself of multiple models; see [Version 0.2 for the details](https://huggingface.co/Blackroot/Mirai-70B-0.2))
+ - (MergedHistLlama -- a custom model stock merge itself of multiple models; see below)
  - tokyotech-llm/Llama-3.1-Swallow-70B-v0.1
  Base model:
  - meta-llama/Meta-Llama-3-70B

  All of these were folded into the base model, and they critically all agreed about the EOS token.

+ # What is the MergedHistLlama?
+ Well, it's more merging, specifically of six base models from PKU-Alignment. The particular base model was chosen at random.
+
+ Model Stock:
+ - PKU-Alignment/ProgressGym-HistLlama3-70B-C013-pretrain-v0.1
+ - PKU-Alignment/ProgressGym-HistLlama3-70B-C020-pretrain-v0.1
+ - PKU-Alignment/ProgressGym-HistLlama3-70B-C017-pretrain-v0.1
+ - PKU-Alignment/ProgressGym-HistLlama3-70B-C014-pretrain-v0.1
+ - PKU-Alignment/ProgressGym-HistLlama3-70B-C021-pretrain-v0.1
+ Base Model:
+ - PKU-Alignment/ProgressGym-HistLlama3-70B-C018-pretrain-v0.1
+
  # Model Stock is Really bad at this EOS thing
  I love the interest and diversity of model stock, but after trying many different merges, I realized I had to try something else. Specifically, TIES. Model stock and TIES are almost polar opposites. TIES acts as an amplifier: when models agree, the task vectors align and TIES strengthens those underlying weights, so the good things about the models get amplified, and the bad things get amplified too. Model stock smooths things out, averaging the weights across models. If smoothing is the issue, let's try amplifying. I had avoided TIES merges because I'm specifically trying to avoid some of the bad mannerisms of the base ablation nemo model. However, I tried it anyway. Wouldn't you know it, TIES preserved the EOS, and the model can actually shut up most of the time. Not only that, but the result is good. Quite good. The instruct is simply better than prior Mirais, and I don't think it's by a small margin either. There are some quirks, but I'm still able to run inference without any penalties and with the same sampling settings I've been running with. This was really surprising to me; I had not anticipated good results with TIES merging, but I'll eat my shoes now, it's good. The model is by no means perfect: there are some edge areas that end up producing strange outputs, and the model will occasionally insert commonly recurring phrases at the end of responses. However, overall, I like the result.

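As an aside on the # Base Folding section above: to see why EOS disagreement hurts so much under an averaging-style merge, here is a toy NumPy sketch. It is not the actual merge pipeline; it simply treats a model-stock-style merge as interpolation between the base and the mean of the fine-tunes (a simplification of the real angle-based weighting) and uses made-up lm_head rows to stand in for EOS logits.

```python
import numpy as np

# Toy "lm_head": 3 candidate EOS rows, hidden size 4. Values are invented.
rng = np.random.default_rng(0)
base = 0.1 * rng.normal(size=(3, 4))

def finetune_favoring(eos_id, strength=2.0):
    """Pretend fine-tune that strengthens one EOS row on top of the base."""
    w = base.copy()
    w[eos_id] += strength
    return w

# Three diverse fine-tunes, each pushing a *different* EOS token.
diverse = [finetune_favoring(0), finetune_favoring(1), finetune_favoring(2)]

# Simplified stand-in for model stock: interpolate between the base and the
# mean of the fine-tunes (the real method derives its weight from the angles
# between task vectors, but the averaging behaviour is what matters here).
t = 0.7
merged_diverse = (1 - t) * base + t * np.mean(diverse, axis=0)

# Same merge after "folding": all fine-tunes agree on EOS row 0.
folded = [finetune_favoring(0) for _ in range(3)]
merged_folded = (1 - t) * base + t * np.mean(folded, axis=0)

print("EOS row norms, disagreeing EOS:", np.linalg.norm(merged_diverse, axis=1).round(2))
print("EOS row norms, shared EOS:     ", np.linalg.norm(merged_folded, axis=1).round(2))
```

With three fine-tunes that each favor a different EOS row, every row ends up with roughly a third of the boost; when the bases are folded so they agree on one EOS first, that single row keeps its full strength, which is the whole point of the folding step.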
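And on the amplifier-versus-smoother contrast in the TIES section above, here is a small from-scratch sketch of the published TIES recipe (trim, elect sign, disjoint mean) next to plain averaging, on toy flat weight vectors. This illustrates the general algorithm, not the author's actual merge config or mergekit's implementation; `density` and `lam` are the usual trim-fraction and scaling knobs, named here for illustration.

```python
import numpy as np

def ties_merge(base, finetuned, density=0.5, lam=1.0):
    """Toy TIES merge on flat weight vectors: trim, elect sign, disjoint mean."""
    tasks = np.stack([w - base for w in finetuned])  # task vectors

    # 1. Trim: keep only the top-`density` fraction of each task vector by magnitude.
    k = max(1, int(round(density * tasks.shape[1])))
    trimmed = np.zeros_like(tasks)
    for i, tv in enumerate(tasks):
        top = np.argsort(np.abs(tv))[-k:]
        trimmed[i, top] = tv[top]

    # 2. Elect sign: per parameter, keep the sign with the larger total magnitude.
    elected = np.sign(trimmed.sum(axis=0))
    elected[elected == 0] = 1.0  # arbitrary tie-break for exactly cancelled params

    # 3. Disjoint mean: average only the entries that match the elected sign.
    mask = (np.sign(trimmed) == elected) & (trimmed != 0)
    counts = np.maximum(mask.sum(axis=0), 1)
    merged_task = (trimmed * mask).sum(axis=0) / counts

    return base + lam * merged_task

base = np.zeros(4)
a = np.array([1.0,  0.8, 0.0, 0.0])   # fine-tune A
b = np.array([1.0, -0.8, 0.0, 0.0])   # fine-tune B disagrees on parameter 1

print("plain average:", (a + b) / 2)               # [1.0, 0.0, 0.0, 0.0]
print("TIES merge:   ", ties_merge(base, [a, b]))  # [1.0, 0.8, 0.0, 0.0]
```

Where the fine-tunes agree (parameter 0), both methods keep the value; where they disagree (parameter 1), plain averaging cancels it toward zero, while TIES elects a sign and keeps the winning side at full strength, which is the behaviour you want from a single, decisive EOS.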