Blackroot committed
Commit dbd7a30 · verified · 1 Parent(s): eec78a1

Update README.md

Files changed (1): README.md (+4 -4)
README.md CHANGED
@@ -43,7 +43,7 @@ One note here, I wasn't really sure how to state this in the huggingface tags. T
 
 
 # Why a different approach?
- As some users had noted, particularly thanks to |GodZio| and The-Istar, the previous Mirai's instruct format was very unclear. Infact, when testing Llama-3 instruct format it seemed just broken, and, it was. Why? Well, the issue was with merging multiple models with different stopping tokens. I'll leave a tecnical explanation below for my assumption about why this happened. The long story short, I changed strategies for this model. It's very different, and expects the Llama-3 format to be used.
+ As some users noted, with particular thanks to |GodZio| and The-Istar, the previous Mirai's instruct format was very unclear. In fact, when testing the Llama-3 instruct format it seemed outright broken, and it was. Why? The issue was with merging multiple models that use different stopping tokens. I'll leave a technical explanation of my assumption about why this happened below. Long story short, I changed strategies for this model. It's very different, and expects the Llama-3 format to be used.
 
 # Possible cause of the issue (Technical)
 Llama-3 instruct alone has two distinct EOS tokens, and models like Hermes have their own EOS. What appears to have happened is that all of the EOS tokens effectively lost weighting because the weight mass spread across the different EOS tokens, so the model did not know which EOS token to produce. Merge enough different models like this, and you end up with no EOS token ever being generated. There are other issues at play: Hermes has a different number of tokens, so the Hermes EOS does not actually make it into the merge at all, meaning models like Hermes effectively erase the EOS when merging against smaller heads. The puzzling part is why the Llama-3 format was apparently so disproportionately affected by the merge. I don't have a clear answer for that at all.
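To make the dilution argument concrete, here is a minimal toy sketch of it, not the actual merge code: the vocabulary size, token ids, and logit values are all invented for illustration (in real Llama-3 the stop tokens are <|eot_id|> and <|end_of_text|>). Three hypothetical models each put their stopping mass on a different EOS token, and a naive average ends up preferring an ordinary content token over every EOS, which is exactly the never-stopping behavior described above.

```python
# Toy illustration of EOS dilution under naive averaging. All values here
# are invented; averaging logits stands in for averaging output-head weights.
import numpy as np

VOCAB = 10
EOS_A, EOS_B, EOS_C = 7, 8, 9  # each hypothetical model stops with a different token
WORD = 3                       # an ordinary content token shared by all three

def model_logits(eos_id: int) -> np.ndarray:
    """One model's output logits at a point where it wants to stop."""
    logits = np.zeros(VOCAB)
    logits[eos_id] = 6.0  # strong preference for this model's own EOS
    logits[WORD] = 4.0    # mild preference for continuing with a word
    return logits

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

merged = (model_logits(EOS_A) + model_logits(EOS_B) + model_logits(EOS_C)) / 3
probs = softmax(merged)

print("content token prob:", round(float(probs[WORD]), 3))   # ~0.66, wins
print("each EOS prob:     ", round(float(probs[EOS_A]), 3))  # ~0.09, diluted
# Alone, every model would stop here (EOS logit 6 beats 4). Averaged, each
# EOS logit falls to 2 while the shared word keeps 4 -- no EOS is emitted.
```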
@@ -80,10 +80,10 @@ Base Model:
 - PKU-Alignment/ProgressGym-HistLlama3-70B-C018-pretrain-v0.1
 
 # Model Stock is Really bad at this EOS thing
- I love the interest and diversity of model stock, but after trying many different merges, I realized that I had to try something else. Specifically, TIES. Model stock and TIES are almost polar opposites. TIES acts as an amplifier, when models agree, the task vectors align, TIES strengthens those underlying weights, this means things that are good about the model get amplified, things that are bad get amplified. Model stock smoothes things out, smoothing the weights out between models. If smoothing is an issue, lets try amplifying. I've avoided TIES merges, because I'm specifically trying to avoid some of the bad mannerisms of the base ablation nemo model. However, I tried it anyways. Wouldn't you know it, TIES preserved the EOS, and can actually shut up most of the time. Not only that, but the model result is good. Quite good. The instruct is simply better than prior Mirai's and I don't think it's by a small margin either. There's some quirks, but I'm still able to run inference without any penalties and with the same sampling settings I've been running with. This was really surprising to me, I had not anticipated good results with TIES merging, but I'll eat my shoes now, it's good. The model is by no means perfect, there's some edge areas that end up in strange outputs, and the model occassionaly will insert phrases that appeared commonly into the end of responses. However, overall, I like the result.
+ I love the interest and diversity of model stock, but after trying many different merges, I realized I had to try something else. Specifically, TIES. Model stock and TIES are almost polar opposites. TIES acts as an amplifier: when models agree and their task vectors align, TIES strengthens those underlying weights, which means the good parts of the models get amplified, and so do the bad parts. Model stock instead smooths the weights out between models. If smoothing is the issue, let's try amplifying. I had avoided TIES merges because I'm specifically trying to avoid some of the bad mannerisms of the base ablation Nemo model, but I tried it anyway. Wouldn't you know it, TIES preserved the EOS, and the model can actually shut up most of the time. Not only that, but the result is good. Quite good. The instruct is simply better than prior Mirais, and I don't think it's by a small margin either. There are some quirks, but I'm still able to run inference without any penalties and with the same sampling settings I've been running with. This really surprised me; I had not anticipated good results with TIES merging, but I'll eat my shoes now, it's good. The model is by no means perfect: some edge areas end up in strange outputs, and the model will occasionally insert commonly repeated phrases at the end of responses. Overall, though, I like the result. (A toy sketch contrasting the two merge strategies follows this hunk.)
 
 # Ties is Really, Really slow, And also Evolutions or Something
- Ties takes something like 10-15x longer than model stock. It has to calculate a bunch of fancy vectors and directions, and this is slow. In practice, this means it's even slower for me to iterate on evolutions. Speaking of, which evolution is this? Well, this is where it gets weird, because the previous evolution was 13, but all of these were done as model stock merges. I just decide to switch to TIES out of the blue. This means that bascially none of the other evolutions were relevant in the analysis of this one, since changing the merging strategy basically rebases the whole project. Therefore, this is sort of evolution 1, despite having many models incorporated already. I'll be calling it evolution 1, but just know this was the actual reality of the situation.
+ TIES takes something like 10-15x longer than model stock. It has to calculate a bunch of fancy vectors and directions, and that is slow. In practice, this means it's even slower for me to iterate on evolutions. Speaking of which, which evolution is this? This is where it gets weird: the previous evolution was 13, but all of those were done as model stock merges, and I decided to switch to TIES out of the blue. That means basically none of the other evolutions are relevant to the analysis of this one, since changing the merging strategy essentially rebases the whole project. So this is sort of evolution 1, despite having many models incorporated already. I'll be calling it evolution 1; just know that this was the actual reality of the situation.
 
 # Merging is Hard to Direct
 Now, I never thought I'd be the one saying this, but here I am: merging is effective. It's not a free lunch, but it's certainly edible. There are issues though, and some of them probably can't be solved by merging. In the following sections, let me go over some thoughts on what materially does not improve when merging.
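To ground the model stock versus TIES comparison above, here is a toy sketch of the core difference on a handful of task-vector entries (deltas from a shared base). It caricatures both methods: real model stock derives per-layer interpolation ratios from the geometry between task vectors, and real TIES first trims each task vector to its largest-magnitude entries. The invented numbers only demonstrate the sign-election step that lets TIES keep consensus directions strong where plain averaging washes them out.

```python
# Toy contrast of a smoothing (averaging) merge vs TIES-style sign election.
# Each row is one fine-tune's task vector (fine-tune minus base); invented values.
import numpy as np

base = np.zeros(4)
tvs = np.array([
    [0.9,  0.1, -0.2,  0.5],
    [0.8, -0.3,  0.1, -0.6],   # entry 3: this model pulls the other way
    [1.0,  0.2,  0.0,  0.4],
])

# Smoothing merge: a plain mean, so disagreement drags entries toward zero.
averaged = base + tvs.mean(axis=0)

# TIES-style merge: elect a sign per entry by total magnitude, then take the
# "disjoint mean" over only the contributions that agree with that sign.
elected = np.sign(tvs.sum(axis=0))
agrees = np.sign(tvs) == elected
counts = np.maximum(agrees.sum(axis=0), 1)  # avoid dividing by zero
ties = base + np.where(agrees, tvs, 0.0).sum(axis=0) / counts

print("averaged:", averaged)  # entry 3: (0.5 - 0.6 + 0.4) / 3 = 0.1
print("ties:    ", ties)      # entry 3: (0.5 + 0.4) / 2 = 0.45
# Where all models agree (entry 0), both methods keep the full weight (~0.9).
# Where they disagree (entry 3), averaging washes the direction out while
# TIES keeps the majority direction -- amplifying good and bad alike.
```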
@@ -104,7 +104,7 @@ Do models have a personality? What about a superposition of personalities? Peopl
 Now, you're going to say: what are you on about? How could you have two sections for this? Here's the thing: some parts of the model's personality are dictated by the dataset. Others, however, come from guard rails in the RLHF steps. RLHF changes a model's personality (air quotes or whatever) much more strongly, and there's a "true persona" underneath the roleplay that is essentially a guard-rail persona. This is the moralizer, the flowery poet, and the ghost in the shell. If we can make any claims about a model's actual personality, it's this one, the one that resulted from directive RLHF, and it often surfaces as totally out-of-place sentiments like "we should respect equality" in the middle of fighting a dragon. These sentiments are misplaced, and my own view is that RLHF guardrails do not transition well to roleplay settings. Luckily, some of this is reduced by merging; notably, I find model stock really reduces these tendencies by regularizing across base models. But then we have the EOS issue. Overall, this is a hard one to get the tradeoffs right, but maybe you can.
 
 # So What?
- All these problems, how to fix? Well, fine tuning can help. Particularly, it will love areas of low coherence. But gathering, forming, and just the dataset work alone is really time consuming. I've done small-scale tunes for smaller models, but doing this on a real scale is just a huge problem. Principal among these issues, is that training huge models takes an insane amount of compute, and you'll likely need to try several times to get it right. That said, I've begin the arduous task of finding data, getting good regularization sets, and trying to isolate the areas of low coherence so I can dig up good examples of these for the models. So, maybe in the future some actual fine tunes will come out, but probably not until the next gen models are released. Data is fickle, getting the right balances are difficult, training over instruct tunes is very destructive, there's lots of real problems with this approach. However, after seeing the results of so many merges, I have some hope a combination of merging and fine tuning might solve some of the more eggregious issues I have with the current gen models.
+ All these problems, how to fix them? Well, fine tuning can help; in particular, it can target areas of low coherence. But the gathering, forming, and general dataset work alone is really time consuming. I've done small-scale tunes for smaller models, but doing this at real scale is just a huge problem. Principal among the issues is that training huge models takes an insane amount of compute, and you'll likely need several attempts to get it right. That said, I've begun the arduous task of finding data, building good regularization sets, and trying to isolate the areas of low coherence so I can dig up good examples of them for the models. So maybe some actual fine tunes will come out in the future, but probably not until the next-gen models are released. Data is fickle, getting the balances right is difficult, and training over instruct tunes is very destructive; there are lots of real problems with this approach. However, after seeing the results of so many merges, I have some hope that a combination of merging and fine tuning might solve some of the more egregious issues I have with the current-gen models.
 
 # Thinking about things, and actually doing the things
 I'm on Discord often. If you made it this far and would like to talk models, datasets, or anything else, I enjoy discussing all things machine learning, not just LLMs; there's a Discord link under the mascot banner if you'd like to come chat.