---
base_model:
  - huihui-ai/Llama-3.1-Nemotron-70B-Instruct-HF-abliterated
  - Sao10K/L3.3-70B-Euryale-v2.3
  - yentinglin/Llama-3-Taiwan-70B-Instruct
  - PKU-Baichuan-MLSystemLab/Llama3-PBM-Nova-70B
  - rinna/llama-3-youko-70B
  - Bllossom/llama-3-Korean-Bllossom-70B
  - hitachi-nlp/Llama-3.1-70B-FLDx2
  - tokyotech-llm/Llama-3.1-Swallow-70B-v0.1
  - ProgressGym-HistLlama3-70B-C018-pretrain-v0.1
  - ProgressGym-HistLlama3-70B-C013-pretrain-v0.1
  - ProgressGym-HistLlama3-70B-C020-pretrain-v0.1
  - ProgressGym-HistLlama3-70B-C017-pretrain-v0.1
  - ProgressGym-HistLlama3-70B-C014-pretrain-v0.1
  - ProgressGym-HistLlama3-70B-C021-pretrain-v0.1
tags:
  - roleplay
  - experimental
  - indie
  - merge
---

Mirai-3.0-70B


New base model, and this one actually expects you to use the Llama 3 Instruct format. There's a lot to talk about here, and this model is TOTALLY different from the previous version, for a variety of reasons.

This is evolution 1. Yes, I know that makes no sense; I explain it in my rant down below. I'm going to list the recipe for the model now, but know that the reality is more complex than just this:

Model Architecture

This is a Model Stock and TIES merge. Thanks to mergekit for making this really easy to do.

Stock for the "True Merge" -- this final step was a TIES merge; the reasoning for using TIES over Model Stock this time is explained below, although Model Stock was also used along the way:

  • PKU-Baichuan-MLSystemLab/Llama3-PBM-Nova-70B
  • yentinglin/Llama-3-Taiwan-70B-Instruct
  • Sao10K/L3.3-70B-Euryale-v2.3
  • (Custom Base Model-Stock Soup -- Recipe Below)
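For readers unfamiliar with mergekit, the recipe above would translate into a config roughly like the following, expressed here as a Python dict mirroring mergekit's YAML fields. The field names follow mergekit conventions, but the density/weight values and the assumption that the folded base soup served as the TIES `base_model` are mine, not the actual config used:

```python
# Hypothetical sketch of the TIES recipe in mergekit's config schema.
# Densities and weights are placeholder values, not the ones actually used.
ties_recipe = {
    "merge_method": "ties",
    "base_model": "custom-base-model-stock-soup",  # the folded base described below
    "models": [
        {"model": "PKU-Baichuan-MLSystemLab/Llama3-PBM-Nova-70B",
         "parameters": {"density": 0.5, "weight": 0.33}},
        {"model": "yentinglin/Llama-3-Taiwan-70B-Instruct",
         "parameters": {"density": 0.5, "weight": 0.33}},
        {"model": "Sao10K/L3.3-70B-Euryale-v2.3",
         "parameters": {"density": 0.5, "weight": 0.33}},
    ],
    "dtype": "bfloat16",
}
```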

One note here: I wasn't really sure how to state this in the Hugging Face tags. This model is actually THREE different merges. There's a base history merge, which was rolled into a base model merge, and then, as you can see, the bases were merged with the instruct models. Whew. I tried to give a thorough overview of model contributions, but not all of them contribute to the "final" merge directly.

Why a different approach?

As some users had noted (particular thanks to |GodZio| and The-Istar), the previous Mirai's instruct format was very unclear. In fact, when testing the Llama-3 instruct format it seemed just broken, and it was. Why? The issue was with merging multiple models that have different stopping tokens. I'll leave a technical explanation below with my assumption about why this happened. Long story short, I changed strategies for this model. It's very different, and it expects the Llama-3 format to be used.

Possible cause of the issue (Technical)

Llama-3 instruct alone has two distinct EOS tokens, and models like Hermes have their own EOS. What appears to have happened is that ALL of the EOS tokens lost weighting, because the weight got spread across the different EOS tokens, and the model no longer knew which EOS token to produce. Merge enough different models like this and you end up with no EOS token ever being generated. There are other issues at play: Hermes has a different number of tokens, so the Hermes EOS does not actually make it into the merge at all, meaning models like Hermes effectively erase the EOS when merging against smaller heads. The puzzling part is why the Llama-3 format was so disproportionately affected by the merge. I don't have a clear answer for that at all.
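The dilution can be sketched with toy numbers. Treat each model's output head as a matrix with one strong row for its own EOS token; a naive uniform average (a stand-in for the interpolation behavior) spreads that strength across three different rows, so no single EOS dominates. All numbers here are hypothetical:

```python
import numpy as np

# Toy output heads: 3 models, vocab of 6 tokens, hidden size 4.
# Each model puts a strong row on *its own* EOS token (hypothetical numbers).
rng = np.random.default_rng(0)
vocab, hidden = 6, 4
eos_ids = [3, 4, 5]  # each model favors a different EOS token id

heads = []
for eos in eos_ids:
    head = rng.normal(0.0, 0.05, size=(vocab, hidden))
    head[eos] = 2.0  # strong EOS row for this model only
    heads.append(head)

merged = np.mean(heads, axis=0)  # naive uniform average of the heads

# After merging, each EOS row keeps only about a third of its original
# strength, so no single EOS token dominates and generation may never stop.
for head, eos in zip(heads, eos_ids):
    print(eos, float(np.linalg.norm(head[eos])), float(np.linalg.norm(merged[eos])))
```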

New strategy (Still Evolutionary)

Alright. We're still doing the evolutionary thing: take the previous generation and either add or remove a model. However, this version is a bit of a rollercoaster. I spent lots of time with Model Stock trying to ensure I could preserve the EOS token, and it essentially boiled down to keeping only a handful of models, with a very sub-par result. I really was not a fan. So much so that I changed directions.

Base Folding

We have competing objectives for the model. We want model diversity for interesting storytelling. That's really great. However, increasing model diversity also increases EOS diversity, and that's a problem, because we also want the model to be able to shut up when it wants to. Model Stock uses geometric interpolation, so with lots of different EOS tokens it acts a bit like averaging them. This is a huge problem: we can't afford to disagree about which EOS to use. Hence, we're going to fold the bases together, which all have the same EOS tokens. Specifically, we're going to do a Model Stock merge like this:

  • Bllossom/llama-3-Korean-Bllossom-70B
  • rinna/llama-3-youko-70B
  • hitachi-nlp/Llama-3.1-70B-FLDx2
  • (MergedHistLlama -- itself a custom Model Stock merge of multiple ProgressGym history models; see Version 0.2 for the details)
  • tokyotech-llm/Llama-3.1-Swallow-70B-v0.1

Base model:

  • meta-llama/Meta-Llama-3-70B

All of these were folded into the base model, and they critically all agreed about the EOS token.
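The same toy setup shows why agreement matters: when every folded model carries the identical EOS row, a uniform average passes it through at full strength (again, hypothetical numbers):

```python
import numpy as np

# Same toy setup as before, but every model agrees token 3 is the EOS.
rng = np.random.default_rng(1)
vocab, hidden, eos = 6, 4, 3

heads = []
for _ in range(3):
    head = rng.normal(0.0, 0.05, size=(vocab, hidden))
    head[eos] = 2.0  # identical EOS row in every model
    heads.append(head)

merged = np.mean(heads, axis=0)

# Agreement means the EOS row survives the average untouched.
print(float(np.linalg.norm(merged[eos])))  # 4.0, same as each source model
```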

Model Stock is Really bad at this EOS thing

I love the interest and diversity of Model Stock, but after trying many different merges, I realized I had to try something else. Specifically, TIES. Model Stock and TIES are almost polar opposites. TIES acts as an amplifier: when models agree, their task vectors align, and TIES strengthens those underlying weights. Things that are good about the models get amplified, and things that are bad get amplified. Model Stock smoothes things out, interpolating the weights between models. If smoothing is the issue, let's try amplifying.

I had avoided TIES merges because I was specifically trying to avoid some of the bad mannerisms of the base abliterated Nemotron model. However, I tried it anyway. Wouldn't you know it, TIES preserved the EOS, and the model can actually shut up most of the time. Not only that, but the result is good. Quite good. The instruct is simply better than prior Mirais', and I don't think it's by a small margin either. There are some quirks, but I'm still able to run inference without any penalties and with the same sampling settings I've been using. This really surprised me; I had not anticipated good results with TIES merging, but I'll eat my shoes now: it's good. The model is by no means perfect. There are some edge areas that end up with strange outputs, and the model will occasionally insert commonly-seen phrases at the end of responses. However, overall, I like the result.
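The amplify-versus-smooth contrast can be made concrete with a toy TIES pass: trim each task vector to its largest components, elect a per-weight sign by summed magnitude, then average only the components that agree with the elected sign. This is a simplified sketch of the TIES-Merging idea, not mergekit's actual implementation:

```python
import numpy as np

def ties_merge(base, models, density=0.5):
    """Toy per-tensor TIES: trim, elect sign, merge agreeing components."""
    tvs = [m - base for m in models]                  # task vectors
    trimmed = []
    for tv in tvs:
        k = int(np.ceil(density * tv.size))
        thresh = np.sort(np.abs(tv).ravel())[-k]      # keep top-k magnitudes
        trimmed.append(np.where(np.abs(tv) >= thresh, tv, 0.0))
    stack = np.stack(trimmed)
    sign = np.sign(stack.sum(axis=0))                 # elected sign per weight
    agree = (np.sign(stack) == sign) & (stack != 0)   # components that agree
    counts = np.maximum(agree.sum(axis=0), 1)
    merged_tv = (stack * agree).sum(axis=0) / counts  # mean of agreeing parts
    return base + merged_tv

base = np.zeros(4)
models = [np.array([1.0,  0.2, -0.5, 0.0]),
          np.array([0.8, -0.1, -0.4, 0.0]),
          np.array([0.9,  0.0, -0.6, 0.1])]
print(ties_merge(base, models))  # components where all models agree: 0.9, 0.0, -0.5, 0.0
```

On these toy vectors, the first and third components (where all three models push in the same direction) survive and get averaged, while the noisy second component is trimmed away entirely; a plain average would instead keep a diluted trace of it.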

TIES is Really, Really Slow, and Also Evolutions or Something

TIES takes something like 10-15x longer than Model Stock. It has to calculate a bunch of fancy vectors and directions, and this is slow. In practice, this means it's even slower for me to iterate on evolutions. Speaking of which, which evolution is this? Well, this is where it gets weird, because the previous evolution was 13, but all of those were done as Model Stock merges, and I decided to switch to TIES out of the blue. That means basically none of the earlier evolutions are relevant to the analysis of this one, since changing the merging strategy basically rebases the whole project. Therefore, this is sort of evolution 1, despite having many models incorporated already. I'll be calling it evolution 1, but just know that this was the actual reality of the situation.