about
A merge between an L3 70b model (Dolphin 2.9.1) and an L3.1 70b base model (Tess 3), inspired by https://huggingface.co/sophosympatheia/New-Dawn-Llama-3.1-70B-v1.1 and https://huggingface.co/jukofyork/Dusk-Miqu-70B .
- I made 10 different versions and retained 3 release candidates (formerly in my NexesMess repo: this one was Flipper 0.38; Flipper 0.36 will become Tess Dolphin 1.1, and Flipper 0.32 will become Tess Dolphin 1.0).
- I (re?)added q_proj and k_proj (the RoPE base frequency is similar between L3 and L3.1), plus input_layernorm and post_attention_layernorm (why not?) to Sophos' mix, so all tensors are merged the same way (I guess I could just have gone layer-wide then, but whatever lol).
- This v1.2 follows a quasi-triangular merge gradient, from layer 4 to layer 75. It has a relatively high perplexity (3.95). It is the most "Dolphin-imprinted" version, for kicks & giggles.
- The v1.1 leaves the first 8 and last 8 layers of Tess untouched. It has an intermediate perplexity (3.75). A solid and balanced standalone version.
- The v1.0 leaves the first 16 and last 16 layers of Tess untouched. It has the lowest perplexity (3.55). Probably the most "mergeable" version with L3.1 and L3.3 models.
- Beyond the maths I don't master (merge stuff, RoPE stuff, etc.), one reason for that bump might be that the closer you get to Dolphin (and thus to L3 70b), the higher the perplexity, because Llama 3 70b itself had a perplexity above 4.5 (unless my quants were stale lol), even though Dolphin went way lower (less than 3.65).
- And why Tess 3 as a base? Simple: I wanted a low-perplexity base to compensate for the L3 models' high PPL, and Tess 3 70b is the best L3.1 instruct-capable model at this game (2.90 PPL 512 Wikitext Eng).
- On the graph: purple is v1.2 (4+4 untouched layers), red is v1.1 (8+8 untouched layers), green is v1.0 (16+16 untouched layers).
- You can multiply the values on the graph by 2, because that's the epsilon I used, in the way demonstrated by the aforementioned authors. (A sketch of how these per-tensor gradients spread over the 80 layers follows below.)
Caution: this model is VASTLY uncensored.
And good news: the grammar problems mentioned by mergers (notably between L2/Miqu and L3 models) when some of the first 16 layers (not to mention the first 8) are merged seem to be extremely limited on the 3 versions.
- The hiccups I observed are within the margin of error of what I see on many L3.1/L3.3 finetunes.
As for the intelligence of the models, well, I might have seen better with pure L3.1/L3.3 models & merges, but it's not debilitated. As for creativity, well, these L3/L3.1 merges might not be the smartest ducks in the pond, but they are colorful alright!
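To make those gradient shapes concrete, here is a minimal sketch, assuming mergekit's usual behavior of linearly interpolating a gradient list over the layer indices (the layer count and variable names are illustrative, not taken from mergekit):

```python
# Minimal sketch (not mergekit's actual code): how a 21-point gradient list is
# assumed to spread over the 80 layers of a 70b model via linear interpolation.
import numpy as np

N_LAYERS = 80                            # Llama 3 / 3.1 70b depth
GRADIENT = [0.0] + [1.0] * 19 + [0.0]    # the weight gradient applied to Dolphin's tensors in v1.2

# Anchor the 21 gradient points evenly across the layer range, then interpolate.
anchors = np.linspace(0, N_LAYERS - 1, num=len(GRADIENT))
per_layer_weight = np.interp(np.arange(N_LAYERS), anchors, GRADIENT)

for layer in list(range(6)) + list(range(N_LAYERS - 6, N_LAYERS)):
    print(f"layer {layer:2d}: dolphin weight ~ {per_layer_weight[layer]:.2f}")
# Only the first ~4 and last ~4 layers taper toward pure Tess, which is where the
# "4+4 untouched layers" description of v1.2 (merged from layer 4 to 75) comes from.
```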
benchs
- PPL 512 Wikitext Eng: 3.95 (passable); v1.1 does 3.75, v1.0 does 3.55. (A measurement sketch follows this list.)
- ARC-C: 58.85 (average ++); v1.1 does 61.85, v1.0 does 61.20.
- ARC-E: 76.15 (average); v1.1 does 80, v1.0 does 78.75.
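For reference, a hedged sketch of how a PPL-512 figure like the one above can be reproduced on WikiText-2 with transformers. The repo id is a placeholder, and the numbers above were presumably measured on quantized GGUFs (llama.cpp style), so absolute values from this script may differ:

```python
# Rough PPL@512 measurement on WikiText-2 (test split) with non-overlapping windows.
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-namespace/Tess-Dolphin-1.2"   # placeholder repo id
CTX = 512                                      # context length used for the reported PPL

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids[0]

nlls, n_tokens = [], 0
with torch.no_grad():
    for start in range(0, ids.size(0) - CTX, CTX):       # non-overlapping 512-token windows
        chunk = ids[start:start + CTX].unsqueeze(0).to(model.device)
        out = model(chunk, labels=chunk)                 # loss = mean NLL over the window
        nlls.append(out.loss.float() * (CTX - 1))
        n_tokens += CTX - 1

print(f"PPL@{CTX}: {math.exp(torch.stack(nlls).sum().item() / n_tokens):.2f}")
```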
tests
Let's see if it can hold long context (testing ongoing):
- At 10k, it holds coherence.
- At 20k, it holds coherence.
- At 28k, it holds coherence.
- That's good enough to validate this release, because the merged material is an 8k-context L3, not even a Miqu (32k context).
merge
This is a merge of pre-trained language models created using mergekit.
Merge Details
Merge Method
This model was merged using the Linear DELLA merge method, with migtissera/Tess-3-Llama-3.1-70B as the base.
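For readers unfamiliar with the method, here is a rough toy illustration of the della_linear idea as I understand it. This is not mergekit's implementation; the magnitude-to-probability mapping and the rescaling are simplified assumptions meant only to show what density and epsilon control:

```python
# Toy della_linear-style merge of a single tensor (illustration only).
import torch

def della_linear_toy(base: torch.Tensor, tuned: torch.Tensor,
                     weight: float, density: float, epsilon: float,
                     lam: float = 1.0) -> torch.Tensor:
    delta = tuned - base
    base_drop = 1.0 - density                    # e.g. density 0.5 -> drop ~50% of deltas

    # Rank deltas by magnitude (rank 0 = smallest). Small deltas get the highest drop
    # probability (base_drop + epsilon), large deltas the lowest (base_drop - epsilon).
    ranks = delta.abs().flatten().argsort().argsort().float()
    ranks = ranks / max(ranks.numel() - 1, 1)
    drop_p = ((base_drop + epsilon) - 2.0 * epsilon * ranks).clamp(0.0, 1.0).reshape(delta.shape)

    keep = (torch.rand_like(drop_p) >= drop_p).float()
    rescaled = delta * keep / (1.0 - drop_p).clamp_min(1e-6)   # rescale survivors to stay unbiased
    return base + lam * weight * rescaled

# One tensor with the mid-layer settings from the config below:
merged = della_linear_toy(torch.randn(8, 8), torch.randn(8, 8),
                          weight=1.0, density=0.5, epsilon=0.1)
```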
Models Merged
The following models were included in the merge:
- cognitivecomputations/dolphin-2.9.1-llama-3-70b
Configuration
The following YAML configuration was used to produce this model:
merge_method: della_linear
base_model: migtissera/Tess-3-Llama-3.1-70B
models:
- model: cognitivecomputations/dolphin-2.9.1-llama-3-70b
parameters:
weight:
- filter: q_proj
value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
- filter: k_proj
value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
- filter: v_proj
value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
- filter: o_proj
value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
- filter: input_layernorm
value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
- filter: up_proj
value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
- filter: gate_proj
value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
- filter: down_proj
value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
- filter: post_attention_layernorm
value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
- value: 0
density: 0.5
epsilon: 0.1
lambda: 1.0
- model: migtissera/Tess-3-Llama-3.1-70B
parameters:
weight: 1.0
density:
- filter: q_proj
value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
- filter: k_proj
value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
- filter: v_proj
value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
- filter: o_proj
value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
- filter: input_layernorm
value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
- filter: up_proj
value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
- filter: gate_proj
value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
- filter: down_proj
value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
- filter: post_attention_layernorm
value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
- value: 0.5
epsilon:
- filter: q_proj
value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
- filter: k_proj
value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
- filter: v_proj
value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
- filter: o_proj
value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
- filter: input_layernorm
value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
- filter: up_proj
value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
- filter: gate_proj
value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
- filter: down_proj
value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
- filter: post_attention_layernorm
value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
- value: 0.1
lambda: 1.0
dtype: bfloat16
out_dtype: bfloat16
parameters:
int8_mask: true
normalize: true
rescale: true
filter_wise: false
chat_template: auto
tokenizer:
source: union
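To reproduce the merge, the configuration above can be fed to mergekit. Below is a sketch following the pattern of mergekit's documented Python API; option names and paths are placeholders and may vary across mergekit versions (the `mergekit-yaml config.yaml ./output` CLI is the simpler route):

```python
# Sketch: run the YAML config above through mergekit's Python API.
import yaml
from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

with open("tess-dolphin-1.2.yaml", "r", encoding="utf-8") as f:   # the config shown above
    merge_config = MergeConfiguration.model_validate(yaml.safe_load(f))

run_merge(
    merge_config,
    out_path="./Tess-Dolphin-1.2",   # output directory for the merged weights
    options=MergeOptions(
        cuda=True,                   # do the tensor arithmetic on GPU if available
        copy_tokenizer=True,         # keep tokenizer files (the config asks for a union tokenizer)
        lazy_unpickle=True,          # stream shards instead of loading everything into RAM
        low_cpu_memory=True,
    ),
)
```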