about
A merge between an L3 70b model (Dolphin 2.9.1) and an L3.1 70b base model (Tess 3), inspired by https://huggingface.co/sophosympatheia/New-Dawn-Llama-3.1-70B-v1.1 and https://huggingface.co/jukofyork/Dusk-Miqu-70B .
- I made 10 different versions and retained 3 release candidates (formerly in my NexesMess repo: this one was Flipper 0.38; Flipper 0.36 will become Tess Dolphin 1.1, and Flipper 0.32 will become Tess Dolphin 1.0).
- I (re?)added q_proj and k_proj (the RoPE base frequency is similar between L3 and L3.1), plus input_layernorm and post_attention_layernorm (why not?) to Sophos' mix, so all tensors are merged the same way (I guess I could just have gone layer-wide then, but whatever lol).
- This v1.2 follows a quasi-triangular merge gradient, from layer 4 to layer 75. It has a relatively high perplexity (3.95). It is the most "Dolphin-imprinted" version, for kicks & giggles.
- The v1.1 leaves the first 8 and last 8 layers of Tess untouched. It has an intermediate perplexity (3.75). A solid and balanced standalone version.
- The v1.0 leaves the first 16 and last 16 layers of Tess untouched. It has the lowest perplexity (3.55). Probably the most "mergeable" version with L3.1 and L3.3 models.
- Beyond the maths I don't master (merge stuff, RoPE stuff, etc.), one reason for that bump might be that the closer you get to Dolphin (and thus to L3 70b), the higher the perplexity, because Llama 3 70b itself had a perplexity above 4.5 (unless my quants were stale lol), even though Dolphin went way lower (less than 3.65).
- And why Tess 3 as a base? Simple: I wanted a low-perplexity base to compensate for the L3 models' high PPL, and Tess 3 70b is the best L3.1 instruct-capable model at this game (2.90 PPL 512 Wikitext Eng).
- On the graph: purple is v1.2 (4+4 untouched layers), red is v1.1 (8+8 untouched layers), green is v1.0 (16+16 untouched layers).
- You can multiply the values on the graph by 2, because that's the epsilon I used, in the way demonstrated by the aforementioned authors. (A sketch of how these per-tensor gradients spread over the 80 layers follows below.)
Caution: this model is VASTLY uncensored.
And good news: the grammar problems mentioned by mergers (notably between L2/Miqu and L3 models) when some of the first 16 layers (not to mention the first 8) are merged seem to be extremely limited on the 3 versions.
- The hiccups I observed are within the margin of error of what I see on many L3.1/L3.3 finetunes.
As for the intelligence of the models, well, I might have seen better with pure L3.1/L3.3 models & merges, but it's not debilitated. As for creativity, well, these L3/L3.1 merges might not be the smartest ducks in the pond, but they are colorful alright!
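To make those gradient shapes concrete, here is a minimal sketch, assuming mergekit's usual behavior of linearly interpolating a gradient list over the layer indices (the layer count and variable names are illustrative, not taken from mergekit):

```python
# Minimal sketch (not mergekit's actual code): how a 21-point gradient list is
# assumed to spread over the 80 layers of a 70b model via linear interpolation.
import numpy as np

N_LAYERS = 80                            # Llama 3 / 3.1 70b depth
GRADIENT = [0.0] + [1.0] * 19 + [0.0]    # the weight gradient applied to Dolphin's tensors in v1.2

# Anchor the 21 gradient points evenly across the layer range, then interpolate.
anchors = np.linspace(0, N_LAYERS - 1, num=len(GRADIENT))
per_layer_weight = np.interp(np.arange(N_LAYERS), anchors, GRADIENT)

for layer in list(range(6)) + list(range(N_LAYERS - 6, N_LAYERS)):
    print(f"layer {layer:2d}: dolphin weight ~ {per_layer_weight[layer]:.2f}")
# Only the first ~4 and last ~4 layers taper toward pure Tess, which is where the
# "4+4 untouched layers" description of v1.2 (merged from layer 4 to 75) comes from.
```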
benchs
- PPL 512 Wikitext Eng: 3.95 (passable); v1.1 does 3.75, v1.0 does 3.55. (A measurement sketch follows this list.)
- ARC-C: 58.85 (average ++); v1.1 does 61.85, v1.0 does 61.20.
- ARC-E: 76.15 (average); v1.1 does 80, v1.0 does 78.75.
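For reference, a hedged sketch of how a PPL-512 figure like the one above can be reproduced on WikiText-2 with transformers. The repo id is a placeholder, and the numbers above were presumably measured on quantized GGUFs (llama.cpp style), so absolute values from this script may differ:

```python
# Rough PPL@512 measurement on WikiText-2 (test split) with non-overlapping windows.
import math
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "your-namespace/Tess-Dolphin-1.2"   # placeholder repo id
CTX = 512                                      # context length used for the reported PPL

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids[0]

nlls, n_tokens = [], 0
with torch.no_grad():
    for start in range(0, ids.size(0) - CTX, CTX):       # non-overlapping 512-token windows
        chunk = ids[start:start + CTX].unsqueeze(0).to(model.device)
        out = model(chunk, labels=chunk)                 # loss = mean NLL over the window
        nlls.append(out.loss.float() * (CTX - 1))
        n_tokens += CTX - 1

print(f"PPL@{CTX}: {math.exp(torch.stack(nlls).sum().item() / n_tokens):.2f}")
```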
tests
Let's see if it can hold long context (testing ongoing):
- At 10k, it holds coherence.
- At 20k, it holds coherence.
- At 28k, it holds coherence.
- That's good enough to validate this release, because the merged material is an 8k-context L3, not even a Miqu (32k context).
merge
This is a merge of pre-trained language models created using mergekit.
Merge Details
Merge Method
This model was merged using the Linear DELLA merge method, with migtissera/Tess-3-Llama-3.1-70B as the base.
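For readers unfamiliar with the method, here is a rough toy illustration of the della_linear idea as I understand it. This is not mergekit's implementation; the magnitude-to-probability mapping and the rescaling are simplified assumptions meant only to show what density and epsilon control:

```python
# Toy della_linear-style merge of a single tensor (illustration only).
import torch

def della_linear_toy(base: torch.Tensor, tuned: torch.Tensor,
                     weight: float, density: float, epsilon: float,
                     lam: float = 1.0) -> torch.Tensor:
    delta = tuned - base
    base_drop = 1.0 - density                    # e.g. density 0.5 -> drop ~50% of deltas

    # Rank deltas by magnitude (rank 0 = smallest). Small deltas get the highest drop
    # probability (base_drop + epsilon), large deltas the lowest (base_drop - epsilon).
    ranks = delta.abs().flatten().argsort().argsort().float()
    ranks = ranks / max(ranks.numel() - 1, 1)
    drop_p = ((base_drop + epsilon) - 2.0 * epsilon * ranks).clamp(0.0, 1.0).reshape(delta.shape)

    keep = (torch.rand_like(drop_p) >= drop_p).float()
    rescaled = delta * keep / (1.0 - drop_p).clamp_min(1e-6)   # rescale survivors to stay unbiased
    return base + lam * weight * rescaled

# One tensor with the mid-layer settings from the config below:
merged = della_linear_toy(torch.randn(8, 8), torch.randn(8, 8),
                          weight=1.0, density=0.5, epsilon=0.1)
```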
Models Merged
The following models were included in the merge:
- cognitivecomputations/dolphin-2.9.1-llama-3-70b
Configuration
The following YAML configuration was used to produce this model:
merge_method: della_linear
base_model: migtissera/Tess-3-Llama-3.1-70B
models:
- model: cognitivecomputations/dolphin-2.9.1-llama-3-70b
parameters:
weight:
- filter: q_proj
value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
- filter: k_proj
value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
- filter: v_proj
value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
- filter: o_proj
value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
- filter: input_layernorm
value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
- filter: up_proj
value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
- filter: gate_proj
value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
- filter: down_proj
value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
- filter: post_attention_layernorm
value: [0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
- value: 0
density: 0.5
epsilon: 0.1
lambda: 1.0
- model: migtissera/Tess-3-Llama-3.1-70B
parameters:
weight: 1.0
density:
- filter: q_proj
value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
- filter: k_proj
value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
- filter: v_proj
value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
- filter: o_proj
value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
- filter: input_layernorm
value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
- filter: up_proj
value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
- filter: gate_proj
value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
- filter: down_proj
value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
- filter: post_attention_layernorm
value: [1, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1]
- value: 0.5
epsilon:
- filter: q_proj
value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
- filter: k_proj
value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
- filter: v_proj
value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
- filter: o_proj
value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
- filter: input_layernorm
value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
- filter: up_proj
value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
- filter: gate_proj
value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
- filter: down_proj
value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
- filter: post_attention_layernorm
value: [0, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0]
- value: 0.1
lambda: 1.0
dtype: bfloat16
out_dtype: bfloat16
parameters:
int8_mask: true
normalize: true
rescale: true
filter_wise: false
chat_template: auto
tokenizer:
source: union
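To reproduce the merge, the configuration above can be fed to mergekit. Below is a sketch following the pattern of mergekit's documented Python API; option names and paths are placeholders and may vary across mergekit versions (the `mergekit-yaml config.yaml ./output` CLI is the simpler route):

```python
# Sketch: run the YAML config above through mergekit's Python API.
import yaml
from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

with open("tess-dolphin-1.2.yaml", "r", encoding="utf-8") as f:   # the config shown above
    merge_config = MergeConfiguration.model_validate(yaml.safe_load(f))

run_merge(
    merge_config,
    out_path="./Tess-Dolphin-1.2",   # output directory for the merged weights
    options=MergeOptions(
        cuda=True,                   # do the tensor arithmetic on GPU if available
        copy_tokenizer=True,         # keep tokenizer files (the config asks for a union tokenizer)
        lazy_unpickle=True,          # stream shards instead of loading everything into RAM
        low_cpu_memory=True,
    ),
)
```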