---
license: mit
base_model:
- unsloth/phi-4
library_name: transformers
---
# WAIDWML - What Am I Doing With My Life?
### (8 Phi-4s in a trenchcoat)

## Rationale
So there I was, full of inspiration to tune stuff but lacking the disposable funds to do anything with the larger models. Enter Phi-4, a model designed for productivity...
Initially, it was just going to be a sequential series of finetunes, starting from the baseline Phi-4 and gradually adding more datasets until I either got bored or it got good. But then I had an idea: what if I just MoE'd it?

Yeah.

As a proof of concept, this wasn't too bad. The end result is... _interesting_, to say the least.

## Training
As mentioned above, this was done in "phases", each with a separate dataset. Most were done with a `max_seq_length` of 32k; a few were dropped to 16k to make sure they fit on the hardware.

`lr` was all over the place, but in general somewhere between `1e-5` and `4e-6`. These were all separate LoRAs using `r=64` and `alpha=32` with rsLoRA enabled. `epochs` were 2 or 3 for everything except `c2`, as that'd have taken far too long. (A rough sketch of a single phase follows the dataset list below.)

- `p1`: Private RP dataset (`RPT-Varied-Small`)
- `p2`: `TheDrummer/AmoralQA-v2`
- `p3`: `AIRRC/Eudaimonic`
- `p4`: Two private RP datasets (`cc-gpt4-sfw-sharegpt` & `cc-gpt4-nsfw-sharegpt`)
- `p5`: A random subset of the infamous "`c2`"-logs dataset, cleaned and deduped (approx. 30%)
- `p6`: Private RP dataset (`RPT-Varied-Small_v1.5`)
- `p7`: `NewEden/PIPPA-Mega-Filtered`
- `p8`: `Squish42/bluemoon-fandom-1-1-rp-cleaned`

(Note: the `RPT-Varied-Small` and `RPT-Varied-Small_v1.5` datasets are due to be released after I manually verify their fitness.)
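
For the curious, a single phase looked roughly like the following. This is a minimal sketch assuming recent `peft`/`trl` (not my actual script); the output path is a placeholder, and `lr`, epochs, and sequence length varied per phase as noted above.

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoTokenizer
from trl import SFTConfig, SFTTrainer

BASE = "unsloth/phi-4"
tokenizer = AutoTokenizer.from_pretrained(BASE)

# r=64 / alpha=32 with rank-stabilized scaling (alpha / sqrt(r) instead of alpha / r)
peft_config = LoraConfig(
    r=64,
    lora_alpha=32,
    use_rslora=True,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="./phi4-p2",    # one adapter per phase; path is a placeholder
    max_seq_length=32768,      # dropped to 16384 where it didn't fit
    learning_rate=1e-5,        # phases ranged from roughly 1e-5 down to 4e-6
    num_train_epochs=2,        # 2-3 everywhere except the c2 subset
    bf16=True,
)

trainer = SFTTrainer(
    model=BASE,
    args=args,
    train_dataset=load_dataset("TheDrummer/AmoralQA-v2", split="train"),
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```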

Once all LoRAs were trained, I separately merged each of them into the base model, then used [mergekit](https://github.com/arcee-ai/mergekit) [(config)](https://huggingface.co/rAIfle/WAIDWML-Phi4-8x14B-bf16/blob/main/mergekit_moe_config.yml) to "merge" the results into a MoE. I chose to initialize the router randomly, as I was going to train that part later. After that, I trained the routing layers for 8 epochs with `lr = 1e-6` and `grimulkan/LimaRP-augmented` as the dataset. It took roughly 8.5 hours on a 6xA40 instance on RunPod.
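
For illustration, the mergekit-moe config was along these lines (the real one is linked above; the per-phase merge paths here are placeholders):

```yaml
base_model: unsloth/phi-4
gate_mode: random        # random router init; the routing layers get trained afterwards
dtype: bfloat16
experts:
  - source_model: ./phi4-p1
  - source_model: ./phi4-p2
  - source_model: ./phi4-p3
  - source_model: ./phi4-p4
  - source_model: ./phi4-p5
  - source_model: ./phi4-p6
  - source_model: ./phi4-p7
  - source_model: ./phi4-p8
```

Training just the routers afterwards boils down to freezing everything else first, roughly like this (the `.gate.` substring is a guess at the naming, not the verified parameter path):

```python
# Freeze everything except the per-layer routing gates before the 8-epoch
# LimaRP pass. ".gate." is an assumed name; check model.named_parameters()
# for the actual router weights in the merged model.
for name, param in model.named_parameters():
    param.requires_grad = ".gate." in name
```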

## Recommended Settings
Phi-4 instruct format. What I used for my tests:
- Temp 1
- minP 0.05
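
If you're driving it directly through `transformers`, those settings map to something like this sketch (`min_p` needs a reasonably recent version; the prompt is just an example):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "rAIfle/WAIDWML-Phi4-8x14B-bf16"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto"
)

# The Phi-4 chat format is applied via the bundled chat template
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hi there."}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(
    input_ids,
    do_sample=True,
    temperature=1.0,   # Temp 1
    min_p=0.05,        # minP 0.05
    max_new_tokens=256,
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```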


## FAQ
```
Q: Why not do anything constructive, like GRPO-tune a model of usable size?
A: Where's the fun in that?

Q: Are you, like, okay?
A: Objectively? Probably not. Subjectively? Never better.

Q: You know this still sucks for RP, right?
A: Yup. Should have pivoted to reasoning and code once R1 hit, but sunk cost and all kept me on this trajectory.
```