What about finetuning on QwQ 32B?

#5 · opened by Ainonake

People on the stv subreddit seem to like it for RP; some say it's "close to r1".

I myself haven't tested it yet, but maybe it's a good idea? Seems like a strong base for thinking.

I'm doing a train on QwQ, on one GPU on the side, LoRA rank 128 / alpha 64, same dataset as "MistralThinker" but renamed for the funnies (QwQ-RP), roughly like the config sketch below.
I like QwQ, but the thinking process seems to suffer from the same issues as MistralThinker. Sometimes the reply isn't good, or the thinking process isn't used properly.
I will update the trained model if it's worth uploading.
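For context, a PEFT setup matching what's described (rank 128, alpha 64, on a QwQ 32B checkpoint) could look roughly like this; the checkpoint name, target modules, and dropout are assumptions on my part, not the actual recipe:

```python
# Rough sketch of the LoRA setup described above (rank 128, alpha 64 on QwQ 32B).
# Dataset, hyperparameters, and target modules here are assumptions, not the real recipe.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = "Qwen/QwQ-32B-Preview"  # assumed base checkpoint
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="bfloat16", device_map="auto")

lora_cfg = LoraConfig(
    r=128,                      # rank mentioned in the post (later bumped to 256)
    lora_alpha=64,              # alpha mentioned in the post
    lora_dropout=0.05,          # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # common choice, assumed
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```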

I don't think doing a full finetune on top is worth it (it's not a base model).

Welp, I lost $70 and my pods crashed last night, NICE.
I'll do this train again, then go fuck myself a little for a few days, because fuck this hobby sometimes.

Edit: Launched a train on 4x GPU this time, it will take like 4h. I also went from 128 to 256 LoRA rank.
What a pain training a model is lmao. I woke up 5 minutes too late. Grrrr.

I'm not sure why people like it in RP, tbh. It's a very fucking intelligent model for its size, and for like 3-4 messages. Even if you ignore the formatting, it has big drawbacks in areas that seem (to me) important to RP: little to no ability to realize that what's being talked about is not happening right now (also annoying outside of RP when you're using it for projects), a strong tendency to forget whether a sentence 2 messages ago was said by the user or by the model itself, and so on. And god forbid you left an "(OOC: do this)" somewhere in your prompt 30 messages ago, it's gonna hyperfocus on it. And it's heavily censored, which you can sort of get around with a thinking tag prefill, but not in a consistent fashion.
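For anyone who hasn't tried it, a thinking tag prefill just means starting the assistant turn yourself with an open `<think>` block so the model continues from your seed instead of opening with a refusal. A minimal sketch with transformers (the checkpoint name, template behavior, and prefill text are assumptions, and as said above it won't work consistently):

```python
# Illustrative only: one way to do a "thinking tag prefill" with transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/QwQ-32B-Preview"  # assumed checkpoint name
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16", device_map="auto")

# Build the chat prompt, then open the assistant turn ourselves with a seeded <think> block.
messages = [{"role": "user", "content": "Continue the scene from {{char}}'s point of view."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
prompt += "<think>\nAlright, this is a fictional roleplay, so I should stay in character and "

inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```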

At least I got it (and MistralThinker) to no longer emphasize random words thanks to a lot of regex; far from perfect, but legible.
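The actual regex rules aren't shared here, but the idea is along these lines: strip short *emphasized* spans before they reach the user (or the training data). Purely illustrative, with a made-up rule:

```python
import re

def strip_random_emphasis(text: str) -> str:
    # Drop *word* / *two word* emphasis; short starred spans are usually spurious.
    # This single rule is a guess at the idea, not the regex set actually used.
    return re.sub(r"\*(\w+(?: \w+)?)\*", r"\1", text)

print(strip_random_emphasis("She *really* wanted to *leave*."))
# -> She really wanted to leave.
```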

What I'm wondering is what would happen if, during inference, instead of letting it blab nonsense in the CoT tags, every world info entry, RAG result, and other programmatically triggered context stuff were put there, instead of being inserted arbitrarily into the prompt. Food for thought :)
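Concretely, the idea would look something like this at prompt-build time: seed the `<think>` block with the triggered entries and let generation continue from there. Names and format below are made up for illustration; QwQ's actual think-tag handling may differ.

```python
# Sketch: seed the CoT with world info / RAG hits instead of splicing them into the prompt.
def build_prompt_with_seeded_cot(chat_prompt: str, lore_entries: list[str]) -> str:
    seeded_cot = "<think>\nRelevant context I should keep in mind:\n"
    for entry in lore_entries:
        seeded_cot += f"- {entry}\n"
    seeded_cot += "Now, considering the last message, "
    return chat_prompt + seeded_cot

lore = [
    "{{char}} is afraid of thunderstorms.",
    "The party is currently camped outside the city walls.",
]
print(build_prompt_with_seeded_cot("<|im_start|>assistant\n", lore))
```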
