good
I think it achieves the goal. It plays roleplay more interestingly than basic Command R 2024 32B and is not as lewd and horny as Star Command. I tried the exl2 4bpw quant and it works OK, though it would be nice to have some GGUFs at higher precision (like Q6).
I can make a Q6 soon, if one of the GGUF quantizers doesn't pick it up before then.
Is iMatrix worth using on Q6 these days? Guess I'll see...
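For reference, the two paths I'd compare look roughly like this; a sketch only, since the script/binary names depend on the llama.cpp version and all the file names here are placeholders:

```python
# Sketch: Q6_K with and without an importance matrix, driven from Python.
# Binary/script names follow recent llama.cpp builds; adjust paths as needed.
import subprocess

MODEL_DIR = "path/to/hf-model"   # placeholder HF-format model directory
F16 = "model-f16.gguf"

# 1. Convert the HF checkpoint to a full-precision GGUF.
subprocess.run(["python", "convert_hf_to_gguf.py", MODEL_DIR,
                "--outfile", F16, "--outtype", "f16"], check=True)

# 2a. Plain Q6_K, no imatrix.
subprocess.run(["./llama-quantize", F16, "model-Q6_K.gguf", "Q6_K"], check=True)

# 2b. Q6_K with an imatrix: compute it from calibration text first,
#     then hand it to the quantizer.
subprocess.run(["./llama-imatrix", "-m", F16,
                "-f", "calibration.txt", "-o", "imatrix.dat"], check=True)
subprocess.run(["./llama-quantize", "--imatrix", "imatrix.dat",
                F16, "model-Q6_K-imat.gguf", "Q6_K"], check=True)
```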
Not sure about imatrix and Q6. From what I have heard it produces different results, but whether it's actually worth it/better I don't know. Generally I only use imatrix quants up to 4-bit.
I tried the exl2 4bpw a bit more. It is not bad, but it is very inconsistent, often contradicting itself, sometimes even within one message. I don't know if it is because of such a low quant or if it is a general problem with this Star Command finetune. Either way it is unfortunate, because the exl2 4bpw quant could in theory be used for 60-80k context with 24GB VRAM, but if it can't really follow even 8k, then long context is not very useful.
Is the regular command R at 4bpw working better in that respect?
I've been trying both back to back, still not sure yet. I feel like both mess up in different areas, and like you said it may be due to the extreme quantization.
One thing I've found, btw, is that Command R likes extremely low temperature, especially if you use quadratic smoothing or something. It gets non-deterministic even below 0.1, though I'm not sure about an optimal spot or anything.
Another note, the HF to GGUF conversion script errors out with this model for some reason:
ValueError: Can not map tensor 'lm_head.weight'
But not with regular command R?
There's also another quirk I discovered earlier where this raw model is a few gigabytes larger than regular Command R. It seemed to quantize to exl2 fine and end up at the same size, so I wrote it off... but now I'm not so sure. A linear merge seems to have the same result, as does manually specifying a tokenizer source.
Something might be messed up with mergekit and Command-R, not sure yet.
Yeah, you are probably right; checking my notes from my original testing of c4ai-command-r-08-2024 32B Q6_K_L, I wrote "seems not very smart".
I did not try low temperature (actually, later I went on to experiment with higher temperature, as base Command R 2024 often just gets stuck in a scene, but yes, that makes it even more chaotic). But maybe low temperature can work for Star Command / Lite, as they are not as dry and repetitive as base Command R 2024.
I no longer use quadratic smoothing. In general I try not to mess with the token distribution much nowadays (except temperature, a low min-p like 0.02 to remove the tail, and DRY). As someone pointed out, the models train hard for a long time on insane HW to learn token predictions. A simple sampler function that changes the distribution is not going to improve that; more likely it will just mess with what they learned.
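To be concrete about "remove the tail": min-p just drops every token whose probability falls below some fraction of the top token's probability. A minimal sketch, not any particular backend's implementation:

```python
import torch

def min_p_filter(logits: torch.Tensor, min_p: float = 0.02, temperature: float = 1.0) -> torch.Tensor:
    """Apply temperature, then mask out tokens whose probability is below
    min_p times the probability of the most likely token."""
    logits = logits / max(temperature, 1e-5)
    probs = torch.softmax(logits, dim=-1)
    threshold = min_p * probs.max(dim=-1, keepdim=True).values
    return logits.masked_fill(probs < threshold, float("-inf"))

# Sampling from the filtered distribution:
# next_token = torch.multinomial(torch.softmax(min_p_filter(logits, 0.02, 0.8), dim=-1), 1)
```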
A simple sampler function that changes the distribution is not going to improve that; more likely it will just mess with what they learned.
I tend to agree with this. That being said, there's no "true" distribution, as sampling is largely picking a not-most-likely token to keep it from looping... but now that you say it, I will try skipping the distribution warping.
But yes, I find this model does not like a lot of rep penalty, nor a lot of temperature (I am using 0.05 for short completions atm). Unfortunately I am not using DRY atm, as text-generation-webui is mega laggy at long context :(
Another note, the HF to GGUF conversion script errors out with this model for some reason:
...
There's also another quirk I discovered earlier where this raw model is a few gigabytes larger than regular Command R.
...
Something might be messed up with mergekit and Command-R, not sure yet.
@Downtown-Case
I looked into this issue here and mradermacher was able to make GGUF quants with my suggestion.
Thanks!
I think it may have been a bug with mergekit, actually. It's possible the model is a little off, but I am waiting for another Command-R finetune before trying a new merge.
I think it may have been a bug with mergekit, actually.
Yes, I linked a mergekit issue in my comment; they only fixed it for Gemma by marking lm_head.weight optional, but they did not do it for Command-R, which does the same thing.
It's possible the model is a little off
The only things preventing it from being made into a GGUF were the lm_head.weight created by mergekit, which is redundant since embed_tokens.weight contains the same data, and the HF-to-GGUF script not accepting that.
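For reference, the cleanup is basically just dropping that tensor before conversion. Something like this works (a sketch that assumes a single-file safetensors checkpoint; a sharded merge would need the shard holding lm_head.weight and its index edited instead):

```python
import torch
from safetensors.torch import load_file, save_file

path = "model.safetensors"  # placeholder; adjust for your checkpoint layout
state = load_file(path)

# mergekit wrote out lm_head.weight even though Command-R ties it to the input
# embeddings, so it should be identical to embed_tokens.weight.
lm_head = state.get("lm_head.weight")
embed = state.get("model.embed_tokens.weight")
if lm_head is not None and embed is not None:
    assert torch.equal(lm_head, embed), "weights are not actually tied!"
    del state["lm_head.weight"]        # drop the redundant tensor
    save_file(state, path, metadata={"format": "pt"})
```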
If your exl2 version works, the GGUFs that were made should also work. I haven't tested them yet, but I do plan to.
Ah, good. Thanks! The exl2 version does indeed work fine.
@tdh111 BTW, if you are looking out for more Command-R finetunes, one I have my eye on is here: https://huggingface.co/jukofyork/creative-writer-v0.1-alfa-35b/discussions
The tuner has only modified the old command-r, but stated the new version with GQA should be next.
BTW, if you are looking out for more Command-R finetunes, one I have my eye on is here: https://huggingface.co/jukofyork/creative-writer-v0.1-alfa-35b/discussions
The tuner has only modified the old command-r, but stated the new version with GQA should be next.
Thanks for the recommendation. I found this on that page: "The dataset consisted of approximately 1000 pre-2012 books", which makes me pretty interested.
On that note, at the mid-30B size do you now prefer EVA (and other Qwen-based stuff) to Command-R stuff? Or do you think they have different tradeoffs? (Also, v0.2 of EVA came out, which seems to be just a strict improvement over 0.1.)
My first impression of the new GQA 35B Command-R was that it lacked, or severely reduced, the unique flavor that the original Command-R had; now it just feels like another synthetic-data-trained LLM. I haven't tried most of the new stuff that's been coming out, so maybe there is something better, but I still find myself constantly going back to Midnight Miqu. It is pretty bad at picking up style from context, is not that smart, lacks context size, and is much slower on my machine, but it writes well, isn't boring, and isn't absurdly horny like so many of the finetunes I've tried.
I've been using EVA (0.2 now) almost exclusively! It's great.
It feels like a base model, not a more slopped instruct tune, though it will still use instruct formatting.
It's legit great at 64K context, probably better than Command-R at that length, and much better than Qwen Instruct with YaRN. The tokenizer is also incredible; it packs tons of text into 64K.
It picks up and follows style from context very well.
It's not unreasonably horny or anything.
It's still slopped, and I'm not sure how it performs at 80K+ yet. TBH the jury is still out on intelligence/fandom knowledge and creativity, but I don't feel the need to go back to Star-Lite or anything.
I agree with much of the sentiment on the new Command R, though I think it's OK if its prompting format is fleshed out and it has a lot of context to draw on.
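(For anyone following along, the Command-R turn structure looks roughly like this; it's written from memory, so verify the exact special tokens against the model's tokenizer_config.json:)

```python
# Rough Command-R prompt skeleton; the preamble/user text are placeholders and
# the special-token spellings are from memory -- check the tokenizer config.
def command_r_prompt(preamble: str, user: str) -> str:
    return (
        "<BOS_TOKEN>"
        "<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>" + preamble + "<|END_OF_TURN_TOKEN|>"
        "<|START_OF_TURN_TOKEN|><|USER_TOKEN|>" + user + "<|END_OF_TURN_TOKEN|>"
        "<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"   # generation starts here
    )
```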
I've had good experiences with EVA (I also very recently tested DeepSeek V3 Base, and my very early impression is disappointment).
https://huggingface.co/nbeerbower/EVA-Gutenberg3-Qwen2.5-32B This just came out, from your suggestion. It also seems to be the first finetune of EVA-32B-v0.2; there are plenty of merges (I've tried one that had QwQ, Gutenberg, and EVA). I'm going to test it soon; it looks interesting.
I see you made two merges with it and the R1 distill; did you end up liking one merge over the other? I didn't like DeepSeek V3 (and Base), but R1 is really nice, and with some open PRs to llama.cpp it runs at a reasonable speed all the way to high context (only tested up to ~32K though).
Not sure which I prefer, or if either is better than pure R1 distil, still testing!
I do know that you need to treat the merges like R1, following the R1 <think> prompt formatting; otherwise they try to "think" instead of write.
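Roughly what I mean, as a sketch; the special-token spellings below are what I remember the DeepSeek-R1 distills using, so check the tokenizer_config.json / chat template of whatever merge you run:

```python
# R1-style turn with the think block pre-opened, so the model reasons inside
# <think>...</think> and then writes the actual prose after it.
def r1_prompt(instruction: str) -> str:
    return (
        "<｜User｜>" + instruction      # user turn (the fullwidth bars are intentional)
        + "<｜Assistant｜>"             # assistant turn starts here
        + "<think>\n"                   # pre-opened think block
    )
```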
You should also consider running it in exllama if you can, as it handles long context much better, which is especially important for a thinking model that would otherwise take forever to respond.
Not sure which I prefer, or if either is better than pure R1 distil, still testing!
You should also consider running it in exllama if you can, as it handles long context much better, which is especially important for a thinking model that would otherwise take forever to respond.
If I end up back on my GPU, I'll try all three on exllama.
I do know that you need to treat the merges like R1, following the R1 <think> prompt formatting; otherwise they try to "think" instead of write.
R1 has been fun for me. It runs slow (down to ~1 t/s at 30K, from ~3 t/s at very low context); I'm working on improving that. I'm not sure if the distills also have the same recommendation, but removing the thinking, as recommended for multi-round use, causes a lot of prompt reprocessing, which also slows to a crawl.
From what I've heard the distills behave like R1, but still feel like the base models, and that makes sense as they are just SFT finetunes. R1, on the other hand, had RL training and does not feel like V3. It feels very unique, which is why it's been fun.
Also a bit of a random question, what happened to your reddit account?
R1 has been fun for me. It runs slow (down to ~1 t/s at 30K, from ~3 t/s at very low context); I'm working on improving that. I'm not sure if the distills also have the same recommendation, but removing the thinking, as recommended for multi-round use, causes a lot of prompt reprocessing, which also slows to a crawl.
The 32B version, on a 24GB card!? My friend, switch to exllama; it's faster than reading speed for me at 30K-50K in tests, and more is doable. Prompt processing is much faster too, especially if you allocate a bit of VRAM for 4096-token chunks and undervolt your card.
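If it helps, this is roughly how that looks when loading with exllamav2 directly; the attribute and class names are from the versions I've used, so treat it as a sketch and check them against your install:

```python
# Sketch: load an exl2 quant for long context with 4096-token prompt chunks
# and a quantized Q4 KV cache (names per recent exllamav2 versions I've used).
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache_Q4, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "path/to/model-exl2-4bpw"   # placeholder
config.prepare()
config.max_seq_len = 65536                     # target context length
config.max_input_len = 4096                    # process prompts in 4096-token chunks
config.max_attention_size = 4096 ** 2

model = ExLlamaV2(config)
cache = ExLlamaV2Cache_Q4(model, lazy=True)    # Q4 cache to fit long context in VRAM
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)
```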
I've heard the distills behave like R1, but still feel like the base models, and that makes sense as they are just SFT finetunes.
Agreed, and I'm happy with this because I like base Qwen way more than instruct. EVA-Gutenberg is really good in its writing niche for the same reason, which is why I thought to merge them. But so far the merges... kinda just feel like Deepseek 32B, which is why I'm hesitant to even make comparisons.
I'm not sure if the distills also have the same recommendation, but removing the thinking, as recommended for multi-round use, causes a lot of prompt reprocessing, which also slows to a crawl.
Yeah, I think this is more for uncached cloud usage, as I've been leaving the <think> sections in for a few turns and it seems OK. And it's not like a Mistral model that starts to wobble after 12K.
Also a bit of a random question, what happened to your reddit account?
Heh, I'm surprised anyone noticed. I got randomly banned, I think for posting a few huggingface links in a comment? I can reinstate it by resetting my password, but I don't use Reddit outside /r/locallama anymore, so I haven't gotten around to it yet.
R1 has been fun for me. It runs slow (down to ~1 t/s at 30K, from ~3 t/s at very low context); I'm working on improving that. I'm not sure if the distills also have the same recommendation, but removing the thinking, as recommended for multi-round use, causes a lot of prompt reprocessing, which also slows to a crawl.
The 32B version, on a 24GB card!? My friend, switch to exllama; it's faster than reading speed for me at 30K-50K in tests, and more is doable. Prompt processing is much faster too, especially if you allocate a bit of VRAM for 4096-token chunks and undervolt your card.
No, the 671B one, which is running CPU-only (hopefully only for now; I know ktransformers shows the potential for partial offload on the DeepSeek architecture). I do miss the speed of the smaller EVA, but the difference in intelligence and the fresh style have been fun for me. My 3090 can run a 32B fast, fully offloaded, on either llama.cpp or exllama; I just prefer ik_llama.cpp (a llama.cpp fork) for its SOTA quants.
I've heard the distills behave like R1, but still feel like the base models, and that makes sense as they are just SFT finetunes.
Agreed, and I'm happy with this because I like base Qwen way more than instruct. EVA-Gutenberg is really good in its writing niche for the same reason, which is why I thought to merge them. But so far the merges... kinda just feel like Deepseek 32B, which is why I'm hesitant to even make comparisons.
I haven't tried any of the distills; I've been so busy keeping up with optimizations and quanting of R1.
I'm not sure if the distills also have the same recommendation, but removing the thinking, as recommended for multi-round use, causes a lot of prompt reprocessing, which also slows to a crawl.
Yeah, I think this is more for uncached cloud usage, as I've been leaving the <think> sections in for a few turns and it seems OK. And it's not like a Mistral model that starts to wobble after 12K.
I've left them in for a bit, but it was never better and sometimes worse, so now I just remove them.
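The removal itself is trivial, for what it's worth; something like this strips the closed blocks from earlier turns before resending:

```python
import re

# Remove completed <think>...</think> blocks from prior assistant turns;
# an unclosed block in the latest turn is left untouched.
THINK_RE = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)

def strip_think(text: str) -> str:
    return THINK_RE.sub("", text)

# history = [strip_think(turn) for turn in history[:-1]] + [history[-1]]
```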
Also a bit of a random question, what happened to your reddit account?
Heh, I'm surprised anyone noticed. I got randomly banned, I think for posting a few huggingface links in a comment? I can reinstate it by resetting my password, but I don't use Reddit outside /r/locallama anymore, so I haven't gotten around to it yet.
You had informational posts, especially about long context.