Thanks!

#4 opened by MrDevolver

This is actually pretty useful, thanks!
Any chance to distill Qwen 3 235B into Qwen 3 30B A3B too? πŸ€”

I may do this in the future; right now I'm seeing what other optimizations can be done in the distillation process to transfer more knowledge into the student model.
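
For anyone curious what that process looks like in code, below is a minimal sketch of logit-space knowledge distillation in PyTorch. This is the textbook Hinton-style soft-target loss, not necessarily the exact recipe used for these models:

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, hard_loss, T=2.0, alpha=0.9):
    """Blend a KL term between temperature-softened teacher and student
    token distributions with the usual cross-entropy on ground-truth
    tokens. Logits are [batch*seq, vocab]; T and alpha are tunable."""
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps gradient scale comparable across temperatures
    return alpha * kd + (1.0 - alpha) * hard_loss
```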

Fair enough. This was surprisingly useful. Honestly? Some things it nailed better than bigger proprietary models. Not sure how much of that intelligence is from distillation / mixing the models and how much is just the base model quality, but I always felt like the base model was lacking. This one seems to be pushing it a bit further and I'm honestly loving it so far. It's not always perfect, but when using this one, I feel like I have a much bigger model on my hands than just 30B A3B and it feels so good lol. That's why I'd love to see what you could do with the 235B model too. I feel like you have a good model recipe there! Heck, maybe you could even combine them both (the big coder + 235B both distilled into 30B A3B)? That would be fun! πŸ˜‰

I am happy you are finding success with the model! That has been the goal: get as close as possible to the large models' performance in a small base model, so you have speed and don't have to spend $5k+ on a rig to run the larger models at 20 tk/s. A double distill would be interesting, but I think it would require distilling the 480B into the 235B and then distilling that into the 30B. I'm not sure how well that would work, but I may look into it.

If you downloaded the Q8 quant, please re-download; the original file I uploaded was the wrong model. I have uploaded the correct one, which will perform much better.

Thanks for the heads up, but I'm only running Q4_K_S; I think that's as high as I can go with my current hardware. I could probably do Q4_K_M, but I guess going higher would significantly slow it down for very little extra quality.

Still, the Q4_K_S was able to one-shot an actually playable Pac-Man clone. That's something you can't do even with some proprietary models!
Funny thing is, the game had a couple of small issues, mostly visual ones, but those couldn't be properly fixed even by Gemini 2.5 Pro, Claude 4.1 Opus, or GPT-5 High, which are currently considered the top proprietary models. This made me believe that this little 30B A3B model is REALLY pushing to be the best it could possibly be, even at Q4_K_S. Still, I wonder if the Q8 could fix it, but again, I can't run a quant that high myself at the moment.

Another thing that helps is the quant process. I haven't looked into the 'why' of it, but MXFP4 really helps with coherence and memory usage.

@lovedheart made an MXFP4 version of Qwen3 32B that I can run at decent speeds on an 8GB VRAM + 32GB RAM system.
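
For context, MXFP4 is the OCP microscaling format: blocks of 32 values share one power-of-two scale, and each value is stored as a 4-bit float (E2M1). Here is an illustrative numpy round-trip of that idea, a sketch rather than any particular library's implementation:

```python
import numpy as np

# Representable magnitudes in FP4 E2M1: 0, 0.5, 1, 1.5, 2, 3, 4, 6.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_roundtrip(x, block=32):
    """Quantize a 1-D array to MXFP4 and back: each block of 32 values
    shares one power-of-two scale, and each value rounds to the nearest
    signed E2M1 grid point."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    for i in range(0, len(x), block):
        chunk = x[i:i + block]
        amax = np.abs(chunk).max()
        # Shared scale: floor(log2(amax)) minus the E2M1 max exponent (2),
        # so the block's largest magnitude lands near the grid max of 6.
        scale = 2.0 ** (np.floor(np.log2(amax)) - 2.0) if amax > 0 else 1.0
        scaled = np.clip(chunk / scale, -6.0, 6.0)
        idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID).argmin(axis=1)
        out[i:i + block] = np.sign(scaled) * E2M1_GRID[idx] * scale
    return out

x = np.random.randn(64)
print("mean abs round-trip error:", np.abs(mxfp4_roundtrip(x) - x).mean())
```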

Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2 : I am running your Q8 and it is a beast. It surprised me so much I came looking for this comments forum so I could tell you. I am not just talking about coding but writing and other tasks. The model performs like it is a much larger model. I have the full Qwen3 Coder 480B at Q6 on another machine so I am able to compare. It has a can-do attitude, excellent composition, and low code error rate. It exhibits a startling degree of "initiative" and intelligence in the way it tackles problems, accepts suggestions, or creates pitch deck slides. Very impressive!

Thank you, I appreciate the kind words!

I uploaded a qwen3-30b-a3b-thinking-2507 distill; I distilled DeepSeek 3.1 into it.

I have been researching which one to start with and found this one on a YouTube channel today. Looking forward to trying it on my maxed-out Mac Studio as the very first local model.

Also, another vote for seeing Qwen3-235B-A22B distilled into it, to keep it in the Qwen family so to speak, although maybe the 480B already supersedes it for code generation? Any chance of BF16 (or is that not necessary)? Also curious whether someone has been able to run the SWE benchmark etc. Planning to use it with ccr as my complete substitute for Claude. Also looking to eventually learn how to make a version that runs better on Macs (MLX) with 1M context as the default.

From what I have read, most people are using glm-4.5-air for coding, but I suspect this one will outdo everything out there. Thank you!

I will also compare with your Qwen3-30B-A3B-Thinking-2507-Deepseek-v3.1-Distill-FP32 and nightmedia/Qwen3-30B-A3B-Thinking-2507-Deepseek-v3.1-Distill-FP32-q8-mlx

Maybe we as a community can come up with the equivalent of the base benchmarks.

Would you be able to release the scripts showing exactly how you did the distillation?
For example your last DeepSeek v3.1 distill: I would love some more detailed instructions; maybe I could even do it myself then.
Another interesting idea (maybe I'll try it): since we now have multiple 30B A3B models with the exact same architecture, we could use https://huggingface.co/blog/mlabonne/merge-models (mergekit). Combining these three models would then, in theory, result in a 30B model that was trained on approx. 45 trillion tokens.

Assuming we merge every big model:
llama3.1 405B -> qwen3 30b3a (<10 trillion)
glm4.5 -> qwen3 30b3a (unknown xx trillion)
deepseek v3.1 -> qwen3 30b3a (unknown xx trillion)
gpt-oss-120B -> qwen3 30b3a (unknown xx trillion)
QwenCoder -> qwen3 30b3a (7.5T, extended from the 15T of Qwen3 235B)
Grok2 -> qwen3 30b3a (trained on xxT tokens)
nvidia nemotron ultra -> qwen3 30b3a (trained on 30T tokens)
ByteDance Seed OSS -> qwen3 30b3a (12T token trained)

After distilling all these models and running benchmarks on each one, we would know which mergekit parameters to use,
and we would get a 30b3a model that was "kinda" trained on 80T tokens!!! In theory it would leave all other models in the dust, lol :-)
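
Since all of these distills share the Qwen3-30B-A3B architecture, the simplest merge really is a weighted average of matching tensors; mergekit's linear method is essentially this, while its TIES/DARE methods are smarter about conflicting deltas. A bare-bones sketch, where the checkpoint paths and weights are hypothetical placeholders:

```python
import torch
from safetensors.torch import load_file, save_file

# Hypothetical local shards of same-architecture distills; the per-model
# weights could come from the benchmark scores suggested above.
checkpoints = {
    "coder-480b-distill.safetensors": 0.5,
    "deepseek-v3.1-distill.safetensors": 0.3,
    "glm-4.5-distill.safetensors": 0.2,
}

merged = {}
for path, w in checkpoints.items():
    for name, t in load_file(path).items():  # identical names/shapes assumed
        merged[name] = merged.get(name, 0) + w * t.float()

save_file({k: v.to(torch.bfloat16) for k, v in merged.items()},
          "merged-30b-a3b.safetensors")
```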

Which YouTube channel did you find it on?

It was in the comments section of this video (watch?v=HQ7dNWqjv7E)

"@jaycampbell2706
11 days ago
Thanks for another great video! I am at the very early stages of testing, but I am having some good results with a variant of Qwen3 Coder called Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2 on LM Studio at Q8.

@akierum
11 days ago
This model beats the qwen3-coder 30b unsloth Q8-XL, qwen3-coder 30b official, and official MistralAi Devstral vesion. Great find, how did you do it? Any other hints of better models and where to find them??"

Hey, thank you so much for your efforts, I really appreciate it! I was super excited about the DeepSeek distill, but for some reason this one did not work so well for me, unlike the previous big coder distill. Could you please share the parameters and even the system prompt (if any) that you're using yourself? Because no matter what I tried, the big coder distill produced noticeably better results (and faster too, since it didn't have to think). πŸ₯Ί

Did you use it for coding-related tasks? The coder distill will work much better for coding, since that model is specialized in coding.

Yes, I tested some coding-related prompts, the same ones I used with the Coder distill. So the DeepSeek distill is not meant to be good at coding? What is it good for then? I'm confused; I see people usually praise DeepSeek for its coding abilities.

The DeepSeek v3.1 distill is made to be a general model, better at all tasks than the base 30B model. I believe the large Qwen3 Coder 480B model is better at coding than the large DeepSeek v3.1 685B model, so the coding-specialized distill will naturally be better for coding than the DeepSeek distill. The goal of the DeepSeek distill is to be better at all tasks and a much smarter all-around model than its 30B base.

Tested a few models for the first time in lmstudio with the same prompt as this one https://forums.macrumors.com/threads/mac-studio-m3-ultra-96gb-28-60-llm-performance.2456559/

Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2 runs at just above 70 tokens/sec, 288 tokens total, 0.26s to first token. The full 262k context, when enabled, uses an extra 22GB of VRAM for a total of 54.70GB.
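
That extra 22GB is roughly what a back-of-the-envelope KV-cache estimate predicts. A quick check, assuming the usual formula and Qwen3-30B-A3B's attention config of 48 layers, 4 KV heads (GQA), and head dim 128; treat those numbers as assumptions and verify them against the model's config.json:

```python
# KV cache = 2 (K and V) * layers * kv_heads * head_dim * context * bytes/elem
layers, kv_heads, head_dim = 48, 4, 128   # assumed; check config.json
context = 262144                          # the 262k context mentioned above

for fmt, bytes_per in [("f16", 2), ("q8_0 (approx.)", 1)]:
    gib = 2 * layers * kv_heads * head_dim * context * bytes_per / 2**30
    print(f"{fmt}: {gib:.0f} GiB")
# f16 gives ~24 GiB, q8_0 ~12 GiB; the observed ~22GB is in the right
# ballpark for an f16 cache plus compute buffers.
```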

Just started using it in Claude Code and it is painfully slow. I need to figure out what's up, or serve it via something else. I don't think the proxy (claude code router) is the bottleneck, since the exact same question also took quite a long time in Claude Code, and I could see "Generating" in the Developer menu spin forever. Worth noting that claude code router runs locally on the Mac while Claude Code runs on a remote Linux node. It still seems to be an issue with the LM Studio + model + Claude Code combo; needs further testing.

opencode (sst, the open source version), on the other hand, is blazing quick, since it supports LM Studio directly in its provider config, and I also ran it locally. I just can't trust it enough to switch over yet.
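
For what it's worth, LM Studio's local server speaks the OpenAI-compatible API, which is why opencode can point at it directly; any OpenAI-style client can hit the same endpoint. A minimal example, where the port (LM Studio's default 1234) and the model id are placeholders to check against LM Studio's server tab:

```python
from openai import OpenAI

# LM Studio's local server; no real API key is required.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="qwen3-coder-30b-a3b-instruct-480b-distill-v2",  # placeholder id
    messages=[{"role": "user", "content": "Write FizzBuzz in Python."}],
    temperature=0.7,
)
print(resp.choices[0].message.content)
```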

you could try the qwen coding agent

Here is llama bench

`llama-bench -m Qwen3-30B-A3B-Instruct-Coder-480B-Distill-v2-Q8_0.gguf`

| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Metal,BLAS |      24 |           pp512 |      2159.17 Β± 11.09 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Metal,BLAS |      24 |           tg128 |         70.79 Β± 0.12 |

`llama-bench -m unsloth_GLM-4.5-Air-GGUF_UD-Q8_K_XL_GLM-4.5-Air-UD-Q8_K_XL-00001-of-00003.gguf`

| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| glm4moe 106B.A12B Q8_0         | 118.96 GiB |   110.47 B | Metal,BLAS |      24 |           pp512 |        604.53 Β± 0.92 |
| glm4moe 106B.A12B Q8_0         | 118.96 GiB |   110.47 B | Metal,BLAS |      24 |           tg128 |         28.87 Β± 0.02 |

`llama-bench -m Qwen3-Coder-480B-A35B-Instruct-UD-Q4_K_XL-00001-of-00006.gguf`

| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3moe ?B Q4_K - Medium      | 256.62 GiB |   480.15 B | Metal,BLAS |      24 |           pp512 |        210.06 Β± 1.38 |
| qwen3moe ?B Q4_K - Medium      | 256.62 GiB |   480.15 B | Metal,BLAS |      24 |           tg128 |         21.48 Β± 0.01 |

Is it possible to use MXFP4 quantization in the future? In theory it stores information better than the similarly sized Q4_K_S quantization, and it has hardware acceleration on Blackwell-based graphics cards.

The Qwen coding agent is faster. I overrode the initial prompt in Claude to match the performance. But it seems that even with all the VRAM available and the Mac Studio's 800 GB/s bandwidth, this is not going to substitute as the core model for agentic work. We might still be quite a while away. It may be enough for programmatic request/response or simple chat interfaces.
