Thanks!

#4 opened by MrDevolver

This is actually pretty useful, thanks!
Any chance to distill Qwen 3 235B into Qwen 3 30B A3B too? πŸ€”

I may do this in the future; right now I'm seeing what other optimizations can be done in the distillation process to transfer more knowledge into the student model.
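
For anyone curious what that process looks like in code, below is a minimal sketch of logit-space knowledge distillation in PyTorch. This is the textbook Hinton-style soft-target loss, not necessarily the exact recipe used for these models:

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, hard_loss, T=2.0, alpha=0.9):
    """Blend a KL term between temperature-softened teacher and student
    token distributions with the usual cross-entropy on ground-truth
    tokens. Logits are [batch*seq, vocab]; T and alpha are tunable."""
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps gradient scale comparable across temperatures
    return alpha * kd + (1.0 - alpha) * hard_loss
```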

Fair enough. This was surprisingly useful. Honestly? Some things it nailed better than bigger proprietary models. Not sure how much of that intelligence is from distillation / mixing the models and how much is just the base model quality, but I always felt like the base model was lacking. This one seems to be pushing it a bit further and I'm honestly loving it so far. It's not always perfect, but when using this one, I feel like I have a much bigger model on my hands than just 30B A3B and it feels so good lol. That's why I'd love to see what you could do with the 235B model too. I feel like you have a good model recipe there! Heck, maybe you could even combine them both (the big coder + 235B both distilled into 30B A3B)? That would be fun! πŸ˜‰

I am happy you are finding success with the model! That has been the goal: get as close as possible to the large models' performance in a small base model, so you have speed and don't have to spend $5k+ on a rig to run the larger models at 20 tk/s. A double distill would be interesting, but I think it would require distilling the 480B into the 235B and then distilling that into the 30B. I'm not sure how well that would work, but I may look into it.

If you downloaded the Q8 quant, please re-download; the original file I uploaded was the wrong model. I have uploaded the correct one, which will perform much better.

Thanks for the heads up, but I'm only running Q4_K_S; I think that's as high as I can go with my current hardware. I could probably do Q4_K_M, but I guess going higher would significantly slow it down for very little extra quality.

Still, the Q4_K_S was able to one-shot an actually playable Pac-Man clone. That's something you can't do even with some proprietary models!
Funny thing is, the game had a couple of small issues, mostly visual ones, but those couldn't be properly fixed even by Gemini 2.5 Pro, Claude 4.1 Opus, or GPT-5 High, which are currently considered the top proprietary models. This made me believe that this little 30B A3B model is REALLY pushing to be the best it could possibly be, even at Q4_K_S. Still, I wonder if the Q8 could fix it, but again, I can't run a quant that high myself at the moment.

Another thing that helps is the quant process. I haven't looked into the 'why' of it, but MXFP4 really helps with coherence and memory usage.

@lovedheart made an MXFP4 version of Qwen3 32B that I can run at decent speeds on an 8GB VRAM + 32GB RAM system.
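
For context, MXFP4 is the OCP microscaling format: blocks of 32 values share one power-of-two scale, and each value is stored as a 4-bit float (E2M1). Here is an illustrative numpy round-trip of that idea, a sketch rather than any particular library's implementation:

```python
import numpy as np

# Representable magnitudes in FP4 E2M1: 0, 0.5, 1, 1.5, 2, 3, 4, 6.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def mxfp4_roundtrip(x, block=32):
    """Quantize a 1-D array to MXFP4 and back: each block of 32 values
    shares one power-of-two scale, and each value rounds to the nearest
    signed E2M1 grid point."""
    x = np.asarray(x, dtype=np.float64)
    out = np.empty_like(x)
    for i in range(0, len(x), block):
        chunk = x[i:i + block]
        amax = np.abs(chunk).max()
        # Shared scale: floor(log2(amax)) minus the E2M1 max exponent (2),
        # so the block's largest magnitude lands near the grid max of 6.
        scale = 2.0 ** (np.floor(np.log2(amax)) - 2.0) if amax > 0 else 1.0
        scaled = np.clip(chunk / scale, -6.0, 6.0)
        idx = np.abs(np.abs(scaled)[:, None] - E2M1_GRID).argmin(axis=1)
        out[i:i + block] = np.sign(scaled) * E2M1_GRID[idx] * scale
    return out

x = np.random.randn(64)
print("mean abs round-trip error:", np.abs(mxfp4_roundtrip(x) - x).mean())
```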

Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2 : I am running your Q8 and it is a beast. It surprised me so much I came looking for this comments forum so I could tell you. I am not just talking about coding but writing and other tasks. The model performs like it is a much larger model. I have the full Qwen3 Coder 480B at Q6 on another machine so I am able to compare. It has a can-do attitude, excellent composition, and low code error rate. It exhibits a startling degree of "initiative" and intelligence in the way it tackles problems, accepts suggestions, or creates pitch deck slides. Very impressive!

Thank you, I appreciate the kind words!

I uploaded a qwen3-30b-a3b-thinking-2507 distill; I distilled DeepSeek 3.1 into it.

I have been researching which one to start with and found this one on a YouTube channel today. Looking forward to trying it on my maxed-out Mac Studio as the very first local model.

Also, another vote for seeing Qwen3-235B-A22B distilled into it, to keep it in the Qwen family so to speak, although maybe the 480B already supersedes it for code generation? Any chance of BF16 (or is that not necessary)? Also curious whether someone has been able to run the SWE benchmark etc. Planning to use it with ccr as my complete substitute for Claude. Also looking to eventually learn how to make a version that runs better on Macs (MLX) with 1M context as the default.

From what I have read, most people are using glm-4.5-air for coding, but I suspect this one will outdo everything out there. Thank you!

I will also compare with your Qwen3-30B-A3B-Thinking-2507-Deepseek-v3.1-Distill-FP32 and nightmedia/Qwen3-30B-A3B-Thinking-2507-Deepseek-v3.1-Distill-FP32-q8-mlx

Maybe we as a community can come up with the equivalent of the base benchmarks.

Would you be able to release the scripts showing exactly how you did the distillation?
For example your last DeepSeek v3.1 distill: I would love some more detailed instructions; maybe I could even do it myself then.
Another interesting idea (maybe I'll try it): since we now have multiple 30B A3B models with the exact same architecture, we could use https://huggingface.co/blog/mlabonne/merge-models (mergekit). Combining these three models would then, in theory, result in a 30B model that was trained on approx. 45 trillion tokens.

Assuming we merge every big model:
llama3.1 405B -> qwen3 30b3a (<10 trillion)
glm4.5 -> qwen3 30b3a (unknown xx trillion)
deepseek v3.1 -> qwen3 30b3a (unknown xx trillion)
gpt-oss-120B -> qwen3 30b3a (unknown xx trillion)
QwenCoder -> qwen3 30b3a (7.5T, extended from the 15T of Qwen3 235B)
Grok2 -> qwen3 30b3a (trained on xxT tokens)
nvidia nemotron ultra -> qwen3 30b3a (trained on 30T tokens)
ByteDance Seed OSS -> qwen3 30b3a (12T token trained)

After distilling all these models and running benchmarks on each one, we would know which mergekit parameters to use,
and we would get a 30b3a model that was "kinda" trained on 80T tokens!!! In theory it would leave all other models in the dust, lol :-)
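
Since all of these distills share the Qwen3-30B-A3B architecture, the simplest merge really is a weighted average of matching tensors; mergekit's linear method is essentially this, while its TIES/DARE methods are smarter about conflicting deltas. A bare-bones sketch, where the checkpoint paths and weights are hypothetical placeholders:

```python
import torch
from safetensors.torch import load_file, save_file

# Hypothetical local shards of same-architecture distills; the per-model
# weights could come from the benchmark scores suggested above.
checkpoints = {
    "coder-480b-distill.safetensors": 0.5,
    "deepseek-v3.1-distill.safetensors": 0.3,
    "glm-4.5-distill.safetensors": 0.2,
}

merged = {}
for path, w in checkpoints.items():
    for name, t in load_file(path).items():  # identical names/shapes assumed
        merged[name] = merged.get(name, 0) + w * t.float()

save_file({k: v.to(torch.bfloat16) for k, v in merged.items()},
          "merged-30b-a3b.safetensors")
```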

Which YouTube channel did you find it on?

It was in the comments section of this video (watch?v=HQ7dNWqjv7E)

"@jaycampbell2706
11 days ago
Thanks for another great video! I am at the very early stages of testing, but I am having some good results with a variant of Qwen3 Coder called Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2 on LM Studio at Q8.

@akierum
11 days ago
This model beats the qwen3-coder 30b unsloth Q8-XL, qwen3-coder 30b official, and official MistralAi Devstral vesion. Great find, how did you do it? Any other hints of better models and where to find them??"

Hey, thank you so much for your efforts, I really appreciate it! I was super excited about the DeepSeek distill, but for some reason this one did not work so well for me, unlike the previous big coder distill. Could you please share the parameters and even the system prompt (if any) that you're using yourself? Because no matter what I tried, the big coder distill produced noticeably better results (and faster too, since it didn't have to think). πŸ₯Ί

Did you use it for coding-related tasks? The coder distill will work much better for coding, since that model is specialized in coding.

Yes, I tested some coding-related prompts, the same ones I used with the Coder distill. So the DeepSeek distill is not meant to be good at coding? What is it good for then? I'm confused; I see people usually praise DeepSeek for its coding abilities.

The DeepSeek v3.1 distill is made to be a general model, better at all tasks than the base 30B model. I believe the large Qwen3 Coder 480B model is better at coding than the large DeepSeek v3.1 685B model, so the coding-specialized distill will naturally be better for coding than the DeepSeek distill. The goal of the DeepSeek distill is to be better at all tasks and a much smarter all-around model than its 30B base.

Tested a few models for the first time in lmstudio with the same prompt as this one https://forums.macrumors.com/threads/mac-studio-m3-ultra-96gb-28-60-llm-performance.2456559/

Qwen3-Coder-30B-A3B-Instruct-480B-Distill-V2 runs at just above 70 tokens/sec, 288 tokens total, 0.26s to first token. The full 262k context, when enabled, uses an extra 22GB of VRAM for a total of 54.70GB.
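
That extra 22GB is roughly what a back-of-the-envelope KV-cache estimate predicts. A quick check, assuming the usual formula and Qwen3-30B-A3B's attention config of 48 layers, 4 KV heads (GQA), and head dim 128; treat those numbers as assumptions and verify them against the model's config.json:

```python
# KV cache = 2 (K and V) * layers * kv_heads * head_dim * context * bytes/elem
layers, kv_heads, head_dim = 48, 4, 128   # assumed; check config.json
context = 262144                          # the 262k context mentioned above

for fmt, bytes_per in [("f16", 2), ("q8_0 (approx.)", 1)]:
    gib = 2 * layers * kv_heads * head_dim * context * bytes_per / 2**30
    print(f"{fmt}: {gib:.0f} GiB")
# f16 gives ~24 GiB, q8_0 ~12 GiB; the observed ~22GB is in the right
# ballpark for an f16 cache plus compute buffers.
```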

Just started using it in Claude Code and it is painfully slow. I need to figure out what's up, or serve it via something else. I don't think the proxy (claude code router) is the bottleneck, since the exact same question also took quite a long time in Claude Code, and I could see "Generating" in the Developer menu spin forever. Worth noting that claude code router runs locally on the Mac while Claude Code runs on a remote Linux node. It still seems to be an issue with the LM Studio + model + Claude Code combo; needs further testing.

opencode (sst, the open source version), on the other hand, is blazing quick, since it supports LM Studio directly in its provider config, and I also ran it locally. I just can't trust it enough to switch over yet.
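
For what it's worth, LM Studio's local server speaks the OpenAI-compatible API, which is why opencode can point at it directly; any OpenAI-style client can hit the same endpoint. A minimal example, where the port (LM Studio's default 1234) and the model id are placeholders to check against LM Studio's server tab:

```python
from openai import OpenAI

# LM Studio's local server; no real API key is required.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

resp = client.chat.completions.create(
    model="qwen3-coder-30b-a3b-instruct-480b-distill-v2",  # placeholder id
    messages=[{"role": "user", "content": "Write FizzBuzz in Python."}],
    temperature=0.7,
)
print(resp.choices[0].message.content)
```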

you could try the qwen coding agent

Here is llama bench

`llama-bench -m Qwen3-30B-A3B-Instruct-Coder-480B-Distill-v2-Q8_0.gguf`

| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Metal,BLAS |      24 |           pp512 |      2159.17 Β± 11.09 |
| qwen3moe 30B.A3B Q8_0          |  30.25 GiB |    30.53 B | Metal,BLAS |      24 |           tg128 |         70.79 Β± 0.12 |

`llama-bench -m unsloth_GLM-4.5-Air-GGUF_UD-Q8_K_XL_GLM-4.5-Air-UD-Q8_K_XL-00001-of-00003.gguf`

| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| glm4moe 106B.A12B Q8_0         | 118.96 GiB |   110.47 B | Metal,BLAS |      24 |           pp512 |        604.53 Β± 0.92 |
| glm4moe 106B.A12B Q8_0         | 118.96 GiB |   110.47 B | Metal,BLAS |      24 |           tg128 |         28.87 Β± 0.02 |

`llama-bench -m Qwen3-Coder-480B-A35B-Instruct-UD-Q4_K_XL-00001-of-00006.gguf`

| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3moe ?B Q4_K - Medium      | 256.62 GiB |   480.15 B | Metal,BLAS |      24 |           pp512 |        210.06 Β± 1.38 |
| qwen3moe ?B Q4_K - Medium      | 256.62 GiB |   480.15 B | Metal,BLAS |      24 |           tg128 |         21.48 Β± 0.01 |

Is it possible to use MXFP4 quantization in the future? In theory it stores information better than the similarly sized Q4_K_S quantization, and it has hardware acceleration on Blackwell-based graphics cards.

The Qwen coding agent is faster. I overrode the initial prompt in Claude to match the performance. But it seems that even with all the VRAM available and the Mac Studio's 800 GB/s bandwidth, this is not going to substitute as the core model for agentic work. We might still be quite a while away. It may be enough for programmatic request/response or simple chat interfaces.
