Any value in static quants?

#688
by SuperNintendoChalmers - opened

This isn't a model request, but a discussion.

Your team provides an extremely valuable service to the LLM community. Many people do not have the knowledge or ability to generate GGUF quants from safetensors themselves. It's always great to find a new model to try and see that a high quality quant team like bartowski or mradermacher has already been and done the hard work.

However, this is also a huge job for you. Even with your obvious automation, compute and storage are finite, and people are making more and more models every day. The community would like your work to be as sustainable as possible in terms of both workload and resources.

You are obviously aware that imatrix quants are typically of higher quality than static quants. The usual generalisation is that you get "one quant higher" quality by using imatrix. As far as I understand, there is little to no disadvantage to imatrix.

Considering imat seems universally better, is there any value in providing static quants anymore?

Avoiding static quants and only producing imatrix quants could reduce your compute and storage requirements by as much as half. The only static quant people are really interested in is Q8_0. Dropping half the resources could also free you up to produce more valuable quants like Qn_K_L, which help people eke out a little more quality, or Qn_K_S, which help to fit a bigger quant on a GPU than plain Qn_K can.

This is how other popular quantizers like bartowski have operated for a long time. It would be reasonable for your team to do this as well.

Hopefully this is a useful topic for you. Thank you for your wonderful work and all the best!

Your team provides an extremely valuable service to the LLM community. Many people do not have the knowledge or ability to generate GGUF quants from safetensors themselves. It's always great to find a new model to try and see that a high quality quant team like bartowski or mradermacher has already been and done the hard work.

Thanks a lot. I highly appreciate this.

However, this is also a huge job for you. Even with your obvious automation, compute and storage are finite, and people are making more and more models every day. The community would like your work to be as sustainable as possible in terms of both workload and resources.

Doing all this work indeed requires a ton of time and resources. We currently have 7 servers almost solely working on this. We have uploaded petabytes of quants and are currently the largest HuggingFace uploader, both by the number of models and by combined size. I wouldn't say people are making more and more models every day; in fact, models seem to come in waves and appear to depend on the season, with most models getting released in the first quarter of the year. We are currently keeping up with quantizing all the high-quality models being released, but if there were slightly more, they would queue up. In case you wonder why the queue is so large: this is because we are actually going back in time and quantizing all the high-quality models ever released. mradermacher personally hand-selects every model worth quantizing, which requires a massive amount of time.

You are obviously aware that imatrix quants are typically of higher quality than static quants. The usual generalisation is that you get "one quant higher" quality by using imatrix. As far as I understand, there is little to no disadvantage to imatrix.
Considering imat seems universally better, is there any value in providing static quants anymore?

I would say it is wrong to claim that imatrix quants have no disadvantages. They do have them if you use the model for things not covered by our imatrix training set. This mainly affects the quality of non-English languages, as our imatrix dataset consists mostly of English training data.

Avoiding static quants and only producing imatrix quants could reduce your compute and storage requirements by as much as half.

Not really. We produce twice as many imatrix quants as static quants: 24 imatrix quants but only 12 static ones.

Keep in mind that the cost of generating imatrix quants is also much greater than for static quants. To provide imatrix quants we first have to compute the imatrix. This is all done on my nico1 node using 512 GiB of RAM and 2x RTX 4090 GPUs. We are quite perfectionistic, so our training dataset not only includes what bartowski uses but also contains a lot of additional proprietary high-quality data, which gives us higher quality imatrix quants than his. However, this also means imatrix computation requires more GPU resources.
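For readers who have never run this pipeline themselves, here is a minimal sketch of the two steps using stock llama.cpp tools (llama-imatrix and llama-quantize); the file names and settings are placeholders, not our actual automation:

```python
# Minimal sketch of the two-step imatrix pipeline with stock llama.cpp tools.
# File names and the -ngl value are placeholders; this is not our real automation.
import subprocess

MODEL_F16 = "model-f16.gguf"        # full-precision GGUF conversion of the model
CALIB_TXT = "imatrix-training.txt"  # calibration text the imatrix is computed from
IMATRIX = "imatrix.dat"

# Step 1: compute the importance matrix. This is the expensive GPU step;
# -ngl offloads as many layers as possible to the GPUs.
subprocess.run(
    ["./llama-imatrix", "-m", MODEL_F16, "-f", CALIB_TXT, "-o", IMATRIX, "-ngl", "99"],
    check=True,
)

# Step 2: produce a weighted (i1) quant that uses the importance matrix.
subprocess.run(
    ["./llama-quantize", "--imatrix", IMATRIX, MODEL_F16, "model-i1-Q4_K_M.gguf", "Q4_K_M"],
    check=True,
)
```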

The only static quant people are really interested in is Q8_0.

Q8_0 is mostly placebo. Users just have the feeling of running the model in higher quality, but measurements showed that the difference is way too small to have any meaningful real-world impact. I see no reason why anyone would run the slower Q8_0 over i1-Q6_K. Some people think they can hear the difference between FLAC and 320 kbps AAC; the same applies to Q8_0, but it is most likely all in your mind. As a general rule I would only use Q8_0 for tiny models of 1B or smaller. For 1B to 8B I recommend i1-Q6_K, and for larger models i1-Q5_K_M, or whatever the largest quant is that you can run when sorting them by quality based on https://hf.tst.eu/model#DeepSeek-R1-Distill-Qwen-14B-Uncensored-GGUF.
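To make that rule of thumb concrete, here is a small sketch; the bits-per-weight numbers are rough, commonly cited approximations, not values from our measurements:

```python
# Sketch of the rule of thumb above. The bits-per-weight values are rough,
# commonly cited approximations, not taken from our measurements.
APPROX_BPW = {"Q8_0": 8.5, "Q6_K": 6.56, "Q5_K_M": 5.69}

def recommended_quant(params_b: float) -> str:
    """Pick a quant following the rule of thumb in this post."""
    if params_b <= 1:
        return "Q8_0"        # tiny models: quantization error hurts the most here
    if params_b <= 8:
        return "i1-Q6_K"     # practically indistinguishable from Q8_0, but faster
    return "i1-Q5_K_M"       # or the largest quant you can run, sorted by quality

def approx_size_gib(params_b: float, quant: str) -> float:
    """File size is roughly parameters * bits-per-weight / 8."""
    bpw = APPROX_BPW[quant.removeprefix("i1-")]
    return params_b * 1e9 * bpw / 8 / 2**30

for size_b in (0.5, 7, 14, 70):
    quant = recommended_quant(size_b)
    print(f"{size_b:>4}B -> {quant:<10} (~{approx_size_gib(size_b, quant):.1f} GiB)")
```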

Dropping half the resources could also free you up to produce more valuable quants like Qn_K_L, which help people eke out a little more quality, or Qn_K_S, which help to fit a bigger quant on a GPU than plain Qn_K can.
This is how other popular quantizers like bartowski have operated for a long time. It would be reasonable for your team to do this as well.

We always carefully evaluate which quants we do. In fact, I spent multiple months measuring both the quality and the performance of all the different quants. I'm quite confident that I did the most in-depth comparison between quants, as I spent well over 500 GPU hours measuring and benchmarking all of them. Please take a look at the quality column on https://hf.tst.eu/model#DeepSeek-R1-Distill-Qwen-14B-Uncensored-GGUF and you will realize how useless Qn_K_L quants are. If you want to take a look at the raw data, you can download it from https://www.nicobosshard.ch/LLM-Eval_Quality_v1.tar.zst for quality and https://www.nicobosshard.ch/perfData.zip for performance measurements. You can find a lot of nice visualizations of this data at https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/2 (and other BabyHercules discussions).
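If you want to run a smaller-scale comparison yourself, one common approach (not necessarily the exact methodology behind the linked quality column) is to compare each quant against the full-precision model with the KL-divergence mode of llama.cpp's perplexity tool, roughly like this:

```python
# Sketch of one generic way to compare a quant against the full-precision model
# using llama.cpp's KL-divergence mode. This is not necessarily how the linked
# quality column was produced.
import subprocess

F16 = "model-f16.gguf"
QUANT = "model-i1-Q4_K_M.gguf"
TEXT = "eval-text.txt"      # any held-out evaluation text
BASE = "base-logits.dat"    # logits saved from the full-precision run

# Pass 1: save the base model's logits on the evaluation text.
subprocess.run(
    ["./llama-perplexity", "-m", F16, "-f", TEXT, "--kl-divergence-base", BASE],
    check=True,
)

# Pass 2: evaluate the quant against those logits; the tool prints the
# mean KL divergence and related statistics.
subprocess.run(
    ["./llama-perplexity", "-m", QUANT, "--kl-divergence-base", BASE, "--kl-divergence"],
    check=True,
)
```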

Hopefully this is a useful topic for you.

This is indeed a very interesting topic that we have discussed many times in the past, and it is for sure worth revisiting from time to time.

Thank you for your wonderful work and all the best!

Thanks for providing your thoughts! I wish you the best as well.

Thanks for your kind, detailed, and informative response. I learnt some cool things here. Much appreciated!

I wasn't aware your imatrix dataset was further improved over bartowski's. I will replace as many of my random GGUF downloads as I can with yours.

You are obviously well on top of this topic so I will close this discussion. Have a great day!

SuperNintendoChalmers changed discussion status to closed