arxiv:2503.01483

KurTail: Kurtosis-based LLM Quantization

Published on Mar 3, 2025
Abstract

One of the challenges of quantizing a large language model (LLM) is the presence of outliers. Outliers often make uniform quantization schemes less effective, particularly in extreme cases such as 4-bit quantization. We introduce KurTail, a new post-training quantization (PTQ) scheme that leverages Kurtosis-based rotation to mitigate outliers in the activations of LLMs. Our method optimizes Kurtosis as a measure of tailedness. This approach enables the quantization of weights, activations, and the KV cache in 4 bits. We utilize layer-wise optimization, ensuring memory efficiency. KurTail outperforms existing quantization methods, offering a 13.3% boost in MMLU accuracy and a 15.5% drop in Wiki perplexity compared to QuaRot. It also outperforms SpinQuant with a 2.6% MMLU gain and a 2.9% reduction in perplexity, all while reducing the training cost. For comparison, learning the rotation using SpinQuant for Llama3-70B requires at least four NVIDIA H100 80GB GPUs, whereas our method requires only a single GPU, making it a more accessible solution for consumer GPUs.
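The core idea lends itself to a compact illustration. Below is a minimal, self-contained sketch (not the authors' released code) of kurtosis-driven rotation for quantization: an orthogonal rotation of the activations is learned by minimizing their kurtosis, and a symmetric 4-bit uniform quantizer is then applied to the rotated activations. The helper names (`kurtosis`, `LearnedRotation`, `uniform_quantize`), the toy data, and the exact optimization objective are illustrative assumptions rather than the paper's implementation.

```python
# Illustrative sketch: learn an orthogonal rotation R that minimizes the
# kurtosis ("tailedness") of rotated activations, so a uniform 4-bit
# quantizer sees fewer outliers. Names, data, and objective are assumptions,
# not the authors' code.
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal


def kurtosis(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Average kurtosis of activation rows (fourth moment of standardized values)."""
    mu = x.mean(dim=-1, keepdim=True)
    sigma = x.std(dim=-1, keepdim=True)
    z = (x - mu) / (sigma + eps)
    return (z ** 4).mean()


class LearnedRotation(nn.Module):
    """Applies an orthogonal rotation to activations; orthogonality is enforced
    by PyTorch's orthogonal parametrization throughout training."""

    def __init__(self, dim: int):
        super().__init__()
        self.rot = nn.Linear(dim, dim, bias=False)
        orthogonal(self.rot)  # constrains rot.weight to stay orthogonal

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.rot(x)


def uniform_quantize(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric per-tensor uniform quantization (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max() / qmax
    return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale


# Toy layer-wise optimization on random "activations" with a heavy outlier channel.
torch.manual_seed(0)
dim = 64
acts = torch.randn(512, dim)
acts[:, 0] *= 20.0  # inject an outlier channel

rotation = LearnedRotation(dim)
opt = torch.optim.Adam(rotation.parameters(), lr=1e-2)

for step in range(200):
    opt.zero_grad()
    loss = kurtosis(rotation(acts))  # minimize tailedness of rotated activations
    loss.backward()
    opt.step()

with torch.no_grad():
    err_plain = (uniform_quantize(acts) - acts).pow(2).mean()
    rotated = rotation(acts)
    err_rot = (uniform_quantize(rotated) - rotated).pow(2).mean()
    print(f"4-bit MSE without rotation: {err_plain:.4f}, with rotation: {err_rot:.4f}")
```

Keeping the rotation orthogonal means it preserves the information in the activations and can, in principle, be folded into adjacent weight matrices at inference time, as rotation-based PTQ methods such as QuaRot and SpinQuant do.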
