Add imatrix computing tips

#745
by treehugg3 - opened

As a follow-up to https://huggingface.co/mradermacher/Meta-Llama-3.1-70B-i1-GGUF/discussions/1 and, hopefully, as partial redemption for not having located the FAQ earlier, here are some extra imatrix computing tips that would be nice to add:

Computing Imatrix Files for Large Models

Hardware

  • RAM: A lot of RAM is required to compute imatrix files. Example: 512 GB is just enough to compute the 405B imatrix in Q8.
  • GPU: At least 8 GB of VRAM.
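As a rough sketch, the usual workflow with llama.cpp is to quantize the model to Q8_0 first (halving the memory footprint versus BF16) and then run `llama-imatrix` over a calibration text. File names and the `-ngl` value below are placeholders, not the exact commands used for these quants:

```shell
# Quantize the BF16 GGUF down to Q8_0 to roughly halve the RAM needed
# (paths are hypothetical examples):
llama-quantize Meta-Llama-3.1-405B-bf16.gguf Meta-Llama-3.1-405B-Q8_0.gguf Q8_0

# Compute the imatrix over a calibration corpus; -ngl offloads some
# layers to the GPU, tune it to your available VRAM:
llama-imatrix -m Meta-Llama-3.1-405B-Q8_0.gguf \
  -f calibration.txt -o imatrix.dat -ngl 10
```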

Extra tips

  • Computing the 405B imatrix in Q8 does not appear to degrade quality noticeably compared to BF16, so to save on hardware requirements, use Q8.
  • Sometimes a single node does not have enough RAM to compute the imatrix file. In such cases, the RPC backend in llama.cpp (llama-rpc) can be used to pool the RAM/VRAM of multiple nodes. This approach takes longer: computing the 405B imatrix file in BF16 takes around 20 hours using 3 nodes with 512 GB, 256 GB, and 128 GB of RAM, compared to 4 hours for Q8 on a single node.
mradermacher changed discussion status to closed