treehugg3 committed
Commit 0428d78 · verified · Parent: 193aef9

Add imatrix computing tips


As a follow-up to https://huggingface.co/mradermacher/Meta-Llama-3.1-70B-i1-GGUF/discussions/1 and, hopefully, partial redemption for not having located the FAQ before, here are some extra imatrix computing tips that would be nice to add.

Files changed (1)
  1. README.md +15 -0
README.md CHANGED
@@ -142,6 +142,21 @@ and then run another command which handles download/computation/upload. Most of
 to do stuff when things go wrong (which, with llama.cpp being so buggy and hard to use,
 is unfortunately very frequent).

+ ## What do I need to do to compute imatrix files for large models?
+
+ ### Hardware
+
+ * RAM: A lot of RAM is required to compute imatrix files. For example, 512 GB is just enough to compute the 405B imatrix on a Q8 quant of the model.
+ * GPU: At least 8 GB of VRAM.
+
+ ### Extra tips
+
+ * Computing the 405B imatrix on a Q8 quant does not seem to have any noticeable quality impact compared to BF16, so to save on hardware
+ requirements, use Q8.
+ * Sometimes, a single node may not have enough RAM to compute the imatrix file. In such cases, `llama-rpc` inside llama.cpp can
+ be used to combine the RAM/VRAM of multiple nodes. This approach takes longer: computing the 405B imatrix in BF16 takes
+ around 20 hours using 3 nodes with 512 GB, 256 GB, and 128 GB of RAM, compared to 4 hours for Q8 on a single node.
+
 ## Why don't you use gguf-split?

 TL;DR: I don't have the hardware/resources for that.
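
As a rough illustration of the Q8 tip added above, here is a minimal sketch of preparing a Q8_0 quant and computing an imatrix with llama.cpp's `llama-quantize` and `llama-imatrix` tools. The binary names and flags match recent llama.cpp builds but may differ between versions, and all file names are placeholders:

```sh
# Quantize the BF16 GGUF to Q8_0 first so the imatrix run needs far less RAM
# (file names are placeholders).
./llama-quantize Model-405B-BF16.gguf Model-405B-Q8_0.gguf Q8_0

# Compute the imatrix on the Q8 quant: -f points at the calibration text,
# -o names the output imatrix file, -ngl offloads a few layers to the GPU.
./llama-imatrix -m Model-405B-Q8_0.gguf -f calibration-data.txt \
    -o Model-405B.imatrix -ngl 10
```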
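
For the multi-node tip, the RPC setup could look roughly like the sketch below. This assumes llama.cpp was built with RPC support (`-DGGML_RPC=ON`) and that the RPC backend is exposed through the `rpc-server` binary and the `--rpc` flag of the main tools, as in recent builds; hostnames, ports, and file names are made up for illustration:

```sh
# On each helper node: expose its RAM/VRAM to the main node via an RPC server
# (host/port are placeholders).
./rpc-server --host 0.0.0.0 --port 50052

# On the main node: list the remote backends so the model is spread across
# the combined memory of all nodes while the imatrix is computed.
./llama-imatrix -m Model-405B-BF16.gguf -f calibration-data.txt \
    -o Model-405B.imatrix --rpc node2:50052,node3:50052
```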