Update README.md
4 Bit Quantized version of Microsoft's Phi-3 Mini 128k: https://huggingface.co/microsoft/Phi-3-mini-128k-instruct

Quantized the model with Hugging Face's 🤗 GPTQQuantizer; a sketch of the quantization call is shown below.
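A minimal sketch of how such a run might look with 🤗 Optimum's `GPTQQuantizer`; the calibration dataset (`"c4"`), group size, and output directory are illustrative assumptions, not a record of the exact settings used here:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer

model_id = "microsoft/Phi-3-mini-128k-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# 4-bit GPTQ quantization; "c4" is an assumed calibration dataset choice.
quantizer = GPTQQuantizer(bits=4, dataset="c4", group_size=128)
quantized_model = quantizer.quantize_model(model, tokenizer)

# Assumed output path for the quantized checkpoint.
quantizer.save(quantized_model, "phi-3-mini-128k-gptq-4bit")
```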
### Flash Attention
The Phi-3 family supports Flash Attention 2, a mechanism that allows for faster inference with lower resource use.

When quantizing Phi-3 on a 4090 (24 GB) with Flash Attention disabled, quantization would fail due to insufficient VRAM.

Enabling Flash Attention allowed quantization to complete with an extra 10 gigabytes of VRAM available on the GPU.
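For reference, a minimal sketch of enabling Flash Attention 2 at load time via the standard 🤗 Transformers `attn_implementation` flag (the `flash-attn` package must be installed separately and requires a supported GPU):

```python
import torch
from transformers import AutoModelForCausalLM

# Request the Flash Attention 2 kernels when loading the model.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```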
### Metrics

###### Total Size: