Granther committed
Commit 26ba9c4 · verified · 1 Parent(s): f37ef52

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -10,10 +10,10 @@ pipeline_tag: text-generation
 - 4 Bit Quantized version of Microsoft's Phi3 Mini 128k: https://huggingface.co/microsoft/Phi-3-mini-128k-instruct
 - Quantized the model with HuggingFace's 🤗 GPTQQuantizer
 
-### Phi3 Flash Attention
+### Flash Attention
 - The Phi3 family supports Flash Attention 2, a mechanism that allows for faster inference with lower resource use.
-- When quantizing Phi3 on a 4090 with Flash Attention disabled the 24 Gigs of VRAM, the VRAM would be maxed out, causing quantizing to fail.
-- Enabling Flash Attention allowed quantizing to complete with an extra 10 Giagbaytes of VRAM left on the GPU
+- When quantizing Phi3 on a 4090 (24 GB) with Flash Attention disabled, quantization would fail due to insufficient VRAM.
+- Enabling Flash Attention allowed quantization to complete with an extra 10 gigabytes of VRAM available on the GPU.
 
 ### Metrics
 ###### Total Size:
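
For context, here is a minimal sketch of the kind of run the README describes, using optimum's `GPTQQuantizer` with Flash Attention 2 enabled at load time. The calibration dataset, sequence length, and output directory below are illustrative assumptions; the actual quantization script is not part of this commit.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer

model_id = "microsoft/Phi-3-mini-128k-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Loading in fp16 with Flash Attention 2 lowers peak activation memory
# during calibration -- this is what lets the job fit on a 24 GB RTX 4090.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

# 4-bit GPTQ; "c4" is a built-in calibration-set option in optimum, and
# the sequence length here is an assumed setting, not taken from this commit.
quantizer = GPTQQuantizer(bits=4, dataset="c4", model_seqlen=2048)
quantized_model = quantizer.quantize_model(model, tokenizer)

# Hypothetical output directory.
quantizer.save(quantized_model, "phi3-mini-128k-gptq-4bit")
tokenizer.save_pretrained("phi3-mini-128k-gptq-4bit")
```

Without `attn_implementation="flash_attention_2"`, the same calibration pass materializes full attention matrices over long sequences, which is consistent with the out-of-memory failure described above.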