Update README.md
4 Bit Quantized version of Microsoft's Phi-3 Mini 128k: https://huggingface.co/microsoft/Phi-3-mini-128k-instruct

Quantized the model with Hugging Face's 🤗 GPTQQuantizer; a sketch of the quantization call is shown below.
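A minimal sketch of how such a run might look with 🤗 Optimum's `GPTQQuantizer`; the calibration dataset (`"c4"`), group size, and output directory are illustrative assumptions, not a record of the exact settings used here:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.gptq import GPTQQuantizer

model_id = "microsoft/Phi-3-mini-128k-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# 4-bit GPTQ quantization; "c4" is an assumed calibration dataset choice.
quantizer = GPTQQuantizer(bits=4, dataset="c4", group_size=128)
quantized_model = quantizer.quantize_model(model, tokenizer)

# Assumed output path for the quantized checkpoint.
quantizer.save(quantized_model, "phi-3-mini-128k-gptq-4bit")
```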
### Flash Attention
The Phi-3 family supports Flash Attention 2, a mechanism that allows for faster inference with lower resource use.

When quantizing Phi-3 on a 4090 (24 GB) with Flash Attention disabled, quantization would fail due to insufficient VRAM.

Enabling Flash Attention allowed quantization to complete with an extra 10 gigabytes of VRAM available on the GPU.
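For reference, a minimal sketch of enabling Flash Attention 2 at load time via the standard 🤗 Transformers `attn_implementation` flag (the `flash-attn` package must be installed separately and requires a supported GPU):

```python
import torch
from transformers import AutoModelForCausalLM

# Request the Flash Attention 2 kernels when loading the model.
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
```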
### Metrics

###### Total Size: