---
license: mit
---

Here are some quantized VibeVoice 7B models, in both 8-bit and 4-bit versions, along with some simple Python code to test them out.

## Model Sizes

| Model Version | Size | Memory Usage | Quality |
|---------------|------|--------------|---------|
| Original (fp16/bf16) | 18GB | ~18GB VRAM | Best |
| 8-bit Quantized | 9.9GB | ~10.6GB VRAM | Excellent |
| 4-bit Quantized (nf4) | 6.2GB | ~6.6GB VRAM | Very Good |

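Not sure which variant fits? A minimal sketch for checking your GPU's VRAM with PyTorch (it assumes a single CUDA device at index 0); compare the numbers against the table above:

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    free_bytes, total_bytes = torch.cuda.mem_get_info(0)
    # Compare against the table: ~6.6GB for 4-bit, ~10.6GB for 8-bit, ~18GB for the original
    print(f"{props.name}: {total_bytes / 1024**3:.1f} GB total, {free_bytes / 1024**3:.1f} GB free")
else:
    print("No CUDA GPU detected - the quantized models still need a CUDA-capable GPU")
```
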
## How to Use Pre-Quantized Models

### 1. Loading 4-bit Model

```python
import torch
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

# Load pre-quantized 4-bit model
model_path = "/path/to/VibeVoice-Large-4bit"
processor = VibeVoiceProcessor.from_pretrained(model_path)
model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    model_path,
    device_map='cuda',
    torch_dtype=torch.bfloat16,
)
```

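To confirm the checkpoint really loaded in 4-bit, you can count the bitsandbytes layers and print the model's memory footprint. A minimal sketch, assuming `model` from the snippet above and that the inference class inherits from transformers' `PreTrainedModel` (which provides `get_memory_footprint`):

```python
import bitsandbytes as bnb

# Count Linear layers that were materialized as 4-bit bitsandbytes modules
quant_layers = sum(1 for _, m in model.named_modules() if isinstance(m, bnb.nn.Linear4bit))
print(f"4-bit quantized Linear layers: {quant_layers}")

# Rough size of the loaded weights in GB
print(f"Memory footprint: {model.get_memory_footprint() / 1024**3:.1f} GB")
```
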
### 2. Loading 8-bit Model

```python
# Same code, just point to 8-bit model
model_path = "/path/to/VibeVoice-Large-8bit"
# ... rest is the same
```

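Since only the path changes between the two variants, a tiny helper keeps test scripts tidy. A sketch reusing the same calls as above (`load_vibevoice` is just an illustrative name, not part of the package):

```python
import torch
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

def load_vibevoice(model_path: str):
    """Load a pre-quantized VibeVoice checkpoint (4-bit or 8-bit) plus its processor."""
    processor = VibeVoiceProcessor.from_pretrained(model_path)
    model = VibeVoiceForConditionalGenerationInference.from_pretrained(
        model_path,
        device_map='cuda',
        torch_dtype=torch.bfloat16,
    )
    return model, processor

model, processor = load_vibevoice("/path/to/VibeVoice-Large-8bit")
```
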
## Creating Your Own Quantized Models

Use the provided script to quantize models:

```bash
# 4-bit quantization (nf4)
python quantize_and_save_vibevoice.py \
    --model_path /path/to/original/model \
    --output_dir /path/to/output/4bit \
    --bits 4 \
    --test

# 8-bit quantization
python quantize_and_save_vibevoice.py \
    --model_path /path/to/original/model \
    --output_dir /path/to/output/8bit \
    --bits 8 \
    --test
```

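For reference, this kind of quantize-and-save workflow is typically built on transformers' `BitsAndBytesConfig`. Below is a rough sketch of the 4-bit (nf4) case; it is an assumption about how such a script works, not the repo script itself, and saving 4-bit weights needs a reasonably recent transformers/bitsandbytes:

```python
import torch
from transformers import BitsAndBytesConfig
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

# nf4 quantization settings (matches the 4-bit variant described above)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Quantize on load, then save a standalone copy that can be reloaded directly
model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "/path/to/original/model",
    quantization_config=bnb_config,
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)
model.save_pretrained("/path/to/output/4bit")

processor = VibeVoiceProcessor.from_pretrained("/path/to/original/model")
processor.save_pretrained("/path/to/output/4bit")
```
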
## Benefits

1. **Pre-quantized models load faster** - No on-the-fly quantization needed
2. **Lower VRAM requirements** - 4-bit uses only ~6.6GB vs 18GB
3. **Shareable** - Upload the quantized folder to share with others
4. **Quality preserved** - nf4 quantization maintains excellent output quality

## Distribution

To share quantized models:

1. Upload the entire quantized model directory (e.g., `VibeVoice-Large-4bit/`) - see the upload sketch below
2. Include the `quantization_config.json` file (automatically created)
3. Users can load directly without any quantization setup

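One convenient way to upload is `huggingface_hub`'s `upload_folder`; a minimal sketch (the repo id is a placeholder, and you need to be logged in via `huggingface-cli login`):

```python
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="/path/to/VibeVoice-Large-4bit",   # the quantized model directory
    repo_id="your-username/VibeVoice-Large-4bit",  # placeholder repo id
    repo_type="model",
)
```
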
## Performance Notes

- 4-bit (nf4): Best for memory-constrained systems, minimal quality loss
- 8-bit: Better quality than 4-bit, still significant memory savings
- Both versions maintain the same generation speed as the original
- Flash Attention 2 is supported in all quantized versions (see the sketch below)

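If `flash-attn` is installed, Flash Attention 2 can be requested through transformers' standard `attn_implementation` argument; a sketch, assuming the VibeVoice inference class forwards it like other transformers models:

```python
import torch
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference

model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "/path/to/VibeVoice-Large-4bit",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)
```
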
## Troubleshooting

If loading fails, work through these checks (a quick script covering them is sketched after the list):
1. Ensure you have `bitsandbytes` installed: `pip install bitsandbytes`
2. Make sure you're on a CUDA-capable GPU
3. Check that all model files are present in the directory

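A minimal sketch of those checks (the model directory path is a placeholder):

```python
import importlib.util
from pathlib import Path

import torch

# 1. bitsandbytes installed?
if importlib.util.find_spec("bitsandbytes") is None:
    print("bitsandbytes is missing: pip install bitsandbytes")

# 2. CUDA-capable GPU visible?
if torch.cuda.is_available():
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA GPU detected")

# 3. Expected files present in the model directory?
model_dir = Path("/path/to/VibeVoice-Large-4bit")
for name in ["config.json", "quantization_config.json"]:
    if not (model_dir / name).exists():
        print(f"Missing {name} in {model_dir}")
```
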
## Files Created

Each quantized model directory contains:
- `model.safetensors.*` - Quantized model weights
- `config.json` - Model configuration with quantization settings
- `quantization_config.json` - Specific quantization parameters
- `processor/` - Audio processor files
- `load_quantized_Xbit.py` - Example loading script