---
license: mit
---

Here are some quantized VibeVoice 7B models, in both 8-bit and 4-bit versions, along with some simple Python code to test them out.

## Model Sizes

| Model Version | Size | Memory Usage | Quality |
|---------------|------|--------------|---------|
| Original (fp16/bf16) | 18GB | ~18GB VRAM | Best |
| 8-bit Quantized | 9.9GB | ~10.6GB VRAM | Excellent |
| 4-bit Quantized (nf4) | 6.2GB | ~6.6GB VRAM | Very Good |

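Not sure which variant fits? A minimal sketch for checking your GPU's VRAM with PyTorch (it assumes a single CUDA device at index 0); compare the numbers against the table above:

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    free_bytes, total_bytes = torch.cuda.mem_get_info(0)
    # Compare against the table: ~6.6GB for 4-bit, ~10.6GB for 8-bit, ~18GB for the original
    print(f"{props.name}: {total_bytes / 1024**3:.1f} GB total, {free_bytes / 1024**3:.1f} GB free")
else:
    print("No CUDA GPU detected - the quantized models still need a CUDA-capable GPU")
```
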
## How to Use Pre-Quantized Models

### 1. Loading 4-bit Model

```python
import torch
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

# Load pre-quantized 4-bit model
model_path = "/path/to/VibeVoice-Large-4bit"
processor = VibeVoiceProcessor.from_pretrained(model_path)
model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    model_path,
    device_map='cuda',
    torch_dtype=torch.bfloat16,
)
```

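To confirm the checkpoint really loaded in 4-bit, you can count the bitsandbytes layers and print the model's memory footprint. A minimal sketch, assuming `model` from the snippet above and that the inference class inherits from transformers' `PreTrainedModel` (which provides `get_memory_footprint`):

```python
import bitsandbytes as bnb

# Count Linear layers that were materialized as 4-bit bitsandbytes modules
quant_layers = sum(1 for _, m in model.named_modules() if isinstance(m, bnb.nn.Linear4bit))
print(f"4-bit quantized Linear layers: {quant_layers}")

# Rough size of the loaded weights in GB
print(f"Memory footprint: {model.get_memory_footprint() / 1024**3:.1f} GB")
```
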
### 2. Loading 8-bit Model

```python
# Same code, just point to 8-bit model
model_path = "/path/to/VibeVoice-Large-8bit"
# ... rest is the same
```

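Since only the path changes between the two variants, a tiny helper keeps test scripts tidy. A sketch reusing the same calls as above (`load_vibevoice` is just an illustrative name, not part of the package):

```python
import torch
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

def load_vibevoice(model_path: str):
    """Load a pre-quantized VibeVoice checkpoint (4-bit or 8-bit) plus its processor."""
    processor = VibeVoiceProcessor.from_pretrained(model_path)
    model = VibeVoiceForConditionalGenerationInference.from_pretrained(
        model_path,
        device_map='cuda',
        torch_dtype=torch.bfloat16,
    )
    return model, processor

model, processor = load_vibevoice("/path/to/VibeVoice-Large-8bit")
```
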
## Creating Your Own Quantized Models

Use the provided script to quantize models:

```bash
# 4-bit quantization (nf4)
python quantize_and_save_vibevoice.py \
    --model_path /path/to/original/model \
    --output_dir /path/to/output/4bit \
    --bits 4 \
    --test

# 8-bit quantization
python quantize_and_save_vibevoice.py \
    --model_path /path/to/original/model \
    --output_dir /path/to/output/8bit \
    --bits 8 \
    --test
```

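For reference, this kind of quantize-and-save workflow is typically built on transformers' `BitsAndBytesConfig`. Below is a rough sketch of the 4-bit (nf4) case; it is an assumption about how such a script works, not the repo script itself, and saving 4-bit weights needs a reasonably recent transformers/bitsandbytes:

```python
import torch
from transformers import BitsAndBytesConfig
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

# nf4 quantization settings (matches the 4-bit variant described above)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Quantize on load, then save a standalone copy that can be reloaded directly
model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "/path/to/original/model",
    quantization_config=bnb_config,
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)
model.save_pretrained("/path/to/output/4bit")

processor = VibeVoiceProcessor.from_pretrained("/path/to/original/model")
processor.save_pretrained("/path/to/output/4bit")
```
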
## Benefits

1. **Pre-quantized models load faster** - No on-the-fly quantization needed
2. **Lower VRAM requirements** - 4-bit uses only ~6.6GB vs 18GB
3. **Shareable** - Upload the quantized folder to share with others
4. **Quality preserved** - nf4 quantization maintains excellent output quality

## Distribution

To share quantized models:

1. Upload the entire quantized model directory (e.g., `VibeVoice-Large-4bit/`) - see the upload sketch below
2. Include the `quantization_config.json` file (automatically created)
3. Users can load directly without any quantization setup

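One convenient way to upload is `huggingface_hub`'s `upload_folder`; a minimal sketch (the repo id is a placeholder, and you need to be logged in via `huggingface-cli login`):

```python
from huggingface_hub import HfApi

api = HfApi()
api.upload_folder(
    folder_path="/path/to/VibeVoice-Large-4bit",   # the quantized model directory
    repo_id="your-username/VibeVoice-Large-4bit",  # placeholder repo id
    repo_type="model",
)
```
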
## Performance Notes

- 4-bit (nf4): Best for memory-constrained systems, minimal quality loss
- 8-bit: Better quality than 4-bit, still significant memory savings
- Both versions maintain the same generation speed as the original
- Flash Attention 2 is supported in all quantized versions (see the sketch below)

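If `flash-attn` is installed, Flash Attention 2 can be requested through transformers' standard `attn_implementation` argument; a sketch, assuming the VibeVoice inference class forwards it like other transformers models:

```python
import torch
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference

model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "/path/to/VibeVoice-Large-4bit",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
)
```
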
## Troubleshooting

If loading fails, work through these checks (a quick script covering them is sketched after the list):
1. Ensure you have `bitsandbytes` installed: `pip install bitsandbytes`
2. Make sure you're on a CUDA-capable GPU
3. Check that all model files are present in the directory

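A minimal sketch of those checks (the model directory path is a placeholder):

```python
import importlib.util
from pathlib import Path

import torch

# 1. bitsandbytes installed?
if importlib.util.find_spec("bitsandbytes") is None:
    print("bitsandbytes is missing: pip install bitsandbytes")

# 2. CUDA-capable GPU visible?
if torch.cuda.is_available():
    print(f"Using GPU: {torch.cuda.get_device_name(0)}")
else:
    print("No CUDA GPU detected")

# 3. Expected files present in the model directory?
model_dir = Path("/path/to/VibeVoice-Large-4bit")
for name in ["config.json", "quantization_config.json"]:
    if not (model_dir / name).exists():
        print(f"Missing {name} in {model_dir}")
```
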
## Files Created

Each quantized model directory contains:
- `model.safetensors.*` - Quantized model weights
- `config.json` - Model configuration with quantization settings
- `quantization_config.json` - Specific quantization parameters
- `processor/` - Audio processor files
- `load_quantized_Xbit.py` - Example loading script