Delete 8bit
- 8bit/QUANTIZATION_README.md +0 -95
- 8bit/README.md +0 -23
- 8bit/config.json +0 -132
- 8bit/generation_config.json +0 -4
- 8bit/load_quantized_8bit.py +0 -60
- 8bit/minimal_memory_output.wav +0 -3
- 8bit/model-00001-of-00003.safetensors +0 -3
- 8bit/model-00002-of-00003.safetensors +0 -3
- 8bit/model-00003-of-00003.safetensors +0 -3
- 8bit/model.safetensors.index.json +0 -0
- 8bit/preprocessor_config.json +0 -12
- 8bit/quantization_config.json +0 -20
- 8bit/quantize_and_save_vibevoice.py +0 -330
- 8bit/test_accurate_vram.py +0 -207
- 8bit/use_quantized_model.py +0 -70
8bit/QUANTIZATION_README.md
DELETED
@@ -1,95 +0,0 @@
# VibeVoice Quantization Guide

The VibeVoice 7B model has been successfully quantized to both 4-bit and 8-bit versions using bitsandbytes.

## Model Sizes

| Model Version | Size | Memory Usage | Quality |
|---------------|------|--------------|---------|
| Original (fp16/bf16) | 18GB | ~18GB VRAM | Best |
| 8-bit Quantized | 9.9GB | ~10.6GB VRAM | Excellent |
| 4-bit Quantized (nf4) | 6.2GB | ~6.6GB VRAM | Very Good |

## How to Use Pre-Quantized Models

### 1. Loading 4-bit Model

```python
import torch

from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

# Load pre-quantized 4-bit model
model_path = "/path/to/VibeVoice-Large-4bit"
processor = VibeVoiceProcessor.from_pretrained(model_path)
model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    model_path,
    device_map='cuda',
    torch_dtype=torch.bfloat16,
)
```

### 2. Loading 8-bit Model

```python
# Same code, just point to the 8-bit model
model_path = "/path/to/VibeVoice-Large-8bit"
# ... the rest is the same
```

## Creating Your Own Quantized Models

Use the provided script to quantize models:

```bash
# 4-bit quantization (nf4)
python quantize_and_save_vibevoice.py \
    --model_path /path/to/original/model \
    --output_dir /path/to/output/4bit \
    --bits 4 \
    --test

# 8-bit quantization
python quantize_and_save_vibevoice.py \
    --model_path /path/to/original/model \
    --output_dir /path/to/output/8bit \
    --bits 8 \
    --test
```

## Benefits

1. **Pre-quantized models load faster** - no on-the-fly quantization needed
2. **Lower VRAM requirements** - 4-bit uses only ~6.6GB vs 18GB
3. **Shareable** - upload the quantized folder to share with others
4. **Quality preserved** - nf4 quantization maintains excellent output quality

## Distribution

To share quantized models:

1. Upload the entire quantized model directory (e.g., `VibeVoice-Large-4bit/`)
2. Include the `quantization_config.json` file (created automatically)
3. Users can load it directly without any quantization setup

## Performance Notes

- 4-bit (nf4): best for memory-constrained systems, minimal quality loss
- 8-bit: better quality than 4-bit, still significant memory savings
- Both versions maintain the same generation speed as the original
- Flash Attention 2 is supported in all quantized versions

## Troubleshooting

If loading fails:

1. Ensure `bitsandbytes` is installed: `pip install bitsandbytes`
2. Make sure you are on a CUDA-capable GPU
3. Check that all model files are present in the directory

## Files Created

Each quantized model directory contains:

- `model.safetensors.*` - quantized model weights
- `config.json` - model configuration with quantization settings
- `quantization_config.json` - specific quantization parameters
- `processor/` - audio processor files
- `load_quantized_Xbit.py` - example loading script
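The troubleshooting list above can be wrapped into a small preflight check. This is a sketch, not part of the deleted folder; the file names it looks for are taken from the "Files Created" list.

```python
# Preflight check for the troubleshooting steps above (a sketch, not part of
# the original folder). File names follow the "Files Created" list.
from pathlib import Path

import torch


def preflight_check(model_dir: str) -> None:
    try:
        import bitsandbytes  # noqa: F401
    except ImportError:
        raise SystemExit("bitsandbytes is not installed: pip install bitsandbytes")

    if not torch.cuda.is_available():
        raise SystemExit("No CUDA-capable GPU detected")

    expected = ["config.json", "quantization_config.json", "preprocessor_config.json"]
    missing = [name for name in expected if not (Path(model_dir) / name).exists()]
    if missing:
        raise SystemExit(f"Missing files in {model_dir}: {missing}")

    print("Preflight OK")
```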
8bit/README.md
DELETED
@@ -1,23 +0,0 @@
# VibeVoice 7B - 8-bit Quantized

Better quality than the 4-bit version, with moderate VRAM requirements.

## Specifications
- Quantization: 8-bit (int8)
- Model size: 9.9 GB
- VRAM usage: ~12 GB
- Quality: Excellent (minimal degradation)

## Usage

```python
import torch

from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "Dannidee/VibeVoice7b-low-vram/8bit",
    device_map='cuda',
    torch_dtype=torch.bfloat16,
)
processor = VibeVoiceProcessor.from_pretrained("Dannidee/VibeVoice7b-low-vram/8bit")
```
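Once the model and processor are loaded, generation follows the same pattern as the repo's `use_quantized_model.py`. A sketch (the voice sample paths are placeholders):

```python
# Continues from the loading snippet above; mirrors use_quantized_model.py.
# The voice sample paths are placeholders.
text = "Speaker 1: Welcome to our podcast! Speaker 2: Thanks for having me!"
speaker_voices = ["path/to/voice1.wav", "path/to/voice2.wav"]

inputs = processor(
    text=[text],
    voice_samples=[speaker_voices],
    padding=True,
    return_tensors="pt",
    return_attention_mask=True,
)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=None,
        cfg_scale=1.3,
        tokenizer=processor.tokenizer,
        generation_config={'do_sample': False},
    )

processor.save_audio(outputs.speech_outputs[0], output_path="output.wav")
```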
8bit/config.json
DELETED
@@ -1,132 +0,0 @@
{
  "acoustic_vae_dim": 64,
  "acoustic_tokenizer_config": {
    "causal": true,
    "channels": 1,
    "conv_bias": true,
    "conv_norm": "none",
    "corpus_normalize": 0.0,
    "decoder_depths": null,
    "decoder_n_filters": 32,
    "decoder_ratios": [
      8,
      5,
      5,
      4,
      2,
      2
    ],
    "disable_last_norm": true,
    "encoder_depths": "3-3-3-3-3-3-8",
    "encoder_n_filters": 32,
    "encoder_ratios": [
      8,
      5,
      5,
      4,
      2,
      2
    ],
    "fix_std": 0.5,
    "layer_scale_init_value": 1e-06,
    "layernorm": "RMSNorm",
    "layernorm_elementwise_affine": true,
    "layernorm_eps": 1e-05,
    "mixer_layer": "depthwise_conv",
    "model_type": "vibevoice_acoustic_tokenizer",
    "pad_mode": "constant",
    "std_dist_type": "gaussian",
    "vae_dim": 64,
    "weight_init_value": 0.01
  },
  "architectures": [
    "VibeVoiceForConditionalGeneration"
  ],
  "decoder_config": {
    "attention_dropout": 0.0,
    "hidden_act": "silu",
    "hidden_size": 3584,
    "initializer_range": 0.02,
    "intermediate_size": 18944,
    "max_position_embeddings": 32768,
    "max_window_layers": 28,
    "model_type": "qwen2",
    "num_attention_heads": 28,
    "num_hidden_layers": 28,
    "num_key_value_heads": 4,
    "rms_norm_eps": 1e-06,
    "rope_scaling": null,
    "rope_theta": 1000000.0,
    "sliding_window": null,
    "torch_dtype": "bfloat16",
    "use_cache": true,
    "use_mrope": false,
    "use_sliding_window": false,
    "vocab_size": 152064
  },
  "diffusion_head_config": {
    "ddpm_batch_mul": 4,
    "ddpm_beta_schedule": "cosine",
    "ddpm_num_inference_steps": 20,
    "ddpm_num_steps": 1000,
    "diffusion_type": "ddpm",
    "head_ffn_ratio": 3.0,
    "head_layers": 4,
    "hidden_size": 3584,
    "latent_size": 64,
    "model_type": "vibevoice_diffusion_head",
    "prediction_type": "v_prediction",
    "rms_norm_eps": 1e-05,
    "speech_vae_dim": 64
  },
  "model_type": "vibevoice",
  "semantic_tokenizer_config": {
    "causal": true,
    "channels": 1,
    "conv_bias": true,
    "conv_norm": "none",
    "corpus_normalize": 0.0,
    "disable_last_norm": true,
    "encoder_depths": "3-3-3-3-3-3-8",
    "encoder_n_filters": 32,
    "encoder_ratios": [
      8,
      5,
      5,
      4,
      2,
      2
    ],
    "fix_std": 0,
    "layer_scale_init_value": 1e-06,
    "layernorm": "RMSNorm",
    "layernorm_elementwise_affine": true,
    "layernorm_eps": 1e-05,
    "mixer_layer": "depthwise_conv",
    "model_type": "vibevoice_semantic_tokenizer",
    "pad_mode": "constant",
    "std_dist_type": "none",
    "vae_dim": 128,
    "weight_init_value": 0.01
  },
  "semantic_vae_dim": 128,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.51.3",
  "quantization_config": {
    "quant_method": "bitsandbytes",
    "_load_in_8bit": true,
    "_load_in_4bit": false,
    "llm_int8_threshold": 6.0,
    "llm_int8_skip_modules": null,
    "llm_int8_enable_fp32_cpu_offload": false,
    "llm_int8_has_fp16_weight": false,
    "bnb_4bit_quant_type": "fp4",
    "bnb_4bit_use_double_quant": false,
    "bnb_4bit_compute_dtype": "float32",
    "bnb_4bit_quant_storage": "uint8",
    "load_in_4bit": false,
    "load_in_8bit": true
  },
  "_quantization_method": "bitsandbytes"
}
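The `quantization_config` block embedded above is what marks the checkpoint as pre-quantized: with a bitsandbytes-aware transformers install, `from_pretrained` should pick it up without an explicit `BitsAndBytesConfig`, which is how the repo's `use_quantized_model.py` loads the weights. A sketch (the path is a placeholder):

```python
# The embedded quantization_config should let from_pretrained restore the
# int8 weights directly (placeholder path; see use_quantized_model.py).
import torch
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference

model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "/path/to/VibeVoice-Large-8bit",
    device_map='cuda',
    torch_dtype=torch.bfloat16,
)
```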
8bit/generation_config.json
DELETED
@@ -1,4 +0,0 @@
{
  "_from_model_config": true,
  "transformers_version": "4.51.3"
}
8bit/load_quantized_8bit.py
DELETED
@@ -1,60 +0,0 @@
#!/usr/bin/env python
"""
Load and use the 8-bit quantized VibeVoice model
"""

import torch
from transformers import BitsAndBytesConfig
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

def load_quantized_model(model_path="/home/deveraux/Desktop/vibevoice/VibeVoice-Large-8bit"):
    """Load the pre-quantized VibeVoice model"""

    print("Loading 8-bit quantized VibeVoice model...")

    # The model is already quantized, but we need to specify the config
    # to ensure proper loading of quantized weights
    bnb_config = BitsAndBytesConfig(
        load_in_8bit=True,
        bnb_8bit_compute_dtype=torch.bfloat16,


    )

    # Load processor
    processor = VibeVoiceProcessor.from_pretrained(model_path)

    # Load model
    model = VibeVoiceForConditionalGenerationInference.from_pretrained(
        model_path,
        quantization_config=bnb_config,
        device_map='cuda',
        torch_dtype=torch.bfloat16,
    )

    model.eval()

    print("✅ Model loaded successfully!")
    print(f"💾 Memory usage: {torch.cuda.memory_allocated() / 1e9:.1f} GB")

    return model, processor

# Example usage
if __name__ == "__main__":
    model, processor = load_quantized_model()

    # Generate audio
    text = "Speaker 1: Hello! Speaker 2: Hi there!"
    inputs = processor(
        text=[text],
        voice_samples=[["path/to/voice1.wav", "path/to/voice2.wav"]],
        padding=True,
        return_tensors="pt",
    )

    with torch.no_grad():
        outputs = model.generate(**inputs)

    # Save audio
    processor.save_audio(outputs.speech_outputs[0], "output.wav")
8bit/minimal_memory_output.wav
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c3cf133304229512e369c0b4db51c7d8ebbab43dd8c7945b5bf8e9b727185893
size 313644
8bit/model-00001-of-00003.safetensors
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:68f98075dac463766219e6e61ff5fe9ab969f8fea621a65906f1d6793f2eaf72
size 4987685394
8bit/model-00002-of-00003.safetensors
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:48940fb59366de226af5df46020f022d4d651f4563f190142c175b5bf733e9c7
size 4489976774
8bit/model-00003-of-00003.safetensors
DELETED
@@ -1,3 +0,0 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d83c0514c0c9d2675cb4d51ee56b12515ea45770ce35acc5ab0ec4bc7d1bef73
size 1089994880
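The three LFS pointers above give the shard sizes in bytes; summing them reproduces the ~9.9 GB model size quoted in the READMEs when expressed in binary gigabytes:

```python
# Total size of the three 8-bit safetensors shards (from the LFS pointers above).
shards = [4_987_685_394, 4_489_976_774, 1_089_994_880]
total = sum(shards)          # 10,567,657,048 bytes
print(total / 1024**3)       # ≈ 9.8 GiB, the "9.9 GB" quoted in the READMEs
print(total / 1e9)           # ≈ 10.6 GB in decimal units
```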
8bit/model.safetensors.index.json
DELETED
The diff for this file is too large to render.
8bit/preprocessor_config.json
DELETED
@@ -1,12 +0,0 @@
{
  "processor_class": "VibeVoiceProcessor",
  "speech_tok_compress_ratio": 3200,
  "db_normalize": true,
  "audio_processor": {
    "feature_extractor_type": "VibeVoiceTokenizerProcessor",
    "sampling_rate": 24000,
    "normalize_audio": true,
    "target_dB_FS": -25,
    "eps": 1e-06
  }
}
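`"db_normalize": true` with `"target_dB_FS": -25` implies the processor scales each input so its RMS level sits at -25 dBFS before tokenization. A sketch of that computation (an assumption; the actual `VibeVoiceTokenizerProcessor` may differ in detail):

```python
# Sketch of RMS normalization to target_dB_FS = -25 (assumption: the real
# VibeVoiceTokenizerProcessor may implement this differently).
import numpy as np

def normalize_to_dbfs(audio: np.ndarray, target_db_fs: float = -25.0, eps: float = 1e-6) -> np.ndarray:
    rms = np.sqrt(np.mean(audio ** 2))       # current RMS level (full scale = 1.0)
    target_rms = 10 ** (target_db_fs / 20)   # -25 dBFS ≈ 0.056 linear
    return audio * (target_rms / (rms + eps))
```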
8bit/quantization_config.json
DELETED
@@ -1,20 +0,0 @@
{
  "quantization_config": {
    "quant_method": "bitsandbytes",
    "_load_in_8bit": true,
    "_load_in_4bit": false,
    "llm_int8_threshold": 6.0,
    "llm_int8_skip_modules": null,
    "llm_int8_enable_fp32_cpu_offload": false,
    "llm_int8_has_fp16_weight": false,
    "bnb_4bit_quant_type": "fp4",
    "bnb_4bit_use_double_quant": false,
    "bnb_4bit_compute_dtype": "float32",
    "bnb_4bit_quant_storage": "uint8",
    "load_in_4bit": false,
    "load_in_8bit": true
  },
  "quantization_method": "bitsandbytes",
  "bits": 8,
  "quant_type": "nf4"
}
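The nested `quantization_config` block maps directly onto transformers' `BitsAndBytesConfig`; only the 8-bit fields matter here (the `bnb_4bit_*` entries are library defaults, and the top-level `"quant_type": "nf4"` is written unconditionally by `quantize_and_save_vibevoice.py`, even for 8-bit runs). The roughly equivalent object:

```python
# Roughly equivalent BitsAndBytesConfig for the 8-bit settings above.
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,                 # outlier threshold for int8 matmuls
    llm_int8_has_fp16_weight=False,
    llm_int8_enable_fp32_cpu_offload=False,
)
```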
8bit/quantize_and_save_vibevoice.py
DELETED
@@ -1,330 +0,0 @@
#!/usr/bin/env python
"""
Quantize and save VibeVoice model using bitsandbytes
Creates a pre-quantized model that can be shared and loaded directly
"""

import os
import json
import shutil
import torch
from pathlib import Path
from transformers import BitsAndBytesConfig
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor
from transformers.utils import logging
from safetensors.torch import save_file

logging.set_verbosity_info()

def quantize_and_save_model(
    model_path: str,
    output_dir: str,
    bits: int = 4,
    quant_type: str = "nf4"
):
    """Quantize VibeVoice model and save it for distribution"""

    print(f"\n{'='*70}")
    print(f"VIBEVOICE QUANTIZATION - {bits}-bit ({quant_type})")
    print(f"{'='*70}")
    print(f"Source: {model_path}")
    print(f"Output: {output_dir}")
    print(f"{'='*70}\n")

    # Create output directory
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    # Configure quantization
    if bits == 4:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type=quant_type
        )
    elif bits == 8:
        bnb_config = BitsAndBytesConfig(
            load_in_8bit=True,
            bnb_8bit_compute_dtype=torch.bfloat16,
        )
    else:
        raise ValueError(f"Unsupported bit width: {bits}")

    print("🔧 Loading and quantizing model...")

    # Load the model with quantization
    model = VibeVoiceForConditionalGenerationInference.from_pretrained(
        model_path,
        quantization_config=bnb_config,
        device_map='cuda',
        torch_dtype=torch.bfloat16,
    )

    # Get memory usage
    memory_gb = torch.cuda.memory_allocated() / 1e9
    print(f"💾 Quantized model memory usage: {memory_gb:.1f} GB")

    # Save the quantized model
    print("\n📦 Saving quantized model...")

    # Method 1: Try using save_pretrained with quantization info
    try:
        # Save model with quantization config
        model.save_pretrained(
            output_path,
            safe_serialization=True,
            max_shard_size="5GB"
        )

        # Save the quantization config separately
        quant_config_dict = {
            "quantization_config": bnb_config.to_dict(),
            "quantization_method": "bitsandbytes",
            "bits": bits,
            "quant_type": quant_type
        }

        with open(output_path / "quantization_config.json", 'w') as f:
            json.dump(quant_config_dict, f, indent=2)

        print("✅ Model saved with integrated quantization")

    except Exception as e:
        print(f"⚠️ Standard save failed: {e}")
        print("Trying alternative save method...")

        # Method 2: Save state dict with quantized weights
        save_quantized_state_dict(model, output_path, bnb_config)

    # Copy processor files
    print("\n📋 Copying processor files...")
    processor = VibeVoiceProcessor.from_pretrained(model_path)
    processor.save_pretrained(output_path)

    # Copy additional config files
    for file in ["config.json", "generation_config.json"]:
        src = Path(model_path) / file
        if src.exists():
            shutil.copy2(src, output_path / file)

    # Update config to indicate quantization
    config_path = output_path / "config.json"
    if config_path.exists():
        with open(config_path, 'r') as f:
            config = json.load(f)

        config["quantization_config"] = bnb_config.to_dict()
        config["_quantization_method"] = "bitsandbytes"

        with open(config_path, 'w') as f:
            json.dump(config, f, indent=2)

    print(f"\n✅ Quantized model saved to: {output_path}")

    # Create loading script
    create_loading_script(output_path, bits, quant_type)

    return output_path

def save_quantized_state_dict(model, output_path, bnb_config):
    """Alternative method to save quantized weights"""
    print("\n🔧 Saving quantized state dict...")

    # Get the state dict
    state_dict = model.state_dict()

    # Separate quantized and non-quantized parameters
    quantized_state = {}
    metadata = {
        "quantized_modules": [],
        "quantization_config": bnb_config.to_dict()
    }

    for name, param in state_dict.items():
        # Check if this is a quantized parameter
        if hasattr(param, 'quant_state'):
            # Store quantization state
            metadata["quantized_modules"].append(name)
            quantized_state[name] = param.data
        else:
            # Regular parameter
            quantized_state[name] = param

    # Save using safetensors
    save_file(quantized_state, output_path / "model.safetensors", metadata=metadata)

    # Save metadata
    with open(output_path / "quantization_metadata.json", 'w') as f:
        json.dump(metadata, f, indent=2)

def create_loading_script(output_path, bits, quant_type):
    """Create a script to load the quantized model"""

    script_content = f'''#!/usr/bin/env python
"""
Load and use the {bits}-bit quantized VibeVoice model
"""

import torch
from transformers import BitsAndBytesConfig
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

def load_quantized_model(model_path="{output_path}"):
    """Load the pre-quantized VibeVoice model"""

    print("Loading {bits}-bit quantized VibeVoice model...")

    # The model is already quantized, but we need to specify the config
    # to ensure proper loading of quantized weights
    bnb_config = BitsAndBytesConfig(
        load_in_{bits}bit=True,
        bnb_{bits}bit_compute_dtype=torch.bfloat16,
        {"bnb_4bit_use_double_quant=True," if bits == 4 else ""}
        {"bnb_4bit_quant_type='" + quant_type + "'" if bits == 4 else ""}
    )

    # Load processor
    processor = VibeVoiceProcessor.from_pretrained(model_path)

    # Load model
    model = VibeVoiceForConditionalGenerationInference.from_pretrained(
        model_path,
        quantization_config=bnb_config,
        device_map='cuda',
        torch_dtype=torch.bfloat16,
    )

    model.eval()

    print("✅ Model loaded successfully!")
    print(f"💾 Memory usage: {{torch.cuda.memory_allocated() / 1e9:.1f}} GB")

    return model, processor

# Example usage
if __name__ == "__main__":
    model, processor = load_quantized_model()

    # Generate audio
    text = "Speaker 1: Hello! Speaker 2: Hi there!"
    inputs = processor(
        text=[text],
        voice_samples=[["path/to/voice1.wav", "path/to/voice2.wav"]],
        padding=True,
        return_tensors="pt",
    )

    with torch.no_grad():
        outputs = model.generate(**inputs)

    # Save audio
    processor.save_audio(outputs.speech_outputs[0], "output.wav")
'''

    script_path = output_path / f"load_quantized_{bits}bit.py"
    with open(script_path, 'w') as f:
        f.write(script_content)

    print(f"📝 Created loading script: {script_path}")

def test_quantized_model(model_path):
    """Test loading and generating with the quantized model"""
    print(f"\n🧪 Testing quantized model from: {model_path}")

    try:
        # Load the quantized model
        processor = VibeVoiceProcessor.from_pretrained(model_path)

        # Load with auto-detection of quantization
        model = VibeVoiceForConditionalGenerationInference.from_pretrained(
            model_path,
            device_map='cuda',
            torch_dtype=torch.bfloat16,
        )

        print("✅ Model loaded successfully!")

        # Quick generation test
        test_text = "Speaker 1: Testing quantized model. Speaker 2: It works!"
        print(f"\n🎤 Testing generation with: '{test_text}'")

        # Use demo voices
        voices_dir = "/home/deveraux/Desktop/vibevoice/VibeVoice-main/demo/voices"
        speaker_voices = [
            os.path.join(voices_dir, "en-Alice_woman.wav"),
            os.path.join(voices_dir, "en-Carter_man.wav")
        ]

        inputs = processor(
            text=[test_text],
            voice_samples=[speaker_voices],
            padding=True,
            return_tensors="pt",
            return_attention_mask=True,
        )

        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=None,
                cfg_scale=1.3,
                tokenizer=processor.tokenizer,
                generation_config={'do_sample': False},
            )

        print("✅ Generation successful!")

        # Save test output
        output_path = Path(model_path) / "test_output.wav"
        processor.save_audio(outputs.speech_outputs[0], output_path=str(output_path))
        print(f"🔊 Test audio saved to: {output_path}")

        return True

    except Exception as e:
        print(f"❌ Test failed: {e}")
        return False

def main():
    import argparse
    parser = argparse.ArgumentParser(description="Quantize and save VibeVoice model")
    parser.add_argument("--model_path", default="/home/deveraux/Desktop/vibevoice/VibeVoice-Large-pt",
                        help="Path to the original model")
    parser.add_argument("--output_dir", default="/home/deveraux/Desktop/vibevoice/VibeVoice-Large-4bit",
                        help="Output directory for quantized model")
    parser.add_argument("--bits", type=int, default=4, choices=[4, 8],
                        help="Quantization bits (4 or 8)")
    parser.add_argument("--quant_type", default="nf4", choices=["nf4", "fp4"],
                        help="4-bit quantization type")
    parser.add_argument("--test", action="store_true",
                        help="Test the quantized model after saving")

    args = parser.parse_args()

    # Update output dir based on bits
    if str(args.bits) not in args.output_dir:
        args.output_dir = args.output_dir.replace("4bit", f"{args.bits}bit")

    # Quantize and save
    output_path = quantize_and_save_model(
        args.model_path,
        args.output_dir,
        args.bits,
        args.quant_type
    )

    # Test if requested
    if args.test:
        test_quantized_model(output_path)

    print(f"\n🎉 Done! Quantized model ready for distribution at: {output_path}")
    print(f"\n📦 To share this model:")
    print(f"1. Upload the entire '{output_path}' directory")
    print(f"2. Users can load it with the provided script or directly with transformers")
    print(f"3. The model will load in {args.bits}-bit without additional quantization")

if __name__ == "__main__":
    main()
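For an 8-bit run, the two 4-bit-only conditional expressions inside the `create_loading_script` f-string expand to empty strings, which is why the generated `load_quantized_8bit.py` committed in this folder has a `BitsAndBytesConfig` call containing two blank lines. Note that `bnb_8bit_compute_dtype` is not a documented `BitsAndBytesConfig` argument, so transformers will most likely just ignore it. The rendered call looks roughly like:

```python
# What create_loading_script() emits for bits=8 (the 4-bit-only lines render empty).
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    bnb_8bit_compute_dtype=torch.bfloat16,


)
```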
8bit/test_accurate_vram.py
DELETED
@@ -1,207 +0,0 @@
#!/usr/bin/env python
"""
Accurate VRAM measurement for VibeVoice models
Shows the difference between allocated vs reserved memory
"""

import os
import gc
import torch
import subprocess
import time
from pathlib import Path
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

def get_gpu_memory_info():
    """Get detailed GPU memory information"""
    if not torch.cuda.is_available():
        return {}

    # PyTorch memory stats
    allocated = torch.cuda.memory_allocated() / 1e9
    reserved = torch.cuda.memory_reserved() / 1e9

    # Get nvidia-smi info
    try:
        result = subprocess.run([
            'nvidia-smi',
            '--query-gpu=memory.used,memory.total',
            '--format=csv,nounits,noheader'
        ], capture_output=True, text=True)

        if result.returncode == 0:
            used, total = map(int, result.stdout.strip().split(','))
            nvidia_used_gb = used / 1024  # Convert MB to GB
            nvidia_total_gb = total / 1024
        else:
            nvidia_used_gb = 0
            nvidia_total_gb = 0
    except:
        nvidia_used_gb = 0
        nvidia_total_gb = 0

    return {
        'allocated': allocated,
        'reserved': reserved,
        'nvidia_smi': nvidia_used_gb,
        'nvidia_total': nvidia_total_gb
    }

def print_memory_report(label, before, after):
    """Print detailed memory usage report"""
    print(f"\n{label}:")
    print(f"  PyTorch Allocated: {before['allocated']:.2f} GB → {after['allocated']:.2f} GB "
          f"(+{after['allocated'] - before['allocated']:.2f} GB)")
    print(f"  PyTorch Reserved:  {before['reserved']:.2f} GB → {after['reserved']:.2f} GB "
          f"(+{after['reserved'] - before['reserved']:.2f} GB)")
    print(f"  nvidia-smi Total:  {before['nvidia_smi']:.2f} GB → {after['nvidia_smi']:.2f} GB "
          f"(+{after['nvidia_smi'] - before['nvidia_smi']:.2f} GB)")
    print(f"  Memory Overhead:   {after['reserved'] - after['allocated']:.2f} GB (PyTorch cache)")

def clear_gpu_memory():
    """Aggressively clear GPU memory"""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()
        # Force memory pool cleanup
        torch.cuda.reset_peak_memory_stats()

def test_model_memory(model_path, model_name):
    """Test model with detailed memory tracking"""
    print(f"\n{'='*70}")
    print(f"Testing {model_name}")
    print(f"{'='*70}")

    # Clear memory and get baseline
    clear_gpu_memory()
    time.sleep(2)  # Let memory settle

    baseline = get_gpu_memory_info()
    print(f"\nBaseline GPU Memory:")
    print(f"  PyTorch Allocated: {baseline['allocated']:.2f} GB")
    print(f"  PyTorch Reserved:  {baseline['reserved']:.2f} GB")
    print(f"  nvidia-smi Shows:  {baseline['nvidia_smi']:.2f} GB / {baseline['nvidia_total']:.2f} GB")

    # Load model
    print(f"\nLoading {model_name}...")
    load_start = time.time()

    processor = VibeVoiceProcessor.from_pretrained(model_path)
    model = VibeVoiceForConditionalGenerationInference.from_pretrained(
        model_path,
        device_map='cuda',
        torch_dtype=torch.bfloat16,
    )
    model.eval()

    load_time = time.time() - load_start

    # Get memory after loading
    loaded = get_gpu_memory_info()
    print_memory_report("After Model Loading", baseline, loaded)

    # Test generation to see peak usage
    print(f"\nTesting generation...")
    test_text = "Speaker 1: Testing memory usage. Speaker 2: Let's see the results!"
    voices_dir = "/home/deveraux/Desktop/vibevoice/VibeVoice-main/demo/voices"
    speaker_voices = [
        os.path.join(voices_dir, "en-Alice_woman.wav"),
        os.path.join(voices_dir, "en-Carter_man.wav")
    ]

    inputs = processor(
        text=[test_text],
        voice_samples=[speaker_voices],
        padding=True,
        return_tensors="pt",
        return_attention_mask=True,
    )

    # Monitor during generation
    pre_gen = get_gpu_memory_info()

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=None,
            cfg_scale=1.3,
            tokenizer=processor.tokenizer,
            generation_config={'do_sample': False},
        )

    post_gen = get_gpu_memory_info()
    print_memory_report("During Generation", pre_gen, post_gen)

    # Peak memory stats
    if torch.cuda.is_available():
        peak_memory = torch.cuda.max_memory_allocated() / 1e9
        peak_reserved = torch.cuda.max_memory_reserved() / 1e9
        print(f"\nPeak Memory Usage:")
        print(f"  Peak Allocated: {peak_memory:.2f} GB")
        print(f"  Peak Reserved:  {peak_reserved:.2f} GB")

    # Clean up
    del model
    del processor
    clear_gpu_memory()

    return {
        'name': model_name,
        'allocated': loaded['allocated'] - baseline['allocated'],
        'reserved': loaded['reserved'] - baseline['reserved'],
        'nvidia_smi': loaded['nvidia_smi'] - baseline['nvidia_smi'],
        'peak_allocated': peak_memory,
        'peak_reserved': peak_reserved
    }

def main():
    print("="*70)
    print("ACCURATE VRAM MEASUREMENT FOR VIBEVOICE")
    print("="*70)
    print("\nNote: PyTorch reserves extra memory for efficiency.")
    print("nvidia-smi shows total reserved memory, not just allocated.")

    models = [
        {
            "path": "/home/deveraux/Desktop/vibevoice/VibeVoice-Large-pt",
            "name": "16-bit Original"
        },
        {
            "path": "/home/deveraux/Desktop/vibevoice/VibeVoice-Large-4bit",
            "name": "4-bit Quantized"
        }
    ]

    results = []
    for model_info in models:
        try:
            result = test_model_memory(model_info["path"], model_info["name"])
            results.append(result)
            time.sleep(5)
        except Exception as e:
            print(f"Error testing {model_info['name']}: {e}")

    # Summary
    print("\n" + "="*70)
    print("MEMORY USAGE SUMMARY")
    print("="*70)
    print(f"\n{'Model':<20} {'Allocated':<12} {'Reserved':<12} {'nvidia-smi':<12} {'Peak':<12}")
    print("-"*70)

    for r in results:
        print(f"{r['name']:<20} "
              f"{r['allocated']:<12.2f} "
              f"{r['reserved']:<12.2f} "
              f"{r['nvidia_smi']:<12.2f} "
              f"{r['peak_allocated']:<12.2f}")

    print("\n💡 Key Insights:")
    print("- 'Allocated' = Actual model weights in memory")
    print("- 'Reserved' = Total GPU memory reserved by PyTorch (includes cache)")
    print("- 'nvidia-smi' = What nvidia-smi reports (includes all overhead)")
    print("- The difference is PyTorch's memory pool for efficiency")

if __name__ == "__main__":
    main()
8bit/use_quantized_model.py
DELETED
@@ -1,70 +0,0 @@
#!/usr/bin/env python
"""
Simple example of using the pre-quantized VibeVoice model
No need for on-the-fly quantization - loads much faster!
"""

import os
import torch
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

def main():
    # Path to the pre-quantized model
    model_path = "/home/deveraux/Desktop/vibevoice/VibeVoice-Large-4bit"

    print("Loading pre-quantized VibeVoice 4-bit model...")

    # Load processor
    processor = VibeVoiceProcessor.from_pretrained(model_path)

    # Load the pre-quantized model
    # The quantization config is already saved in the model
    model = VibeVoiceForConditionalGenerationInference.from_pretrained(
        model_path,
        device_map='cuda',
        torch_dtype=torch.bfloat16,
    )
    model.eval()

    # Check memory usage
    memory_gb = torch.cuda.memory_allocated() / 1e9
    print(f"✅ Model loaded! Memory usage: {memory_gb:.1f} GB")

    # Example generation
    text = "Speaker 1: Welcome to our podcast! Speaker 2: Thanks for having me!"

    # Voice samples (using demo voices)
    voices_dir = "/home/deveraux/Desktop/vibevoice/VibeVoice-main/demo/voices"
    speaker_voices = [
        os.path.join(voices_dir, "en-Alice_woman.wav"),
        os.path.join(voices_dir, "en-Carter_man.wav")
    ]

    # Process inputs
    inputs = processor(
        text=[text],
        voice_samples=[speaker_voices],
        padding=True,
        return_tensors="pt",
        return_attention_mask=True,
    )

    # Generate
    print(f"\nGenerating: '{text}'")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=None,
            cfg_scale=1.3,
            tokenizer=processor.tokenizer,
            generation_config={'do_sample': False},
        )

    # Save output
    output_path = "quantized_output.wav"
    processor.save_audio(outputs.speech_outputs[0], output_path=output_path)
    print(f"✅ Audio saved to: {output_path}")

if __name__ == "__main__":
    main()