DevParker committed on
Commit cf06904 · verified · 1 Parent(s): b6d2f8e

Delete 8bit

8bit/QUANTIZATION_README.md DELETED
@@ -1,95 +0,0 @@
1
- # VibeVoice Quantization Guide
2
-
3
- Successfully quantized the VibeVoice 7B model to both 4-bit and 8-bit versions using bitsandbytes!
4
-
5
- ## Model Sizes
6
-
7
- | Model Version | Size | Memory Usage | Quality |
8
- |---------------|------|--------------|---------|
9
- | Original (fp16/bf16) | 18GB | ~18GB VRAM | Best |
10
- | 8-bit Quantized | 9.9GB | ~10.6GB VRAM | Excellent |
11
- | 4-bit Quantized (nf4) | 6.2GB | ~6.6GB VRAM | Very Good |
12
-
13
- ## How to Use Pre-Quantized Models
14
-
15
- ### 1. Loading 4-bit Model
16
-
17
- ```python
18
- import torch
- from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
19
- from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor
20
-
21
- # Load pre-quantized 4-bit model
22
- model_path = "/path/to/VibeVoice-Large-4bit"
23
- processor = VibeVoiceProcessor.from_pretrained(model_path)
24
- model = VibeVoiceForConditionalGenerationInference.from_pretrained(
25
- model_path,
26
- device_map='cuda',
27
- torch_dtype=torch.bfloat16,
28
- )
29
- ```
30
-
31
- ### 2. Loading 8-bit Model
32
-
33
- ```python
34
- # Same code, just point to 8-bit model
35
- model_path = "/path/to/VibeVoice-Large-8bit"
36
- # ... rest is the same
37
- ```
38
-
39
- ## Creating Your Own Quantized Models
40
-
41
- Use the provided script to quantize models:
42
-
43
- ```bash
44
- # 4-bit quantization (nf4)
45
- python quantize_and_save_vibevoice.py \
46
- --model_path /path/to/original/model \
47
- --output_dir /path/to/output/4bit \
48
- --bits 4 \
49
- --test
50
-
51
- # 8-bit quantization
52
- python quantize_and_save_vibevoice.py \
53
- --model_path /path/to/original/model \
54
- --output_dir /path/to/output/8bit \
55
- --bits 8 \
56
- --test
57
- ```
58
-
59
- ## Benefits
60
-
61
- 1. **Pre-quantized models load faster** - No on-the-fly quantization needed
62
- 2. **Lower VRAM requirements** - 4-bit uses only ~6.6GB vs 18GB
63
- 3. **Shareable** - Upload the quantized folder to share with others
64
- 4. **Quality preserved** - nf4 quantization maintains excellent output quality
65
-
66
- ## Distribution
67
-
68
- To share quantized models (an upload sketch follows this list):
69
-
70
- 1. Upload the entire quantized model directory (e.g., `VibeVoice-Large-4bit/`)
71
- 2. Include the `quantization_config.json` file (automatically created)
72
- 3. Users can load directly without any quantization setup
73
-
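A minimal upload sketch, assuming the `huggingface_hub` client is installed; the repository id and local folder path below are placeholders, not the actual published repo:

```python
# Sketch only: upload a quantized model directory to the Hugging Face Hub.
# The repo id and folder path are placeholders.
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("your-username/VibeVoice-Large-4bit", exist_ok=True)
api.upload_folder(
    folder_path="/path/to/VibeVoice-Large-4bit",
    repo_id="your-username/VibeVoice-Large-4bit",
)
```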
74
- ## Performance Notes
75
-
76
- - 4-bit (nf4): Best for memory-constrained systems, minimal quality loss
77
- - 8-bit: Better quality than 4-bit, still significant memory savings
78
- - Both versions maintain the same generation speed as the original
79
- - Flash Attention 2 is supported in all quantized versions (see the loading sketch below)
80
-
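A loading sketch with Flash Attention 2 enabled; this assumes `flash-attn` is installed and that the model class accepts the standard transformers `attn_implementation` argument:

```python
# Sketch: load the 4-bit model with Flash Attention 2 enabled
# (assumes flash-attn is installed and the standard transformers argument applies here).
import torch
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference

model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    "/path/to/VibeVoice-Large-4bit",
    device_map="cuda",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```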
81
- ## Troubleshooting
82
-
83
- If loading fails, work through these checks (a quick preflight sketch follows this list):
84
- 1. Ensure you have `bitsandbytes` installed: `pip install bitsandbytes`
85
- 2. Make sure you're on a CUDA-capable GPU
86
- 3. Check that all model files are present in the directory
87
-
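A quick preflight sketch covering the three checks above; the model directory path is a placeholder:

```python
# Sketch: verify bitsandbytes, CUDA, and model files before loading (path is a placeholder).
import importlib.util
from pathlib import Path

import torch

model_dir = Path("/path/to/VibeVoice-Large-4bit")
assert importlib.util.find_spec("bitsandbytes") is not None, "pip install bitsandbytes"
assert torch.cuda.is_available(), "A CUDA-capable GPU is required"
assert (model_dir / "config.json").exists() and list(model_dir.glob("*.safetensors*")), \
    "Model files are missing from the directory"
print("Preflight checks passed")
```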
88
- ## Files Created
89
-
90
- Each quantized model directory contains:
91
- - `model.safetensors.*` - Quantized model weights
92
- - `config.json` - Model configuration with quantization settings
93
- - `quantization_config.json` - Specific quantization parameters
94
- - `processor/` - Audio processor files
95
- - `load_quantized_Xbit.py` - Example loading script
 
8bit/README.md DELETED
@@ -1,23 +0,0 @@
1
- # VibeVoice 7B - 8-bit Quantized
2
-
3
- Better quality than the 4-bit version, with moderate VRAM requirements.
4
-
5
- ## Specifications
6
- - Quantization: 8-bit (int8)
7
- - Model size: 9.9 GB
8
- - VRAM usage: ~12 GB
9
- - Quality: Excellent (minimal degradation)
10
-
11
- ## Usage
12
-
13
- ```python
14
- import torch
- from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
15
- from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor
16
-
17
- model = VibeVoiceForConditionalGenerationInference.from_pretrained(
18
- "Dannidee/VibeVoice7b-low-vram/8bit",
19
- device_map='cuda',
20
- torch_dtype=torch.bfloat16,
21
- )
22
- processor = VibeVoiceProcessor.from_pretrained("Dannidee/VibeVoice7b-low-vram/8bit")
23
- ```
 
8bit/config.json DELETED
@@ -1,132 +0,0 @@
1
- {
2
- "acoustic_vae_dim": 64,
3
- "acoustic_tokenizer_config": {
4
- "causal": true,
5
- "channels": 1,
6
- "conv_bias": true,
7
- "conv_norm": "none",
8
- "corpus_normalize": 0.0,
9
- "decoder_depths": null,
10
- "decoder_n_filters": 32,
11
- "decoder_ratios": [
12
- 8,
13
- 5,
14
- 5,
15
- 4,
16
- 2,
17
- 2
18
- ],
19
- "disable_last_norm": true,
20
- "encoder_depths": "3-3-3-3-3-3-8",
21
- "encoder_n_filters": 32,
22
- "encoder_ratios": [
23
- 8,
24
- 5,
25
- 5,
26
- 4,
27
- 2,
28
- 2
29
- ],
30
- "fix_std": 0.5,
31
- "layer_scale_init_value": 1e-06,
32
- "layernorm": "RMSNorm",
33
- "layernorm_elementwise_affine": true,
34
- "layernorm_eps": 1e-05,
35
- "mixer_layer": "depthwise_conv",
36
- "model_type": "vibevoice_acoustic_tokenizer",
37
- "pad_mode": "constant",
38
- "std_dist_type": "gaussian",
39
- "vae_dim": 64,
40
- "weight_init_value": 0.01
41
- },
42
- "architectures": [
43
- "VibeVoiceForConditionalGeneration"
44
- ],
45
- "decoder_config": {
46
- "attention_dropout": 0.0,
47
- "hidden_act": "silu",
48
- "hidden_size": 3584,
49
- "initializer_range": 0.02,
50
- "intermediate_size": 18944,
51
- "max_position_embeddings": 32768,
52
- "max_window_layers": 28,
53
- "model_type": "qwen2",
54
- "num_attention_heads": 28,
55
- "num_hidden_layers": 28,
56
- "num_key_value_heads": 4,
57
- "rms_norm_eps": 1e-06,
58
- "rope_scaling": null,
59
- "rope_theta": 1000000.0,
60
- "sliding_window": null,
61
- "torch_dtype": "bfloat16",
62
- "use_cache": true,
63
- "use_mrope": false,
64
- "use_sliding_window": false,
65
- "vocab_size": 152064
66
- },
67
- "diffusion_head_config": {
68
- "ddpm_batch_mul": 4,
69
- "ddpm_beta_schedule": "cosine",
70
- "ddpm_num_inference_steps": 20,
71
- "ddpm_num_steps": 1000,
72
- "diffusion_type": "ddpm",
73
- "head_ffn_ratio": 3.0,
74
- "head_layers": 4,
75
- "hidden_size": 3584,
76
- "latent_size": 64,
77
- "model_type": "vibevoice_diffusion_head",
78
- "prediction_type": "v_prediction",
79
- "rms_norm_eps": 1e-05,
80
- "speech_vae_dim": 64
81
- },
82
- "model_type": "vibevoice",
83
- "semantic_tokenizer_config": {
84
- "causal": true,
85
- "channels": 1,
86
- "conv_bias": true,
87
- "conv_norm": "none",
88
- "corpus_normalize": 0.0,
89
- "disable_last_norm": true,
90
- "encoder_depths": "3-3-3-3-3-3-8",
91
- "encoder_n_filters": 32,
92
- "encoder_ratios": [
93
- 8,
94
- 5,
95
- 5,
96
- 4,
97
- 2,
98
- 2
99
- ],
100
- "fix_std": 0,
101
- "layer_scale_init_value": 1e-06,
102
- "layernorm": "RMSNorm",
103
- "layernorm_elementwise_affine": true,
104
- "layernorm_eps": 1e-05,
105
- "mixer_layer": "depthwise_conv",
106
- "model_type": "vibevoice_semantic_tokenizer",
107
- "pad_mode": "constant",
108
- "std_dist_type": "none",
109
- "vae_dim": 128,
110
- "weight_init_value": 0.01
111
- },
112
- "semantic_vae_dim": 128,
113
- "tie_word_embeddings": false,
114
- "torch_dtype": "bfloat16",
115
- "transformers_version": "4.51.3",
116
- "quantization_config": {
117
- "quant_method": "bitsandbytes",
118
- "_load_in_8bit": true,
119
- "_load_in_4bit": false,
120
- "llm_int8_threshold": 6.0,
121
- "llm_int8_skip_modules": null,
122
- "llm_int8_enable_fp32_cpu_offload": false,
123
- "llm_int8_has_fp16_weight": false,
124
- "bnb_4bit_quant_type": "fp4",
125
- "bnb_4bit_use_double_quant": false,
126
- "bnb_4bit_compute_dtype": "float32",
127
- "bnb_4bit_quant_storage": "uint8",
128
- "load_in_4bit": false,
129
- "load_in_8bit": true
130
- },
131
- "_quantization_method": "bitsandbytes"
132
- }
 
8bit/generation_config.json DELETED
@@ -1,4 +0,0 @@
1
- {
2
- "_from_model_config": true,
3
- "transformers_version": "4.51.3"
4
- }
 
8bit/load_quantized_8bit.py DELETED
@@ -1,60 +0,0 @@
1
- #!/usr/bin/env python
2
- """
3
- Load and use the 8-bit quantized VibeVoice model
4
- """
5
-
6
- import torch
7
- from transformers import BitsAndBytesConfig
8
- from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
9
- from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor
10
-
11
- def load_quantized_model(model_path="/home/deveraux/Desktop/vibevoice/VibeVoice-Large-8bit"):
12
- """Load the pre-quantized VibeVoice model"""
13
-
14
- print("Loading 8-bit quantized VibeVoice model...")
15
-
16
- # The model is already quantized, but we need to specify the config
17
- # to ensure proper loading of quantized weights
18
- bnb_config = BitsAndBytesConfig(
19
- load_in_8bit=True,
20
- bnb_8bit_compute_dtype=torch.bfloat16,
- )
24
-
25
- # Load processor
26
- processor = VibeVoiceProcessor.from_pretrained(model_path)
27
-
28
- # Load model
29
- model = VibeVoiceForConditionalGenerationInference.from_pretrained(
30
- model_path,
31
- quantization_config=bnb_config,
32
- device_map='cuda',
33
- torch_dtype=torch.bfloat16,
34
- )
35
-
36
- model.eval()
37
-
38
- print("✅ Model loaded successfully!")
39
- print(f"💾 Memory usage: {torch.cuda.memory_allocated() / 1e9:.1f} GB")
40
-
41
- return model, processor
42
-
43
- # Example usage
44
- if __name__ == "__main__":
45
- model, processor = load_quantized_model()
46
-
47
- # Generate audio
48
- text = "Speaker 1: Hello! Speaker 2: Hi there!"
49
- inputs = processor(
50
- text=[text],
51
- voice_samples=[["path/to/voice1.wav", "path/to/voice2.wav"]],
52
- padding=True,
53
- return_tensors="pt",
54
- )
55
-
56
- with torch.no_grad():
57
- outputs = model.generate(**inputs)
58
-
59
- # Save audio
60
- processor.save_audio(outputs.speech_outputs[0], "output.wav")
 
8bit/minimal_memory_output.wav DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:c3cf133304229512e369c0b4db51c7d8ebbab43dd8c7945b5bf8e9b727185893
3
- size 313644
 
8bit/model-00001-of-00003.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:68f98075dac463766219e6e61ff5fe9ab969f8fea621a65906f1d6793f2eaf72
3
- size 4987685394
 
8bit/model-00002-of-00003.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:48940fb59366de226af5df46020f022d4d651f4563f190142c175b5bf733e9c7
3
- size 4489976774
 
8bit/model-00003-of-00003.safetensors DELETED
@@ -1,3 +0,0 @@
1
- version https://git-lfs.github.com/spec/v1
2
- oid sha256:d83c0514c0c9d2675cb4d51ee56b12515ea45770ce35acc5ab0ec4bc7d1bef73
3
- size 1089994880
 
8bit/model.safetensors.index.json DELETED
The diff for this file is too large to render.
 
8bit/preprocessor_config.json DELETED
@@ -1,12 +0,0 @@
1
- {
2
- "processor_class": "VibeVoiceProcessor",
3
- "speech_tok_compress_ratio": 3200,
4
- "db_normalize": true,
5
- "audio_processor": {
6
- "feature_extractor_type": "VibeVoiceTokenizerProcessor",
7
- "sampling_rate": 24000,
8
- "normalize_audio": true,
9
- "target_dB_FS": -25,
10
- "eps": 1e-06
11
- }
12
- }
 
8bit/quantization_config.json DELETED
@@ -1,20 +0,0 @@
1
- {
2
- "quantization_config": {
3
- "quant_method": "bitsandbytes",
4
- "_load_in_8bit": true,
5
- "_load_in_4bit": false,
6
- "llm_int8_threshold": 6.0,
7
- "llm_int8_skip_modules": null,
8
- "llm_int8_enable_fp32_cpu_offload": false,
9
- "llm_int8_has_fp16_weight": false,
10
- "bnb_4bit_quant_type": "fp4",
11
- "bnb_4bit_use_double_quant": false,
12
- "bnb_4bit_compute_dtype": "float32",
13
- "bnb_4bit_quant_storage": "uint8",
14
- "load_in_4bit": false,
15
- "load_in_8bit": true
16
- },
17
- "quantization_method": "bitsandbytes",
18
- "bits": 8,
19
- "quant_type": "nf4"
20
- }
 
8bit/quantize_and_save_vibevoice.py DELETED
@@ -1,330 +0,0 @@
1
- #!/usr/bin/env python
2
- """
3
- Quantize and save VibeVoice model using bitsandbytes
4
- Creates a pre-quantized model that can be shared and loaded directly
5
- """
6
-
7
- import os
8
- import json
9
- import shutil
10
- import torch
11
- from pathlib import Path
12
- from transformers import BitsAndBytesConfig
13
- from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
14
- from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor
15
- from transformers.utils import logging
16
- from safetensors.torch import save_file
17
-
18
- logging.set_verbosity_info()
19
-
20
- def quantize_and_save_model(
21
- model_path: str,
22
- output_dir: str,
23
- bits: int = 4,
24
- quant_type: str = "nf4"
25
- ):
26
- """Quantize VibeVoice model and save it for distribution"""
27
-
28
- print(f"\n{'='*70}")
29
- print(f"VIBEVOICE QUANTIZATION - {bits}-bit ({quant_type})")
30
- print(f"{'='*70}")
31
- print(f"Source: {model_path}")
32
- print(f"Output: {output_dir}")
33
- print(f"{'='*70}\n")
34
-
35
- # Create output directory
36
- output_path = Path(output_dir)
37
- output_path.mkdir(parents=True, exist_ok=True)
38
-
39
- # Configure quantization
40
- if bits == 4:
41
- bnb_config = BitsAndBytesConfig(
42
- load_in_4bit=True,
43
- bnb_4bit_compute_dtype=torch.bfloat16,
44
- bnb_4bit_use_double_quant=True,
45
- bnb_4bit_quant_type=quant_type
46
- )
47
- elif bits == 8:
48
- bnb_config = BitsAndBytesConfig(
49
- load_in_8bit=True,
50
- bnb_8bit_compute_dtype=torch.bfloat16,
51
- )
52
- else:
53
- raise ValueError(f"Unsupported bit width: {bits}")
54
-
55
- print("🔧 Loading and quantizing model...")
56
-
57
- # Load the model with quantization
58
- model = VibeVoiceForConditionalGenerationInference.from_pretrained(
59
- model_path,
60
- quantization_config=bnb_config,
61
- device_map='cuda',
62
- torch_dtype=torch.bfloat16,
63
- )
64
-
65
- # Get memory usage
66
- memory_gb = torch.cuda.memory_allocated() / 1e9
67
- print(f"💾 Quantized model memory usage: {memory_gb:.1f} GB")
68
-
69
- # Save the quantized model
70
- print("\n📦 Saving quantized model...")
71
-
72
- # Method 1: Try using save_pretrained with quantization info
73
- try:
74
- # Save model with quantization config
75
- model.save_pretrained(
76
- output_path,
77
- safe_serialization=True,
78
- max_shard_size="5GB"
79
- )
80
-
81
- # Save the quantization config separately
82
- quant_config_dict = {
83
- "quantization_config": bnb_config.to_dict(),
84
- "quantization_method": "bitsandbytes",
85
- "bits": bits,
86
- "quant_type": quant_type
87
- }
88
-
89
- with open(output_path / "quantization_config.json", 'w') as f:
90
- json.dump(quant_config_dict, f, indent=2)
91
-
92
- print("✅ Model saved with integrated quantization")
93
-
94
- except Exception as e:
95
- print(f"⚠️ Standard save failed: {e}")
96
- print("Trying alternative save method...")
97
-
98
- # Method 2: Save state dict with quantized weights
99
- save_quantized_state_dict(model, output_path, bnb_config)
100
-
101
- # Copy processor files
102
- print("\n📋 Copying processor files...")
103
- processor = VibeVoiceProcessor.from_pretrained(model_path)
104
- processor.save_pretrained(output_path)
105
-
106
- # Copy additional config files
107
- for file in ["config.json", "generation_config.json"]:
108
- src = Path(model_path) / file
109
- if src.exists():
110
- shutil.copy2(src, output_path / file)
111
-
112
- # Update config to indicate quantization
113
- config_path = output_path / "config.json"
114
- if config_path.exists():
115
- with open(config_path, 'r') as f:
116
- config = json.load(f)
117
-
118
- config["quantization_config"] = bnb_config.to_dict()
119
- config["_quantization_method"] = "bitsandbytes"
120
-
121
- with open(config_path, 'w') as f:
122
- json.dump(config, f, indent=2)
123
-
124
- print(f"\n✅ Quantized model saved to: {output_path}")
125
-
126
- # Create loading script
127
- create_loading_script(output_path, bits, quant_type)
128
-
129
- return output_path
130
-
131
- def save_quantized_state_dict(model, output_path, bnb_config):
132
- """Alternative method to save quantized weights"""
133
- print("\n🔧 Saving quantized state dict...")
134
-
135
- # Get the state dict
136
- state_dict = model.state_dict()
137
-
138
- # Separate quantized and non-quantized parameters
139
- quantized_state = {}
140
- metadata = {
141
- "quantized_modules": [],
142
- "quantization_config": bnb_config.to_dict()
143
- }
144
-
145
- for name, param in state_dict.items():
146
- # Check if this is a quantized parameter
147
- if hasattr(param, 'quant_state'):
148
- # Store quantization state
149
- metadata["quantized_modules"].append(name)
150
- quantized_state[name] = param.data
151
- else:
152
- # Regular parameter
153
- quantized_state[name] = param
154
-
155
- # Save using safetensors
156
- save_file(quantized_state, output_path / "model.safetensors", metadata=metadata)
157
-
158
- # Save metadata
159
- with open(output_path / "quantization_metadata.json", 'w') as f:
160
- json.dump(metadata, f, indent=2)
161
-
162
- def create_loading_script(output_path, bits, quant_type):
163
- """Create a script to load the quantized model"""
164
-
165
- script_content = f'''#!/usr/bin/env python
166
- """
167
- Load and use the {bits}-bit quantized VibeVoice model
168
- """
169
-
170
- import torch
171
- from transformers import BitsAndBytesConfig
172
- from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
173
- from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor
174
-
175
- def load_quantized_model(model_path="{output_path}"):
176
- """Load the pre-quantized VibeVoice model"""
177
-
178
- print("Loading {bits}-bit quantized VibeVoice model...")
179
-
180
- # The model is already quantized, but we need to specify the config
181
- # to ensure proper loading of quantized weights
182
- bnb_config = BitsAndBytesConfig(
183
- load_in_{bits}bit=True,
184
- bnb_{bits}bit_compute_dtype=torch.bfloat16,
185
- {"bnb_4bit_use_double_quant=True," if bits == 4 else ""}
186
- {"bnb_4bit_quant_type='" + quant_type + "'" if bits == 4 else ""}
187
- )
188
-
189
- # Load processor
190
- processor = VibeVoiceProcessor.from_pretrained(model_path)
191
-
192
- # Load model
193
- model = VibeVoiceForConditionalGenerationInference.from_pretrained(
194
- model_path,
195
- quantization_config=bnb_config,
196
- device_map='cuda',
197
- torch_dtype=torch.bfloat16,
198
- )
199
-
200
- model.eval()
201
-
202
- print("✅ Model loaded successfully!")
203
- print(f"💾 Memory usage: {{torch.cuda.memory_allocated() / 1e9:.1f}} GB")
204
-
205
- return model, processor
206
-
207
- # Example usage
208
- if __name__ == "__main__":
209
- model, processor = load_quantized_model()
210
-
211
- # Generate audio
212
- text = "Speaker 1: Hello! Speaker 2: Hi there!"
213
- inputs = processor(
214
- text=[text],
215
- voice_samples=[["path/to/voice1.wav", "path/to/voice2.wav"]],
216
- padding=True,
217
- return_tensors="pt",
218
- )
219
-
220
- with torch.no_grad():
221
- outputs = model.generate(**inputs)
222
-
223
- # Save audio
224
- processor.save_audio(outputs.speech_outputs[0], "output.wav")
225
- '''
226
-
227
- script_path = output_path / f"load_quantized_{bits}bit.py"
228
- with open(script_path, 'w') as f:
229
- f.write(script_content)
230
-
231
- print(f"📝 Created loading script: {script_path}")
232
-
233
- def test_quantized_model(model_path):
234
- """Test loading and generating with the quantized model"""
235
- print(f"\n🧪 Testing quantized model from: {model_path}")
236
-
237
- try:
238
- # Load the quantized model
239
- processor = VibeVoiceProcessor.from_pretrained(model_path)
240
-
241
- # Load with auto-detection of quantization
242
- model = VibeVoiceForConditionalGenerationInference.from_pretrained(
243
- model_path,
244
- device_map='cuda',
245
- torch_dtype=torch.bfloat16,
246
- )
247
-
248
- print("✅ Model loaded successfully!")
249
-
250
- # Quick generation test
251
- test_text = "Speaker 1: Testing quantized model. Speaker 2: It works!"
252
- print(f"\n🎤 Testing generation with: '{test_text}'")
253
-
254
- # Use demo voices
255
- voices_dir = "/home/deveraux/Desktop/vibevoice/VibeVoice-main/demo/voices"
256
- speaker_voices = [
257
- os.path.join(voices_dir, "en-Alice_woman.wav"),
258
- os.path.join(voices_dir, "en-Carter_man.wav")
259
- ]
260
-
261
- inputs = processor(
262
- text=[test_text],
263
- voice_samples=[speaker_voices],
264
- padding=True,
265
- return_tensors="pt",
266
- return_attention_mask=True,
267
- )
268
-
269
- with torch.no_grad():
270
- outputs = model.generate(
271
- **inputs,
272
- max_new_tokens=None,
273
- cfg_scale=1.3,
274
- tokenizer=processor.tokenizer,
275
- generation_config={'do_sample': False},
276
- )
277
-
278
- print("✅ Generation successful!")
279
-
280
- # Save test output
281
- output_path = Path(model_path) / "test_output.wav"
282
- processor.save_audio(outputs.speech_outputs[0], output_path=str(output_path))
283
- print(f"🔊 Test audio saved to: {output_path}")
284
-
285
- return True
286
-
287
- except Exception as e:
288
- print(f"❌ Test failed: {e}")
289
- return False
290
-
291
- def main():
292
- import argparse
293
- parser = argparse.ArgumentParser(description="Quantize and save VibeVoice model")
294
- parser.add_argument("--model_path", default="/home/deveraux/Desktop/vibevoice/VibeVoice-Large-pt",
295
- help="Path to the original model")
296
- parser.add_argument("--output_dir", default="/home/deveraux/Desktop/vibevoice/VibeVoice-Large-4bit",
297
- help="Output directory for quantized model")
298
- parser.add_argument("--bits", type=int, default=4, choices=[4, 8],
299
- help="Quantization bits (4 or 8)")
300
- parser.add_argument("--quant_type", default="nf4", choices=["nf4", "fp4"],
301
- help="4-bit quantization type")
302
- parser.add_argument("--test", action="store_true",
303
- help="Test the quantized model after saving")
304
-
305
- args = parser.parse_args()
306
-
307
- # Update output dir based on bits
308
- if str(args.bits) not in args.output_dir:
309
- args.output_dir = args.output_dir.replace("4bit", f"{args.bits}bit")
310
-
311
- # Quantize and save
312
- output_path = quantize_and_save_model(
313
- args.model_path,
314
- args.output_dir,
315
- args.bits,
316
- args.quant_type
317
- )
318
-
319
- # Test if requested
320
- if args.test:
321
- test_quantized_model(output_path)
322
-
323
- print(f"\n🎉 Done! Quantized model ready for distribution at: {output_path}")
324
- print(f"\n📦 To share this model:")
325
- print(f"1. Upload the entire '{output_path}' directory")
326
- print(f"2. Users can load it with the provided script or directly with transformers")
327
- print(f"3. The model will load in {args.bits}-bit without additional quantization")
328
-
329
- if __name__ == "__main__":
330
- main()
 
8bit/test_accurate_vram.py DELETED
@@ -1,207 +0,0 @@
1
- #!/usr/bin/env python
2
- """
3
- Accurate VRAM measurement for VibeVoice models
4
- Shows the difference between allocated vs reserved memory
5
- """
6
-
7
- import os
8
- import gc
9
- import torch
10
- import subprocess
11
- import time
12
- from pathlib import Path
13
- from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
14
- from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor
15
-
16
- def get_gpu_memory_info():
17
- """Get detailed GPU memory information"""
18
- if not torch.cuda.is_available():
19
- return {}
20
-
21
- # PyTorch memory stats
22
- allocated = torch.cuda.memory_allocated() / 1e9
23
- reserved = torch.cuda.memory_reserved() / 1e9
24
-
25
- # Get nvidia-smi info
26
- try:
27
- result = subprocess.run([
28
- 'nvidia-smi',
29
- '--query-gpu=memory.used,memory.total',
30
- '--format=csv,nounits,noheader'
31
- ], capture_output=True, text=True)
32
-
33
- if result.returncode == 0:
34
- used, total = map(int, result.stdout.strip().split(','))
35
- nvidia_used_gb = used / 1024 # Convert MB to GB
36
- nvidia_total_gb = total / 1024
37
- else:
38
- nvidia_used_gb = 0
39
- nvidia_total_gb = 0
40
- except:
41
- nvidia_used_gb = 0
42
- nvidia_total_gb = 0
43
-
44
- return {
45
- 'allocated': allocated,
46
- 'reserved': reserved,
47
- 'nvidia_smi': nvidia_used_gb,
48
- 'nvidia_total': nvidia_total_gb
49
- }
50
-
51
- def print_memory_report(label, before, after):
52
- """Print detailed memory usage report"""
53
- print(f"\n{label}:")
54
- print(f" PyTorch Allocated: {before['allocated']:.2f} GB → {after['allocated']:.2f} GB "
55
- f"(+{after['allocated'] - before['allocated']:.2f} GB)")
56
- print(f" PyTorch Reserved: {before['reserved']:.2f} GB → {after['reserved']:.2f} GB "
57
- f"(+{after['reserved'] - before['reserved']:.2f} GB)")
58
- print(f" nvidia-smi Total: {before['nvidia_smi']:.2f} GB → {after['nvidia_smi']:.2f} GB "
59
- f"(+{after['nvidia_smi'] - before['nvidia_smi']:.2f} GB)")
60
- print(f" Memory Overhead: {after['reserved'] - after['allocated']:.2f} GB (PyTorch cache)")
61
-
62
- def clear_gpu_memory():
63
- """Aggressively clear GPU memory"""
64
- gc.collect()
65
- if torch.cuda.is_available():
66
- torch.cuda.empty_cache()
67
- torch.cuda.synchronize()
68
- # Force memory pool cleanup
69
- torch.cuda.reset_peak_memory_stats()
70
-
71
- def test_model_memory(model_path, model_name):
72
- """Test model with detailed memory tracking"""
73
- print(f"\n{'='*70}")
74
- print(f"Testing {model_name}")
75
- print(f"{'='*70}")
76
-
77
- # Clear memory and get baseline
78
- clear_gpu_memory()
79
- time.sleep(2) # Let memory settle
80
-
81
- baseline = get_gpu_memory_info()
82
- print(f"\nBaseline GPU Memory:")
83
- print(f" PyTorch Allocated: {baseline['allocated']:.2f} GB")
84
- print(f" PyTorch Reserved: {baseline['reserved']:.2f} GB")
85
- print(f" nvidia-smi Shows: {baseline['nvidia_smi']:.2f} GB / {baseline['nvidia_total']:.2f} GB")
86
-
87
- # Load model
88
- print(f"\nLoading {model_name}...")
89
- load_start = time.time()
90
-
91
- processor = VibeVoiceProcessor.from_pretrained(model_path)
92
- model = VibeVoiceForConditionalGenerationInference.from_pretrained(
93
- model_path,
94
- device_map='cuda',
95
- torch_dtype=torch.bfloat16,
96
- )
97
- model.eval()
98
-
99
- load_time = time.time() - load_start
100
-
101
- # Get memory after loading
102
- loaded = get_gpu_memory_info()
103
- print_memory_report("After Model Loading", baseline, loaded)
104
-
105
- # Test generation to see peak usage
106
- print(f"\nTesting generation...")
107
- test_text = "Speaker 1: Testing memory usage. Speaker 2: Let's see the results!"
108
- voices_dir = "/home/deveraux/Desktop/vibevoice/VibeVoice-main/demo/voices"
109
- speaker_voices = [
110
- os.path.join(voices_dir, "en-Alice_woman.wav"),
111
- os.path.join(voices_dir, "en-Carter_man.wav")
112
- ]
113
-
114
- inputs = processor(
115
- text=[test_text],
116
- voice_samples=[speaker_voices],
117
- padding=True,
118
- return_tensors="pt",
119
- return_attention_mask=True,
120
- )
121
-
122
- # Monitor during generation
123
- pre_gen = get_gpu_memory_info()
124
-
125
- with torch.no_grad():
126
- outputs = model.generate(
127
- **inputs,
128
- max_new_tokens=None,
129
- cfg_scale=1.3,
130
- tokenizer=processor.tokenizer,
131
- generation_config={'do_sample': False},
132
- )
133
-
134
- post_gen = get_gpu_memory_info()
135
- print_memory_report("During Generation", pre_gen, post_gen)
136
-
137
- # Peak memory stats
138
- if torch.cuda.is_available():
139
- peak_memory = torch.cuda.max_memory_allocated() / 1e9
140
- peak_reserved = torch.cuda.max_memory_reserved() / 1e9
141
- print(f"\nPeak Memory Usage:")
142
- print(f" Peak Allocated: {peak_memory:.2f} GB")
143
- print(f" Peak Reserved: {peak_reserved:.2f} GB")
144
-
145
- # Clean up
146
- del model
147
- del processor
148
- clear_gpu_memory()
149
-
150
- return {
151
- 'name': model_name,
152
- 'allocated': loaded['allocated'] - baseline['allocated'],
153
- 'reserved': loaded['reserved'] - baseline['reserved'],
154
- 'nvidia_smi': loaded['nvidia_smi'] - baseline['nvidia_smi'],
155
- 'peak_allocated': peak_memory,
156
- 'peak_reserved': peak_reserved
157
- }
158
-
159
- def main():
160
- print("="*70)
161
- print("ACCURATE VRAM MEASUREMENT FOR VIBEVOICE")
162
- print("="*70)
163
- print("\nNote: PyTorch reserves extra memory for efficiency.")
164
- print("nvidia-smi shows total reserved memory, not just allocated.")
165
-
166
- models = [
167
- {
168
- "path": "/home/deveraux/Desktop/vibevoice/VibeVoice-Large-pt",
169
- "name": "16-bit Original"
170
- },
171
- {
172
- "path": "/home/deveraux/Desktop/vibevoice/VibeVoice-Large-4bit",
173
- "name": "4-bit Quantized"
174
- }
175
- ]
176
-
177
- results = []
178
- for model_info in models:
179
- try:
180
- result = test_model_memory(model_info["path"], model_info["name"])
181
- results.append(result)
182
- time.sleep(5)
183
- except Exception as e:
184
- print(f"Error testing {model_info['name']}: {e}")
185
-
186
- # Summary
187
- print("\n" + "="*70)
188
- print("MEMORY USAGE SUMMARY")
189
- print("="*70)
190
- print(f"\n{'Model':<20} {'Allocated':<12} {'Reserved':<12} {'nvidia-smi':<12} {'Peak':<12}")
191
- print("-"*70)
192
-
193
- for r in results:
194
- print(f"{r['name']:<20} "
195
- f"{r['allocated']:<12.2f} "
196
- f"{r['reserved']:<12.2f} "
197
- f"{r['nvidia_smi']:<12.2f} "
198
- f"{r['peak_allocated']:<12.2f}")
199
-
200
- print("\n💡 Key Insights:")
201
- print("- 'Allocated' = Actual model weights in memory")
202
- print("- 'Reserved' = Total GPU memory reserved by PyTorch (includes cache)")
203
- print("- 'nvidia-smi' = What nvidia-smi reports (includes all overhead)")
204
- print("- The difference is PyTorch's memory pool for efficiency")
205
-
206
- if __name__ == "__main__":
207
- main()
 
8bit/use_quantized_model.py DELETED
@@ -1,70 +0,0 @@
1
- #!/usr/bin/env python
2
- """
3
- Simple example of using the pre-quantized VibeVoice model
4
- No need for on-the-fly quantization - loads much faster!
5
- """
6
-
7
- import os
8
- import torch
9
- from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
10
- from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor
11
-
12
- def main():
13
- # Path to the pre-quantized model
14
- model_path = "/home/deveraux/Desktop/vibevoice/VibeVoice-Large-4bit"
15
-
16
- print("Loading pre-quantized VibeVoice 4-bit model...")
17
-
18
- # Load processor
19
- processor = VibeVoiceProcessor.from_pretrained(model_path)
20
-
21
- # Load the pre-quantized model
22
- # The quantization config is already saved in the model
23
- model = VibeVoiceForConditionalGenerationInference.from_pretrained(
24
- model_path,
25
- device_map='cuda',
26
- torch_dtype=torch.bfloat16,
27
- )
28
- model.eval()
29
-
30
- # Check memory usage
31
- memory_gb = torch.cuda.memory_allocated() / 1e9
32
- print(f"✅ Model loaded! Memory usage: {memory_gb:.1f} GB")
33
-
34
- # Example generation
35
- text = "Speaker 1: Welcome to our podcast! Speaker 2: Thanks for having me!"
36
-
37
- # Voice samples (using demo voices)
38
- voices_dir = "/home/deveraux/Desktop/vibevoice/VibeVoice-main/demo/voices"
39
- speaker_voices = [
40
- os.path.join(voices_dir, "en-Alice_woman.wav"),
41
- os.path.join(voices_dir, "en-Carter_man.wav")
42
- ]
43
-
44
- # Process inputs
45
- inputs = processor(
46
- text=[text],
47
- voice_samples=[speaker_voices],
48
- padding=True,
49
- return_tensors="pt",
50
- return_attention_mask=True,
51
- )
52
-
53
- # Generate
54
- print(f"\nGenerating: '{text}'")
55
- with torch.no_grad():
56
- outputs = model.generate(
57
- **inputs,
58
- max_new_tokens=None,
59
- cfg_scale=1.3,
60
- tokenizer=processor.tokenizer,
61
- generation_config={'do_sample': False},
62
- )
63
-
64
- # Save output
65
- output_path = "quantized_output.wav"
66
- processor.save_audio(outputs.speech_outputs[0], output_path=output_path)
67
- print(f"✅ Audio saved to: {output_path}")
68
-
69
- if __name__ == "__main__":
70
- main()