Random sounds and music
I am getting random sounds and music with most generations, even with Speakers that don't have the bgm tag. Are others getting this?
Hi,
@jujutechnology
Thanks for the feedback.
The bgm or sounds are spontaneous, i.e., we can't control whether the model generates them.
But we have some findings:
- If the voice prompt contains bgm, the generated speech may also contain bgm. (The 7B model handles this more easily; see and use the demo on our code page.)
- If the voice prompt is clean (no bgm) but the input text contains introductory words such as "Welcome to", "Hello", ..., "However/But", the generated speech may still contain bgm.
- Otherwise, the 1.5B model produces bgm with medium probability (I'm not sure; it depends on the text). The 7B model is more stable in this case (lower probability).
We haven't optimized the model for short utterances, so one- or two-sentence inputs may not be stable (clean). You can give it a try.
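As a rough illustration of the findings above, the reported risk factors can be checked on an input before generating. This is only a sketch: `bgm_risk_flags`, the word list, and the three-sentence threshold are hypothetical choices made for the example, not anything provided by VibeVoice, and a flagged input may still come out clean.

```python
import re
from typing import List

# Illustrative assumptions: neither the word list nor the sentence threshold
# comes from the VibeVoice authors.
INTRO_PATTERNS = ("welcome to", "hello", "however", "but")
MIN_SENTENCES = 3


def bgm_risk_flags(text: str, prompt_has_bgm: bool = False) -> List[str]:
    """Return the reported bgm risk factors that apply to this input."""
    flags = []
    if prompt_has_bgm:
        flags.append("voice prompt contains bgm")

    lowered = text.lower()
    # Whole-word match, so "butter" does not trigger "but".
    hits = [p for p in INTRO_PATTERNS
            if re.search(rf"\b{re.escape(p)}\b", lowered)]
    if hits:
        flags.append("introductory words: " + ", ".join(hits))

    # Rough sentence count: split on ., ! or ? followed by whitespace or end.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text.strip()) if s]
    if len(sentences) < MIN_SENTENCES:
        flags.append(f"short input ({len(sentences)} sentence(s))")
    return flags


if __name__ == "__main__":
    script = "Hello and welcome to the show. Today we talk about butter."
    for flag in bgm_risk_flags(script):
        print("risk:", flag)
```

A check like this only tells you when to expect trouble (for example, to listen to the result more carefully or to regenerate); it does not prevent the model from producing bgm.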
I ran into this too. Sometimes it just happens, but most of the time it is OK. I think random music is more likely when there are many speakers; a single speaker with a single sentence works better.
It really, really does have a proclivity to sing or produce music! Less so at mid CFG (1.2-1.5).
@frontierai
- If the voice prompt contains bgm, the generated speech may also contain bgm. (The 7B model handles this more easily; see and use the demo on our code page.)
- If the voice prompt is clean (no bgm) but the input text contains introductory words such as "Welcome to", "Hello", ..., "However/But", the generated speech may still contain bgm.
- Otherwise, the 1.5B model produces bgm with medium probability (I'm not sure; it depends on the text). The 7B model is more stable in this case (lower probability).
Is there some way of pre-processing, post-processing, or transforming the audio that would allow for a music/bgm toggle?
Here is the Discord bot I used. It does not fix singing or background music, but it does fix the crazy nonsense generations that occasionally happen, by checking the output with Whisper.
import discord
from discord import app_commands
import asyncio
import torch
import soundfile as sf
import numpy as np
import os
import sys
import importlib.metadata
import re
import random
import uuid
from typing import Optional, List, Dict, Tuple


def check_library_versions():
    """Checks for correct library versions and exits if they are wrong."""
    required_versions = {
        "transformers": "4.51.3",
        "accelerate": "1.6.0",
        "openai-whisper": "20231117",
        "rapidfuzz": "3.9.4"
    }
    print("Checking required library versions...")
    all_ok = True
    for lib, req_ver in required_versions.items():
        try:
            installed_ver = importlib.metadata.version(lib)
            if installed_ver != req_ver:
                # Whisper/rapidfuzz mismatches only warn; the others are fatal.
                if lib in ["openai-whisper", "rapidfuzz"]:
                    print(f"--> Mismatch: '{lib}'. Required: {req_ver}, Installed: {installed_ver}. This may cause issues.")
                else:
                    print(f"--> Mismatch: '{lib}'. Required: {req_ver}, Installed: {installed_ver}")
                    all_ok = False
        except importlib.metadata.PackageNotFoundError:
            print(f"--> ERROR: Required library '{lib}' is not installed.")
            all_ok = False
    if not all_ok:
        print("\nFATAL: Incorrect library versions detected. Please fix your environment:")
        print("pip install transformers==4.51.3 accelerate==1.6.0 openai-whisper==20231117 rapidfuzz==3.9.4")
        sys.exit(1)
    else:
        print("Library versions are correct.")


check_library_versions()

# Imported only after the version check has passed.
import whisper
from rapidfuzz import fuzz
from transformers.utils import logging
from VibeVoice.vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from VibeVoice.vibevoice.processor.vibevoice_processor import VibeVoiceProcessor
from diffusers import DPMSolverMultistepScheduler

logging.set_verbosity_error()

DISCORD_BOT_TOKEN = os.getenv("VIBEVOICE_TOKEN", "YOUR_DISCORD_BOT_TOKEN")
MODEL_PATH = "microsoft/VibeVoice-1.5B"
VOICES_DIRECTORY = "voices"
MAX_RETRIES = 7
VERIFICATION_THRESHOLD = 85


class WhisperVerifier:
    def __init__(self, device: str = "cuda"):
        print("Loading Whisper model for verification...")
        try:
            self.model = whisper.load_model("tiny.en", device=device)
            print("Whisper 'tiny.en' model loaded successfully.")
        except Exception as e:
            print(f"FATAL: Could not load Whisper model. Verification will be disabled. Error: {e}")
            self.model = None

    def _normalize_text(self, text: str) -> str:
        return re.sub(r'[^\w\s]', '', text).lower().strip()

    def audio_starts_with_prefix(self, audio_path: str, prefix: str) -> bool:
        if not self.model: return True
        try:
            result = self.model.transcribe(audio_path, language='en', fp16=torch.cuda.is_available())
            transcribed_text = self._normalize_text(result['text'])
            expected_prefix = self._normalize_text(prefix.replace(":", ""))
            return transcribed_text.startswith(expected_prefix)
        except Exception as e:
            print(f"Whisper prefix verification failed for {audio_path}: {e}")
            return True

    def verify_audio_content(self, audio_path: str, expected_text: str) -> bool:
        if not self.model: return True
        try:
            result = self.model.transcribe(audio_path, language='en', fp16=torch.cuda.is_available())
            transcribed_text = self._normalize_text(result['text'])
            expected_text_normalized = self._normalize_text(expected_text)
            similarity = fuzz.ratio(transcribed_text, expected_text_normalized)
            print(f"Whisper content verification: Similarity={similarity:.2f}% for '{expected_text[:50]}...'")
            return similarity >= VERIFICATION_THRESHOLD
        except Exception as e:
            print(f"Whisper content verification failed for {audio_path}: {e}")
            return False


class VibeVoiceGenerator:
    def __init__(self, model_path: str, device: str = "cuda"):
        print("Loading VibeVoice model... This may take a moment.")
        if torch.cuda.is_available() and device == "cuda":
            self.device = "cuda"
            model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "cuda", "attn_implementation": "flash_attention_2"}
        else:
            self.device = "cpu"
            model_kwargs = {"torch_dtype": torch.float32, "device_map": "cpu"}
        try:
            self.processor = VibeVoiceProcessor.from_pretrained(model_path)
            self.model = VibeVoiceForConditionalGenerationInference.from_pretrained(model_path, **model_kwargs)
            self.model.eval()
            new_scheduler = DPMSolverMultistepScheduler.from_config(self.model.model.noise_scheduler.config, algorithm_type='sde-dpmsolver++', beta_schedule='squaredcos_cap_v2')
            self.model.model.noise_scheduler = new_scheduler
            self.model.set_ddpm_inference_steps(num_steps=10)
            print("VibeVoice model loaded successfully.")
        except Exception as e:
            print(f"FATAL: Failed to load model from {model_path}. Error: {e}")
            raise

    def generate_speech_chunk(self, text_with_prefix: str, voice_prompt: str, cfg_scale: float = 1.3, output_path: str = "temp_output.wav"):
        try:
            inputs = self.processor(text=[text_with_prefix], voice_samples=[[voice_prompt]], padding=True, return_tensors="pt", return_attention_mask=True)
            outputs = self.model.generate(**inputs, max_new_tokens=None, cfg_scale=cfg_scale, tokenizer=self.processor.tokenizer, generation_config={'do_sample': False}, verbose=False)
            self.processor.save_audio(outputs.speech_outputs[0], output_path=output_path)
            return output_path
        except Exception as e:
            print(f"Error during speech chunk generation: {e}")
            return None


class VibeVoiceBot(discord.Client):
    def __init__(self, *, intents: discord.Intents):
        super().__init__(intents=intents)
        self.tree = app_commands.CommandTree(self)
        self.generator: Optional[VibeVoiceGenerator] = None
        self.whisper_verifier: Optional[WhisperVerifier] = None
        self.voice_presets = {}

    async def setup_hook(self) -> None:
        print("Executing async setup_hook...")
        # Load both models in worker threads so the event loop is not blocked.
        self.generator = await asyncio.to_thread(VibeVoiceGenerator, model_path=MODEL_PATH)
        self.whisper_verifier = await asyncio.to_thread(WhisperVerifier)
        print("Models loaded asynchronously.")
        self.load_voices()
        await self.tree.sync()
        print("Command tree synced.")

    def load_voices(self):
        if not os.path.isdir(VOICES_DIRECTORY):
            print(f"Warning: Voices directory '{VOICES_DIRECTORY}' not found. Please create it.")
            return
        wav_files = sorted([f for f in os.listdir(VOICES_DIRECTORY) if f.lower().endswith(".wav")])
        for filename in wav_files:
            name = os.path.splitext(filename)[0]
            path = os.path.join(VOICES_DIRECTORY, filename)
            self.voice_presets[name] = path
        if self.voice_presets:
            print(f"Loaded {len(self.voice_presets)} voices: {', '.join(self.voice_presets.keys())}")
        else:
            print("Warning: No .wav files found in the voices directory.")


intents = discord.Intents.default()
bot = VibeVoiceBot(intents=intents)


@bot.event
async def on_ready():
    print(f'Logged in as {bot.user} (ID: {bot.user.id})')
    print('------')


async def voice_autocomplete(interaction: discord.Interaction, current: str) -> List[app_commands.Choice[str]]:
    voices = bot.voice_presets.keys()
    return [app_commands.Choice(name=voice, value=voice) for voice in voices if current.lower() in voice.lower()][:25]


def parse_script_and_assign_voices(raw_text: str, available_voices: Dict[str, str], voice_selections: Dict[int, Optional[str]]) -> Tuple[List[Tuple[int, str]], Dict[int, str], str]:
    # Normalize curly apostrophes (U+2019) and literal "\n" sequences.
    text_input = raw_text.strip().replace("\u2019", "'").replace('\\n', '\n')
    speaker_pattern = re.compile(r'(Speaker\s+\d+\s*:)', re.IGNORECASE)
    parts = speaker_pattern.split(text_input)
    parsed_lines, found_speakers, current_speaker_id = [], set(), 0
    # Text before the first "Speaker N:" marker is attributed to Speaker 0.
    initial_text = parts[0].strip()
    if initial_text:
        parsed_lines.append((current_speaker_id, initial_text))
        found_speakers.add(current_speaker_id)
    for i in range(1, len(parts), 2):
        delimiter, text_content = parts[i], parts[i+1].strip() if (i+1) < len(parts) else ""
        speaker_id_match = re.search(r'\d+', delimiter)
        if speaker_id_match:
            speaker_id = int(speaker_id_match.group(0))
            if text_content:
                parsed_lines.append((speaker_id, text_content))
                found_speakers.add(speaker_id)
    if not parsed_lines and not found_speakers and text_input:
        parsed_lines.append((0, text_input))
        found_speakers.add(0)
    if not parsed_lines: return [], {}, ""
    sorted_speakers = sorted(list(found_speakers))
    voice_map, assigned_voice_paths = {}, set()
    # First honor explicit voice selections...
    for speaker_id in sorted_speakers:
        selected_voice_name = voice_selections.get(speaker_id)
        if selected_voice_name and selected_voice_name in available_voices:
            voice_path = available_voices[selected_voice_name]
            voice_map[speaker_id] = voice_path
            assigned_voice_paths.add(voice_path)
    # ...then give the remaining speakers the unused voices in sorted order.
    sorted_available_voices = sorted(available_voices.items())
    remaining_voices = [path for name, path in sorted_available_voices if path not in assigned_voice_paths]
    for speaker_id in sorted_speakers:
        if speaker_id not in voice_map:
            if remaining_voices: voice_map[speaker_id] = remaining_voices.pop(0)
            elif available_voices: voice_map[speaker_id] = list(available_voices.values())[0]
            else: raise ValueError("No available voices to assign.")
    voice_name_map = {path: name for name, path in available_voices.items()}
    summary = ", ".join([f"Speaker {sp_id}: {voice_name_map.get(voice_map.get(sp_id), 'N/A')}" for sp_id in sorted_speakers])
    return parsed_lines, voice_map, summary


def trim_and_concatenate_chunks(chunk_details: List[Tuple[str, str]], final_path: str, verifier: WhisperVerifier) -> Optional[str]:
    all_audio_data, sample_rate = [], None
    for path, prefix_text in chunk_details:
        try:
            audio_data, sr = sf.read(path, dtype='float32')
            if sample_rate is None: sample_rate = sr
            elif sample_rate != sr:
                print(f"Warning: Sample rate mismatch on {path}. Skipping.")
                continue
            if verifier.audio_starts_with_prefix(path, prefix_text):
                # Estimate the spoken prefix duration at ~15 characters per second.
                chars_per_second = 15
                duration_to_trim = len(prefix_text) / chars_per_second
                samples_to_trim = int(duration_to_trim * sample_rate)
                all_audio_data.append(audio_data[samples_to_trim:])
            else:
                print(f"Whisper verification: Prefix not detected in {path}. Using untrimmed audio.")
                all_audio_data.append(audio_data)
        finally:
            # Always delete the temporary chunk, even on error or skip.
            if os.path.exists(path): os.remove(path)
    if not all_audio_data: return None
    final_audio = np.concatenate(all_audio_data)
    sf.write(final_path, final_audio, sample_rate)
    return final_path


@bot.tree.command(name="generate", description="Generates an audio file from multi-speaker text.")
@app_commands.describe(text="The script to convert to speech.", voice_0="Optional: The voice for Speaker 0.", voice_1="Optional: The voice for Speaker 1.", voice_2="Optional: The voice for Speaker 2.", voice_3="Optional: The voice for Speaker 3.", cfg_scale="Guidance strength (1.0-2.0). Default is 1.3.")
@app_commands.autocomplete(voice_0=voice_autocomplete, voice_1=voice_autocomplete, voice_2=voice_autocomplete, voice_3=voice_autocomplete)
async def generate(interaction: discord.Interaction, text: str, voice_0: Optional[str] = None, voice_1: Optional[str] = None, voice_2: Optional[str] = None, voice_3: Optional[str] = None, cfg_scale: Optional[float] = 1.3):
    await interaction.response.defer(thinking=True)
    if not bot.voice_presets:
        await interaction.followup.send("Error: No voice presets are loaded.", ephemeral=True)
        return
    voice_selections = {0: voice_0, 1: voice_1, 2: voice_2, 3: voice_3}
    if not (1.0 <= cfg_scale <= 2.0):
        await interaction.followup.send("Please choose a `cfg_scale` value between 1.0 and 2.0.", ephemeral=True)
        return
    final_output_file = None
    try:
        parsed_lines, voice_map, voice_summary = parse_script_and_assign_voices(text, bot.voice_presets, voice_selections)
        if not parsed_lines:
            await interaction.followup.send("Error: The provided text contains no valid dialogue lines to generate.", ephemeral=True)
            return
        chunk_details = []
        for i, (speaker_id, line_text) in enumerate(parsed_lines):
            prefix = f"Speaker {speaker_id}: "
            text_with_prefix = f"{prefix}{line_text}"
            voice_prompt_for_line = voice_map[speaker_id]
            generated_chunk_path = None
            # Regenerate the line until Whisper's transcript matches the text.
            for attempt in range(MAX_RETRIES):
                chunk_path = f"temp_chunk_{uuid.uuid4()}.wav"
                temp_path = await asyncio.to_thread(bot.generator.generate_speech_chunk, text_with_prefix, voice_prompt_for_line, cfg_scale, chunk_path)
                if not temp_path:
                    print(f"Attempt {attempt + 1}/{MAX_RETRIES} for line {i+1} failed during generation.")
                    await asyncio.sleep(1)
                    continue
                is_verified = await asyncio.to_thread(bot.whisper_verifier.verify_audio_content, temp_path, line_text)
                if is_verified:
                    generated_chunk_path = temp_path
                    break
                else:
                    print(f"Attempt {attempt + 1}/{MAX_RETRIES} for line {i+1} failed content verification.")
                    os.remove(temp_path)
                    await asyncio.sleep(1)
            if generated_chunk_path:
                chunk_details.append((generated_chunk_path, prefix))
            else:
                raise Exception(f"Failed to generate and verify audio for line '{line_text[:50]}...' after {MAX_RETRIES} attempts.")
        if not chunk_details: raise Exception("Audio generation produced no output files.")
        final_output_file = f"final_output_{uuid.uuid4()}.wav"
        await asyncio.to_thread(trim_and_concatenate_chunks, chunk_details, final_output_file, bot.whisper_verifier)
        clean_display_text = text.replace('\\n', '\n')
        response_message = f"Audio generated.\n**Voices:** {voice_summary}\n**Script:** \"{clean_display_text[:800]}...\""
        await interaction.followup.send(content=response_message, file=discord.File(final_output_file))
    except Exception as e:
        await interaction.followup.send(f"An error occurred: {e}", ephemeral=True)
    finally:
        if final_output_file and os.path.exists(final_output_file): os.remove(final_output_file)


if __name__ == "__main__":
    if sys.platform == "win32":
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    if DISCORD_BOT_TOKEN == "YOUR_DISCORD_BOT_TOKEN":
        print("ERROR: Remember to replace 'YOUR_DISCORD_BOT_TOKEN' with your one from the developer portal.")
    else:
        bot.run(DISCORD_BOT_TOKEN)
This is an example Discord bot that uses Whisper to check the output. To reiterate: it does not fix singing or background music, but it does fix the crazy nonsense generations that occasionally happen (and sometimes happen several times in a row).
One presumes that the larger model plus a verification step of this nature would solve the issue.
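For anyone who wants the verification idea without the Discord and model plumbing, the core loop can be sketched on its own. This is a simplified stand-in, not the bot's exact code: `difflib.SequenceMatcher` from the standard library replaces rapidfuzz's `fuzz.ratio` (so the scores will not match exactly), and `synthesize_and_transcribe` is a hypothetical callable standing in for "VibeVoice generates, Whisper transcribes".

```python
import re
from difflib import SequenceMatcher
from typing import Callable, Optional

VERIFICATION_THRESHOLD = 85  # same threshold value as the bot
MAX_RETRIES = 7              # same retry budget as the bot


def normalize(text: str) -> str:
    """Strip punctuation and lowercase, like the bot's _normalize_text."""
    return re.sub(r"[^\w\s]", "", text).lower().strip()


def similarity(transcript: str, expected: str) -> float:
    """Similarity on a 0-100 scale (difflib here, not rapidfuzz)."""
    matcher = SequenceMatcher(None, normalize(transcript), normalize(expected))
    return 100.0 * matcher.ratio()


def generate_verified(expected: str,
                      synthesize_and_transcribe: Callable[[str], str],
                      max_retries: int = MAX_RETRIES) -> Optional[int]:
    """Retry until the transcript is close enough to the expected text.

    Returns the 1-based attempt number that passed, or None if every
    attempt failed verification.
    """
    for attempt in range(1, max_retries + 1):
        transcript = synthesize_and_transcribe(expected)
        if similarity(transcript, expected) >= VERIFICATION_THRESHOLD:
            return attempt
    return None


if __name__ == "__main__":
    # Canned transcripts simulate one nonsense generation, then a good one.
    canned = iter([
        "la la la something else entirely",
        "The cat sat on a map beside a banana plant.",
    ])
    attempt = generate_verified(
        "the cat sat on a map beside a banana plant",
        lambda _text: next(canned),
    )
    print(f"accepted on attempt {attempt}")
```

Because the check only compares transcripts, audio with correct words over background music can still pass, which is consistent with the bot not fixing the bgm problem.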
I was also, very occasionally, getting "Speaker 1:" et cetera in the outputs; the bot also elides that. Perhaps this check could be made optional?
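One way the prefix trimming could be made optional is sketched below. The `trim_enabled` flag and both function names are hypothetical additions for illustration; the 15 characters-per-second estimate is the one the bot uses, and the 24000 Hz sample rate in the demo is just an example value.

```python
from typing import Sequence

CHARS_PER_SECOND = 15  # same rough speaking-rate estimate as the bot


def samples_to_trim(prefix: str, sample_rate: int, enabled: bool = True) -> int:
    """Estimated number of leading samples covered by a spoken prefix."""
    if not enabled:
        return 0
    # Integer arithmetic; equivalent to int(len / 15 * sample_rate) up to
    # float rounding.
    return len(prefix) * sample_rate // CHARS_PER_SECOND


def trim_prefix(audio: Sequence[float], prefix: str, sample_rate: int,
                prefix_detected: bool, trim_enabled: bool = True) -> Sequence[float]:
    """Drop the estimated prefix only if trimming is on and the prefix was heard."""
    n = samples_to_trim(prefix, sample_rate,
                        enabled=trim_enabled and prefix_detected)
    return audio[n:]


if __name__ == "__main__":
    sample_rate = 24000                # example value only
    audio = [0.0] * (2 * sample_rate)  # two seconds of silence as a stand-in
    kept = trim_prefix(audio, "Speaker 1: ", sample_rate, prefix_detected=True)
    print(len(audio) - len(kept), "samples trimmed")
    kept = trim_prefix(audio, "Speaker 1: ", sample_rate,
                       prefix_detected=True, trim_enabled=False)
    print(len(audio) - len(kept), "samples trimmed with the check disabled")
```

In the bot, `prefix_detected` would come from `audio_starts_with_prefix`, and the flag could be exposed as an extra slash-command option; slicing works the same way on the NumPy arrays returned by `sf.read`.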
Input
Speaker 1: the cat sat on a map beside a banana plant Speaker 0: a car, a cart, and a garden are parked by the barn Speaker 2: they took the path through the forest to the castle Speaker 3: water, butter, and tomato were added to the pasta Speaker 1: garage or garbage, it depends on how you say it Speaker 1: please record the data and check the schedule again Speaker 0: neither the root nor the roof was easy to navigate Speaker 1: he read the book on privacy, vitamins and aluminum Speaker 2: the elevator stopped at every floor of the theater Speaker 1: some colors vary, others stay the same world-wide
Example output of vibeVoice.py
Checking required library versions...
--> Mismatch: 'rapidfuzz'. Required: 3.9.4, Installed: 3.10.0. This may cause issues.
Library versions are correct.
...
2025-08-26 14:23:37 INFO discord.client logging in using static token
Executing async setup_hook...
Loading VibeVoice model... This may take a moment.
No preprocessor_config.json found at microsoft/VibeVoice-1.5B, using defaults
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████| 3/3 [00:03<00:00, 1.04s/it]
VibeVoice model loaded successfully.
Loading Whisper model for verification...
...
Whisper 'tiny.en' model loaded successfully.
Models loaded asynchronously.
Loaded 4 voices: test.wav-p217, test.wav-p246, test.wav-p266, test_master_voice
Command tree synced.
2025-08-26 14:23:44 INFO discord.gateway Shard ID None has connected to Gateway (Session ID: ...9fa5).
Logged in as voxbot#6015 (ID: ...9)
Whisper content verification: Similarity=70.59% for 'the cat sat on a map beside a banana plant...'
Attempt 1/7 for line 1 failed content verification.
Whisper content verification: Similarity=90.24% for 'the cat sat on a map beside a banana plant...'
Whisper content verification: Similarity=92.47% for 'a car, a cart, and a garden are parked by the barn...'
Whisper content verification: Similarity=100.00% for 'they took the path through the forest to the castl...'
Whisper content verification: Similarity=25.41% for 'water, butter, and tomato were added to the pasta...'
Attempt 1/7 for line 4 failed content verification.
Whisper content verification: Similarity=61.86% for 'water, butter, and tomato were added to the pasta...'
Attempt 2/7 for line 4 failed content verification.
Whisper content verification: Similarity=94.62% for 'water, butter, and tomato were added to the pasta...'
Whisper content verification: Similarity=95.65% for 'garage or garbage, it depends on how you say it...'
Whisper content verification: Similarity=100.00% for 'please record the data and check the schedule agai...'
Whisper content verification: Similarity=97.03% for 'neither the root nor the roof was easy to navigate...'
Whisper content verification: Similarity=100.00% for 'he read the book on privacy, vitamins and aluminum...'
Whisper content verification: Similarity=98.00% for 'the elevator stopped at every floor of the theater...'
Whisper content verification: Similarity=96.84% for 'some colors vary, others stay the same world-wide...'
Whisper verification: Prefix not detected in temp_chunk_721f46b0-5ade-4024-99f5-069470cae5ed.wav. Using untrimmed audio.
Whisper verification: Prefix not detected in temp_chunk_fa9d56c5-00a7-47bc-87f3-f77196bd1424.wav. Using untrimmed audio.
Whisper verification: Prefix not detected in temp_chunk_e26d2808-473d-48b0-9f41-f3f874847b1d.wav. Using untrimmed audio.
Whisper verification: Prefix not detected in temp_chunk_f51421fa-b5b7-48fc-a28c-5d1ca6bb2f0a.wav. Using untrimmed audio.
Whisper verification: Prefix not detected in temp_chunk_1a79e1cc-90e7-41d5-b103-e91228ca983c.wav. Using untrimmed audio.
Whisper verification: Prefix not detected in temp_chunk_3b37b7c6-7a9f-4e6d-b95d-9de4a8ef5731.wav. Using untrimmed audio.
Whisper verification: Prefix not detected in temp_chunk_655003e9-c3e6-42c8-b1b4-c20871995804.wav. Using untrimmed audio.
Whisper verification: Prefix not detected in temp_chunk_2f3e89c5-6c58-4cfa-95fb-df2d4a4410c9.wav. Using untrimmed audio.
Whisper verification: Prefix not detected in temp_chunk_eaa09f2c-3199-4c83-9de2-e24a1c58802e.wav. Using untrimmed audio.
Whisper verification: Prefix not detected in temp_chunk_6ee1b683-eb77-4019-99a4-088806f97178.wav. Using untrimmed audio.
As you can see, this does not fix the bgm issue, nor does it seem to consistently use the voices that have been set. But it is somewhat "better" than the original.