Random sounds and music

#5
by jujutechnology - opened

I am getting random sounds and music with most generations, even with Speakers that don't have the bgm tag. Are others getting this?

Microsoft org

Hi @jujutechnology, thanks for the feedback.
The bgm or sounds are spontaneous, i.e., we can't control whether the model generates them.
But we have some findings:

  • If the voice prompt contains bgm, the generated speech may also contain bgm. (The 7B model handles this more easily; see and use the demo on our code page.)
  • If the voice prompt is clean (no bgm) but the input text contains introductory words such as "Welcome to", "Hello", ..., "However/But", the generated speech may still contain bgm.
  • Otherwise, the 1.5B model produces bgm with medium probability (I'm not sure; it depends on the text). The 7B model is more stable in this case (lower probability).

We didn't optimize the model for short utterances, so one- or two-sentence inputs may not be stable (clean). You can give it a try.
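For anyone who wants to screen inputs before generating, here is a minimal sketch that only restates the findings above as a checklist. It cannot prevent bgm (the model produces it spontaneously); it just flags conditions the reply associates with it. The function name `bgm_risk_hints`, the `prompt_has_bgm` flag, and the word list are my own assumptions, not part of VibeVoice.

```python
import re
from typing import List

# Words quoted in the findings above; illustrative, not an exhaustive list.
INTRO_WORDS = ["welcome to", "hello", "however", "but"]

def bgm_risk_hints(text: str, prompt_has_bgm: bool = False) -> List[str]:
    """Return the reasons (hypothetical heuristic) why this input might yield bgm."""
    hints = []
    if prompt_has_bgm:
        hints.append("voice prompt contains bgm")
    lowered = text.lower()
    for word in INTRO_WORDS:
        # Whole-word match, so "butter" does not count as "but".
        if re.search(r"\b" + re.escape(word) + r"\b", lowered):
            hints.append(f"intro word: {word}")
    # Rough sentence count: split on sentence-ending punctuation.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if len(sentences) <= 2:
        hints.append("short input (one or two sentences)")
    return hints
```

An empty list does not mean the output will be clean; it only means none of the listed conditions matched.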

I ran into this too. Sometimes it just happens, but most of the time it is OK. I think random music shows up when there are too many speakers; a single speaker with a single sentence works better.

It really, really does have a proclivity to sing or produce music! Less so at mid CFG (1.2-1.5).
@frontierai

  • If the voice prompt contains bgm, the generated speech may also contain bgm. (The 7B model handles this more easily; see and use the demo on our code page.)
  • If the voice prompt is clean (no bgm) but the input text contains introductory words such as "Welcome to", "Hello", ..., "However/But", the generated speech may still contain bgm.
  • Otherwise, the 1.5B model produces bgm with medium probability (I'm not sure; it depends on the text). The 7B model is more stable in this case (lower probability).

Is there some kind of pre-processing, post-processing, or transformation that would allow for a music/bgm toggle?


Here is the Discord bot I used. It does not fix singing or background music, but by checking the output with Whisper it does fix the crazy nonsense generations that happen occasionally.

import discord
from discord import app_commands
import asyncio
import torch
import soundfile as sf
import numpy as np
import os
import sys
import importlib.metadata
import re
import random
import uuid
from typing import Optional, List, Dict, Tuple

def check_library_versions():
    """Checks for correct library versions and exits if they are wrong."""
    required_versions = {
        "transformers": "4.51.3",
        "accelerate": "1.6.0",
        "openai-whisper": "20231117",
        "rapidfuzz": "3.9.4"
    }
    print("Checking required library versions...")
    all_ok = True
    for lib, req_ver in required_versions.items():
        try:
            installed_ver = importlib.metadata.version(lib)
            if installed_ver != req_ver:
                if lib in ["openai-whisper", "rapidfuzz"]:
                    print(f"--> Mismatch: '{lib}'. Required: {req_ver}, Installed: {installed_ver}. This may cause issues.")
                else:
                    print(f"--> Mismatch: '{lib}'. Required: {req_ver}, Installed: {installed_ver}")
                    all_ok = False
        except importlib.metadata.PackageNotFoundError:
            print(f"--> ERROR: Required library '{lib}' is not installed.")
            all_ok = False
    
    if not all_ok:
        print("\nFATAL: Incorrect library versions detected. Please fix your environment:")
        print("pip install transformers==4.51.3 accelerate==1.6.0 openai-whisper==20231117 rapidfuzz==3.9.4")
        sys.exit(1)
    else:
        print("Library versions are correct.")

check_library_versions()

import whisper
from rapidfuzz import fuzz
from transformers.utils import logging
from VibeVoice.vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from VibeVoice.vibevoice.processor.vibevoice_processor import VibeVoiceProcessor
from diffusers import DPMSolverMultistepScheduler

logging.set_verbosity_error()

DISCORD_BOT_TOKEN = os.getenv("VIBEVOICE_TOKEN", "YOUR_DISCORD_BOT_TOKEN")
MODEL_PATH = "microsoft/VibeVoice-1.5B"
VOICES_DIRECTORY = "voices"
MAX_RETRIES = 7
VERIFICATION_THRESHOLD = 85

class WhisperVerifier:
    def __init__(self, device: str = "cuda"):
        print("Loading Whisper model for verification...")
        try:
            self.model = whisper.load_model("tiny.en", device=device)
            print("Whisper 'tiny.en' model loaded successfully.")
        except Exception as e:
            print(f"FATAL: Could not load Whisper model. Verification will be disabled. Error: {e}")
            self.model = None

    def _normalize_text(self, text: str) -> str:
        return re.sub(r'[^\w\s]', '', text).lower().strip()

    def audio_starts_with_prefix(self, audio_path: str, prefix: str) -> bool:
        if not self.model: return True 
        try:
            result = self.model.transcribe(audio_path, language='en', fp16=torch.cuda.is_available())
            transcribed_text = self._normalize_text(result['text'])
            expected_prefix = self._normalize_text(prefix.replace(":", ""))
            return transcribed_text.startswith(expected_prefix)
        except Exception as e:
            print(f"Whisper prefix verification failed for {audio_path}: {e}")
            return True

    def verify_audio_content(self, audio_path: str, expected_text: str) -> bool:
        if not self.model: return True
        try:
            result = self.model.transcribe(audio_path, language='en', fp16=torch.cuda.is_available())
            transcribed_text = self._normalize_text(result['text'])
            expected_text_normalized = self._normalize_text(expected_text)
            similarity = fuzz.ratio(transcribed_text, expected_text_normalized)
            print(f"Whisper content verification: Similarity={similarity:.2f}% for '{expected_text[:50]}...'")
            return similarity >= VERIFICATION_THRESHOLD
        except Exception as e:
            print(f"Whisper content verification failed for {audio_path}: {e}")
            return False

class VibeVoiceGenerator:
    def __init__(self, model_path: str, device: str = "cuda"):
        print("Loading VibeVoice model... This may take a moment.")
        if torch.cuda.is_available() and device == "cuda":
            self.device = "cuda"
            model_kwargs = {"torch_dtype": torch.bfloat16, "device_map": "cuda", "attn_implementation": "flash_attention_2"}
        else:
            self.device = "cpu"
            model_kwargs = {"torch_dtype": torch.float32, "device_map": "cpu"}
        try:
            self.processor = VibeVoiceProcessor.from_pretrained(model_path)
            self.model = VibeVoiceForConditionalGenerationInference.from_pretrained(model_path, **model_kwargs)
            self.model.eval()
            new_scheduler = DPMSolverMultistepScheduler.from_config(self.model.model.noise_scheduler.config, algorithm_type='sde-dpmsolver++', beta_schedule='squaredcos_cap_v2')
            self.model.model.noise_scheduler = new_scheduler
            self.model.set_ddpm_inference_steps(num_steps=10)
            print("VibeVoice model loaded successfully.")
        except Exception as e:
            print(f"FATAL: Failed to load model from {model_path}. Error: {e}")
            raise

    def generate_speech_chunk(self, text_with_prefix: str, voice_prompt: str, cfg_scale: float = 1.3, output_path: str = "temp_output.wav"):
        try:
            inputs = self.processor(text=[text_with_prefix], voice_samples=[[voice_prompt]], padding=True, return_tensors="pt", return_attention_mask=True)
            outputs = self.model.generate(**inputs, max_new_tokens=None, cfg_scale=cfg_scale, tokenizer=self.processor.tokenizer, generation_config={'do_sample': False}, verbose=False)
            self.processor.save_audio(outputs.speech_outputs[0], output_path=output_path)
            return output_path
        except Exception as e:
            print(f"Error during speech chunk generation: {e}")
            return None

class VibeVoiceBot(discord.Client):
    def __init__(self, *, intents: discord.Intents):
        super().__init__(intents=intents)
        self.tree = app_commands.CommandTree(self)
        self.generator: Optional[VibeVoiceGenerator] = None
        self.whisper_verifier: Optional[WhisperVerifier] = None
        self.voice_presets = {}

    async def setup_hook(self) -> None:
        print("Executing async setup_hook...")
        self.generator = await asyncio.to_thread(VibeVoiceGenerator, model_path=MODEL_PATH)
        self.whisper_verifier = await asyncio.to_thread(WhisperVerifier)
        print("Models loaded asynchronously.")
        self.load_voices()
        await self.tree.sync()
        print("Command tree synced.")

    def load_voices(self):
        if not os.path.isdir(VOICES_DIRECTORY):
            print(f"Warning: Voices directory '{VOICES_DIRECTORY}' not found. Please create it.")
            return
        
        wav_files = sorted([f for f in os.listdir(VOICES_DIRECTORY) if f.lower().endswith(".wav")])
        for filename in wav_files:
            name = os.path.splitext(filename)[0]
            path = os.path.join(VOICES_DIRECTORY, filename)
            self.voice_presets[name] = path

        if self.voice_presets:
            print(f"Loaded {len(self.voice_presets)} voices: {', '.join(self.voice_presets.keys())}")
        else:
            print("Warning: No .wav files found in the voices directory.")

intents = discord.Intents.default()
bot = VibeVoiceBot(intents=intents)



@bot.event
async def on_ready():
    print(f'Logged in as {bot.user} (ID: {bot.user.id})')
    print('------')

async def voice_autocomplete(interaction: discord.Interaction, current: str) -> List[app_commands.Choice[str]]:
    voices = bot.voice_presets.keys()
    return [app_commands.Choice(name=voice, value=voice) for voice in voices if current.lower() in voice.lower()][:25]

def parse_script_and_assign_voices(raw_text: str, available_voices: Dict[str, str], voice_selections: Dict[int, Optional[str]]) -> Tuple[List[Tuple[int, str]], Dict[int, str], str]:
    text_input = raw_text.strip().replace("’", "'").replace('\\n', '\n')
    speaker_pattern = re.compile(r'(Speaker\s+\d+\s*:)', re.IGNORECASE)
    parts = speaker_pattern.split(text_input)
    parsed_lines, found_speakers, current_speaker_id = [], set(), 0
    initial_text = parts[0].strip()
    if initial_text:
        parsed_lines.append((current_speaker_id, initial_text))
        found_speakers.add(current_speaker_id)
    for i in range(1, len(parts), 2):
        delimiter, text_content = parts[i], parts[i+1].strip() if (i+1) < len(parts) else ""
        speaker_id_match = re.search(r'\d+', delimiter)
        if speaker_id_match:
            speaker_id = int(speaker_id_match.group(0))
            if text_content:
                parsed_lines.append((speaker_id, text_content))
                found_speakers.add(speaker_id)
    if not parsed_lines and not found_speakers and text_input:
        parsed_lines.append((0, text_input))
        found_speakers.add(0)
    if not parsed_lines: return [], {}, ""
    sorted_speakers = sorted(list(found_speakers))
    voice_map, assigned_voice_paths = {}, set()
    for speaker_id in sorted_speakers:
        selected_voice_name = voice_selections.get(speaker_id)
        if selected_voice_name and selected_voice_name in available_voices:
            voice_path = available_voices[selected_voice_name]
            voice_map[speaker_id] = voice_path
            assigned_voice_paths.add(voice_path)
    
    sorted_available_voices = sorted(available_voices.items())
    remaining_voices = [path for name, path in sorted_available_voices if path not in assigned_voice_paths]
    
    for speaker_id in sorted_speakers:
        if speaker_id not in voice_map:
            if remaining_voices: voice_map[speaker_id] = remaining_voices.pop(0)
            elif available_voices: voice_map[speaker_id] = list(available_voices.values())[0]
            else: raise ValueError("No available voices to assign.")
    voice_name_map = {path: name for name, path in available_voices.items()}
    summary = ", ".join([f"Speaker {sp_id}: {voice_name_map.get(voice_map.get(sp_id), 'N/A')}" for sp_id in sorted_speakers])
    return parsed_lines, voice_map, summary

def trim_and_concatenate_chunks(chunk_details: List[Tuple[str, str]], final_path: str, verifier: WhisperVerifier) -> Optional[str]:
    all_audio_data, sample_rate = [], None
    for path, prefix_text in chunk_details:
        try:
            audio_data, sr = sf.read(path, dtype='float32')
            if sample_rate is None: sample_rate = sr
            elif sample_rate != sr:
                print(f"Warning: Sample rate mismatch on {path}. Skipping.")
                continue
            
            if verifier.audio_starts_with_prefix(path, prefix_text):
                chars_per_second = 15
                duration_to_trim = len(prefix_text) / chars_per_second
                samples_to_trim = int(duration_to_trim * sample_rate)
                all_audio_data.append(audio_data[samples_to_trim:])
            else:
                print(f"Whisper verification: Prefix not detected in {path}. Using untrimmed audio.")
                all_audio_data.append(audio_data)
        finally:
            if os.path.exists(path): os.remove(path)
    if not all_audio_data: return None
    final_audio = np.concatenate(all_audio_data)
    sf.write(final_path, final_audio, sample_rate)
    return final_path



@bot.tree.command(name="generate", description="Generates an audio file from multi-speaker text.")
@app_commands.describe(text="The script to convert to speech.", voice_0="Optional: The voice for Speaker 0.", voice_1="Optional: The voice for Speaker 1.", voice_2="Optional: The voice for Speaker 2.", voice_3="Optional: The voice for Speaker 3.", cfg_scale="Guidance strength (1.0-2.0). Default is 1.3.")
@app_commands.autocomplete(voice_0=voice_autocomplete, voice_1=voice_autocomplete, voice_2=voice_autocomplete, voice_3=voice_autocomplete)
async def generate(interaction: discord.Interaction, text: str, voice_0: Optional[str] = None, voice_1: Optional[str] = None, voice_2: Optional[str] = None, voice_3: Optional[str] = None, cfg_scale: Optional[float] = 1.3):
    await interaction.response.defer(thinking=True)
    if not bot.voice_presets:
        await interaction.followup.send("Error: No voice presets are loaded.", ephemeral=True)
        return
    voice_selections = {0: voice_0, 1: voice_1, 2: voice_2, 3: voice_3}
    if not (1.0 <= cfg_scale <= 2.0):
        await interaction.followup.send("Please choose a `cfg_scale` value between 1.0 and 2.0.", ephemeral=True)
        return
    
    final_output_file = None
    try:
        parsed_lines, voice_map, voice_summary = parse_script_and_assign_voices(text, bot.voice_presets, voice_selections)
        if not parsed_lines:
            await interaction.followup.send("Error: The provided text contains no valid dialogue lines to generate.", ephemeral=True)
            return
        
        chunk_details = []
        for i, (speaker_id, line_text) in enumerate(parsed_lines):
            prefix = f"Speaker {speaker_id}: "
            text_with_prefix = f"{prefix}{line_text}"
            voice_prompt_for_line = voice_map[speaker_id]
            
            generated_chunk_path = None
            for attempt in range(MAX_RETRIES):
                chunk_path = f"temp_chunk_{uuid.uuid4()}.wav"
                temp_path = await asyncio.to_thread(bot.generator.generate_speech_chunk, text_with_prefix, voice_prompt_for_line, cfg_scale, chunk_path)
                if not temp_path:
                    print(f"Attempt {attempt + 1}/{MAX_RETRIES} for line {i+1} failed during generation.")
                    await asyncio.sleep(1)
                    continue
                is_verified = await asyncio.to_thread(bot.whisper_verifier.verify_audio_content, temp_path, line_text)
                if is_verified:
                    generated_chunk_path = temp_path
                    break 
                else:
                    print(f"Attempt {attempt + 1}/{MAX_RETRIES} for line {i+1} failed content verification.")
                    os.remove(temp_path)
                    await asyncio.sleep(1)
            
            if generated_chunk_path:
                chunk_details.append((generated_chunk_path, prefix))
            else:
                raise Exception(f"Failed to generate and verify audio for line '{line_text[:50]}...' after {MAX_RETRIES} attempts.")
        
        if not chunk_details: raise Exception("Audio generation produced no output files.")
        final_output_file = f"final_output_{uuid.uuid4()}.wav"
        await asyncio.to_thread(trim_and_concatenate_chunks, chunk_details, final_output_file, bot.whisper_verifier)
        clean_display_text = text.replace('\\n', '\n')
        response_message = f"🎙️ Audio generated.\n**Voices:** {voice_summary}\n**Script:** \"{clean_display_text[:800]}...\""
        await interaction.followup.send(content=response_message, file=discord.File(final_output_file))
    except Exception as e:
        await interaction.followup.send(f"An error occurred: {e}", ephemeral=True)
    finally:
        if final_output_file and os.path.exists(final_output_file): os.remove(final_output_file)

if __name__ == "__main__":
    if sys.platform == "win32":
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
    if DISCORD_BOT_TOKEN == "YOUR_DISCORD_BOT_TOKEN":
        print("ERROR: Remember to replace 'YOUR_DISCORD_BOT_TOKEN' with your one from the developer portal.")
    else:
        bot.run(DISCORD_BOT_TOKEN)  

An example Discord bot that uses Whisper to check the output. To reiterate: it does not fix singing or background music, but it does fix the crazy nonsense generations that occasionally happen (and sometimes happen several times in a row).

One presumes that the larger model plus a verification step of this nature would solve the issue.
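Stripped of the Discord and model plumbing, the verification idea is a retry loop around a similarity check. The sketch below is stdlib-only, so `difflib.SequenceMatcher` stands in for `rapidfuzz.fuzz.ratio` (scores will differ slightly from the bot's), and `synth_and_transcribe` is a hypothetical callable that would wrap "generate with VibeVoice, then transcribe with Whisper".

```python
import re
from difflib import SequenceMatcher
from typing import Callable, Optional

VERIFICATION_THRESHOLD = 85  # same value as in the bot above
MAX_RETRIES = 7              # same value as in the bot above

def normalize(text: str) -> str:
    # Same normalization as WhisperVerifier._normalize_text in the bot.
    return re.sub(r"[^\w\s]", "", text).lower().strip()

def similarity(a: str, b: str) -> float:
    # 0-100 score; difflib is a stand-in for rapidfuzz here (assumption).
    return 100.0 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()

def generate_verified(expected: str,
                      synth_and_transcribe: Callable[[str], str]) -> Optional[str]:
    """Retry until the transcript is close enough to the expected text."""
    for _ in range(MAX_RETRIES):
        heard = synth_and_transcribe(expected)
        if similarity(heard, expected) >= VERIFICATION_THRESHOLD:
            return heard
    return None  # every attempt failed verification
```

Note that a transcript check of this kind only sees words. Music or singing under otherwise correct speech can still pass, which matches what I observed.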

I was also, very occasionally, getting "Speaker 1:" et cetera in the text outputs. The bot elides that as well; perhaps this check could be made optional?
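One way the prefix check could be made optional, as a rough sketch: put the trim behind a flag and only trim when the transcript really starts with a spoken speaker label. The names `has_spoken_label`, `samples_to_trim`, and `check_prefix` are hypothetical; the 15 characters-per-second estimate is the one the bot already uses.

```python
import re

# Matches a transcript that opens with a spoken label such as "Speaker 1".
SPEAKER_LABEL = re.compile(r"^\s*speaker\s+\d+\b", re.IGNORECASE)

def has_spoken_label(transcript: str) -> bool:
    return SPEAKER_LABEL.match(transcript) is not None

def samples_to_trim(prefix: str, sample_rate: int, transcript: str,
                    check_prefix: bool = True, chars_per_second: int = 15) -> int:
    """Number of leading samples to drop; 0 when the check is off or no label was spoken."""
    if not check_prefix or not has_spoken_label(transcript):
        return 0
    # Integer arithmetic avoids float rounding in the duration estimate.
    return len(prefix) * sample_rate // chars_per_second
```

With `check_prefix=False` the audio is passed through untouched, which also skips the extra Whisper pass per chunk if the caller does not transcribe at all.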

Input

Speaker 1: the cat sat on a map beside a banana plant Speaker 0: a car, a cart, and a garden are parked by the barn Speaker 2: they took the path through the forest to the castle Speaker 3: water, butter, and tomato were added to the pasta Speaker 1: garage or garbage, it depends on how you say it Speaker 1: please record the data and check the schedule again Speaker 0: neither the root nor the roof was easy to navigate Speaker 1: he read the book on privacy, vitamins and aluminum Speaker 2: the elevator stopped at every floor of the theater Speaker 1: some colors vary, others stay the same world-wide

Example output of vibeVoice.py
Checking required library versions...
--> Mismatch: 'rapidfuzz'. Required: 3.9.4, Installed: 3.10.0. This may cause issues.
Library versions are correct.
...
2025-08-26 14:23:37 INFO discord.client logging in using static token
Executing async setup_hook...
Loading VibeVoice model... This may take a moment.
No preprocessor_config.json found at microsoft/VibeVoice-1.5B, using defaults
Loading checkpoint shards: 100%|██████████| 3/3 [00:03<00:00, 1.04s/it]
VibeVoice model loaded successfully.
Loading Whisper model for verification...
...
Whisper 'tiny.en' model loaded successfully.
Models loaded asynchronously.
Loaded 4 voices: test.wav-p217, test.wav-p246, test.wav-p266, test_master_voice
Command tree synced.
2025-08-26 14:23:44 INFO discord.gateway Shard ID None has connected to Gateway (Session ID: ...9fa5).
Logged in as voxbot#6015 (ID: ...9)

Whisper content verification: Similarity=70.59% for 'the cat sat on a map beside a banana plant...'
Attempt 1/7 for line 1 failed content verification.
Whisper content verification: Similarity=90.24% for 'the cat sat on a map beside a banana plant...'
Whisper content verification: Similarity=92.47% for 'a car, a cart, and a garden are parked by the barn...'
Whisper content verification: Similarity=100.00% for 'they took the path through the forest to the castl...'
Whisper content verification: Similarity=25.41% for 'water, butter, and tomato were added to the pasta...'
Attempt 1/7 for line 4 failed content verification.
Whisper content verification: Similarity=61.86% for 'water, butter, and tomato were added to the pasta...'
Attempt 2/7 for line 4 failed content verification.
Whisper content verification: Similarity=94.62% for 'water, butter, and tomato were added to the pasta...'
Whisper content verification: Similarity=95.65% for 'garage or garbage, it depends on how you say it...'
Whisper content verification: Similarity=100.00% for 'please record the data and check the schedule agai...'
Whisper content verification: Similarity=97.03% for 'neither the root nor the roof was easy to navigate...'
Whisper content verification: Similarity=100.00% for 'he read the book on privacy, vitamins and aluminum...'
Whisper content verification: Similarity=98.00% for 'the elevator stopped at every floor of the theater...'
Whisper content verification: Similarity=96.84% for 'some colors vary, others stay the same world-wide...'
Whisper verification: Prefix not detected in temp_chunk_721f46b0-5ade-4024-99f5-069470cae5ed.wav. Using untrimmed audio.
Whisper verification: Prefix not detected in temp_chunk_fa9d56c5-00a7-47bc-87f3-f77196bd1424.wav. Using untrimmed audio.
Whisper verification: Prefix not detected in temp_chunk_e26d2808-473d-48b0-9f41-f3f874847b1d.wav. Using untrimmed audio.
Whisper verification: Prefix not detected in temp_chunk_f51421fa-b5b7-48fc-a28c-5d1ca6bb2f0a.wav. Using untrimmed audio.
Whisper verification: Prefix not detected in temp_chunk_1a79e1cc-90e7-41d5-b103-e91228ca983c.wav. Using untrimmed audio.
Whisper verification: Prefix not detected in temp_chunk_3b37b7c6-7a9f-4e6d-b95d-9de4a8ef5731.wav. Using untrimmed audio.
Whisper verification: Prefix not detected in temp_chunk_655003e9-c3e6-42c8-b1b4-c20871995804.wav. Using untrimmed audio.
Whisper verification: Prefix not detected in temp_chunk_2f3e89c5-6c58-4cfa-95fb-df2d4a4410c9.wav. Using untrimmed audio.
Whisper verification: Prefix not detected in temp_chunk_eaa09f2c-3199-4c83-9de2-e24a1c58802e.wav. Using untrimmed audio.
Whisper verification: Prefix not detected in temp_chunk_6ee1b683-eb77-4019-99a4-088806f97178.wav. Using untrimmed audio.

As you can see, this does not fix the bgm issue, nor does it seem to consistently use the voices that have been set. But it is somewhat "better" than the original.

Thanks for the feedback. I am collecting it in the GitHub README FAQ.
