VibeVoice's Singing and Music Capabilities
Doing some digging into 2 points:
Singing Generation:
How does the model generate singing?
The technical report and demo code show no singing-specific training or logic. In another thread, @frontierai mentions that "bgm or sounds are spontaneous, i.e., we can't control it". Is singing an emergent capability of the Qwen2.5 LLM interpreting lyrical structure?
Then Background Music:
How was the "Podcast with Background Music" demo produced?
The documentation explicitly states that the model does not handle music, yet the demo shows the opposite. I cannot replicate it: either the same background music lasts no more than a few seconds, or the music changes with each speaker without any consistency.
see:
Whilst there are some tips on avoiding bgm, are there any for INCLUDING bgm or any other music?
Great model though, I had some fun playing with it.
frontierai
Microsoft org
Hi @jujutechnology, thanks for the feedback.
The bgm or sounds are spontaneous, i.e., we can't control whether the model generates them or not.
But we have some findings:
If the voice prompt contains bgm, the generated speech may also contain bgm. (The 7B model handles this easily; see and use the demo on our code page.)
If the voice prompt is clean (no bgm) but the input text contains introductory words such as "Welcome to", "Hello", ..., "However/But", the generated speech may also contain bgm.
Otherwise, the 1.5B model produces bgm with medium probability (I'm not sure; it depends on the text). The 7B model is more stable in this condition (lower probability).
We didn't optimize the model for short utterances, so one- or two-sentence inputs may not be stable (clean). You can give it a try.
The reply above was written in discussion #5: https://huggingface.co/microsoft/VibeVoice-1.5B/discussions/5#68ad6f80380355a3dcf830a0
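For anyone experimenting along the same lines, here is a minimal sketch of how the findings in that reply could be turned into a quick check on a script before generation. It does not call VibeVoice at all; `bgm_risk_notes`, `INTRO_WORDS`, and the `prompt_has_bgm` flag are hypothetical names of mine, the word list covers only the examples named in the reply (the original list is elided), and the sentence splitting is deliberately naive.

```python
import re
from typing import List

# Only the intro words explicitly named in the quoted reply; the original
# list is elided ("..."), so treat this as an incomplete assumption.
INTRO_WORDS = ("welcome to", "hello", "however", "but")


def split_sentences(text: str) -> List[str]:
    """Naive split on '.', '!' and '?' (an assumption, not VibeVoice logic)."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def bgm_risk_notes(text: str, prompt_has_bgm: bool) -> List[str]:
    """Return notes on conditions the reply associates with spontaneous bgm."""
    notes = []
    if prompt_has_bgm:
        notes.append("voice prompt contains bgm")

    lowered = text.lower()
    found = [w for w in INTRO_WORDS
             if re.search(r"\b" + re.escape(w) + r"\b", lowered)]
    if found:
        notes.append("intro words present: " + ", ".join(found))

    # The reply says one- or two-sentence inputs may not be stable.
    if len(split_sentences(text)) <= 2:
        notes.append("short input (one or two sentences)")
    return notes


if __name__ == "__main__":
    for note in bgm_risk_notes("Hello and welcome to the show.", prompt_has_bgm=False):
        print(note)
```

If the goal is to INCLUDE bgm rather than avoid it, the same notes can be read the other way round: a prompt with bgm plus an intro phrase is the combination the reply suggests is most likely to produce it, though nothing here guarantees the music stays consistent.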