microsoft/VibeVoice-1.5B · VibeVoice's Singing and Music Capabilities

7 days ago

Doing some digging into 2 points:

Singing Generation:

How does the model generate singing?
The technical report and demo code do not describe any singing-specific training or logic. In another thread, @frontierai mentions that "bgm or sounds are spontaneous, i.e., we can't control it". Is singing an emergent capability of the Qwen2.5 LLM interpreting lyrical structure?

Background Music:

How was the "Podcast with Background Music" demo produced?
The documentation explicitly states that the model does not handle music, yet the demo shows the opposite! I cannot replicate it: either the same music stays in the background for no more than a few seconds, or the music changes with each speaker without any consistency.
see:

While there are some tips on avoiding bgm, are there any for INCLUDING bgm or other music?

Great model though, I had some fun playing with it.

PsiPi

7 days ago

frontierai
Microsoft org
about 6 hours ago
edited about 6 hours ago

Hi @jujutechnology, thanks for the feedback.
The bgm or sounds are spontaneous, i.e., we can't control whether the model generates them.
But we have some findings:

If the voice prompt contains bgm, the generated speech may also contain bgm. (The 7B model handles this easily; see and use the demo on our code page.)
If the voice prompt is clean (no bgm) but the input text contains introductory words such as "Welcome to", "Hello", ..., "However/But", the generated speech may still contain bgm.
Otherwise, the 1.5B model produces bgm with medium probability (I'm not sure; it depends on the text). The 7B model is more stable in this case (lower probability).

We haven't optimized the model for short utterances, so an input of one or two sentences may not be stable (clean). You can give it a try.

The above was written in discussion #5:

https://huggingface.co/microsoft/VibeVoice-1.5B/discussions/5#68ad6f80380355a3dcf830a0
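
For anyone who wants to check a script against those findings before generating, here is a rough sketch. It does not call VibeVoice at all; it only scans the input text. The function name `bgm_risk_notes` and the phrase list are my own assumptions (the original list is elided with "...", so this is not exhaustive), and since the behaviour is described as spontaneous, a match is only a hint, not a prediction.

```python
import re

# Example phrases quoted in the findings above. The original list is elided
# ("..."), so this tuple is an illustrative assumption, not a complete set.
INTRO_PHRASES = ("welcome to", "hello", "however", "but")


def bgm_risk_notes(text: str) -> list[str]:
    """Return heuristic notes about text that might coincide with bgm.

    Hypothetical helper: it only inspects the script text and knows
    nothing about the voice prompt or the model itself.
    """
    notes = []
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [
        s.strip()
        for s in re.split(r"(?<=[.!?])\s+", text.strip())
        if s.strip()
    ]
    for sentence in sentences:
        lowered = sentence.lower()
        for phrase in INTRO_PHRASES:
            # Match the phrase only at the start of the sentence, as a whole word.
            if re.match(rf"{re.escape(phrase)}\b", lowered):
                notes.append(f"sentence starts with {phrase!r}: {sentence[:40]}")
                break
    # The findings say one or two sentences of input may not be stable.
    if len(sentences) <= 2:
        notes.append("short input (one or two sentences) may be unstable")
    return notes
```

Lines that begin with a speaker label (for example "Speaker 1: Hello") would not match as written, so strip the label first if your script uses one.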

zzliang

7 days ago

See the FAQ in our updated GitHub README.
We may be able to communicate in a more timely way through GitHub issues.