Possibly helpful info for Windows users wanting to run this locally.
I was having a lot of trouble trying to run this locally on Windows 11, mainly due to problems building flash attention (which it requires).
After some research online I was pointed to the GitHub page below, which hosts pre-built wheels. After installing the one matching my Torch, CUDA and Python versions, I was able to get the VibeVoice Gradio demo running locally.
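To pick the right pre-built wheel you need three things: your Python version, your Torch version, and the CUDA version Torch was built with. A minimal sketch for finding them (the cp-tag format is standard wheel naming; the commented-out torch check assumes torch is already installed):

```python
import sys

# CPython ABI tag as it appears in wheel filenames, e.g. "cp311" for Python 3.11.
cp_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"
print(cp_tag)

# To see your Torch and CUDA versions, run this separately (requires torch):
#   python -c "import torch; print(torch.__version__, torch.version.cuda)"
# Then download the release wheel whose filename matches all three values.
```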
https://github.com/kingbri1/flash-attention/releases
Hope this helps someone.
And thanks to Microsoft for releasing this. It's just the sort of thing I was after, and I could not get alternatives working locally.
Also, the larger 7B version model (available at the link below) runs just fine on an RTX 3090 with 24GB VRAM.
Thank you so much for taking the time to share your experience and this valuable solution!
Your tip about the pre-built versions of flash attention is incredibly helpful for others who might be trying to run this locally on Windows 11. We're thrilled to hear that you were able to get the VibeVoice Gradio demo running, and we appreciate you confirming that the larger model runs well on an RTX 3090.
To make this information more visible to everyone in the community, would you be willing to contribute to the GitHub repository? If you're open to it, submitting a Pull Request (PR) to update the README with this information would be fantastic. Alternatively, you could open an issue detailing the solution.
This would be a great way to formally incorporate your findings and help future users who might run into the same issue.
Thanks again for your contribution!
https://github.com/Dao-AILab/flash-attention/releases
FWIW, building flash attention is generally difficult on Windows because the required tools are not linked properly in the environment variables, or the build system (typically Visual Studio) is missing, or you have more than one build system and it's picking the wrong one.
Try https://learn.microsoft.com/en-us/visualstudio/releases/2022/release-notes-v17.4 and ensure it is the first thing in your PATH.
Oftentimes the laborious exercise of ensuring the correct versions of the things you wish to build against are listed FIRST in your environment variables does the trick. Make sure the version you want to link against appears ABOVE the version you don't want to use in your PATH. For example:
CUDA Toolkit 12.1 is the one you want; 12.2 works okay as well.
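Since PATH is searched left to right, whichever CUDA directory appears first is the one the build will find. A quick sketch to check the ordering (the directory names here are hypothetical examples, not your actual install paths):

```python
import os

def first_match(path_value, needles, sep=os.pathsep):
    # Scan PATH entries left to right; return the first entry that
    # contains any of the given version strings, plus which one matched.
    for entry in path_value.split(sep):
        for needle in needles:
            if needle.lower() in entry.lower():
                return entry, needle
    return None

# Hypothetical Windows-style PATH with two CUDA toolkits installed:
demo_path = r"C:\CUDA\v12.1\bin;C:\CUDA\v12.2\bin"
entry, version = first_match(demo_path, ["v12.2", "v12.1"], sep=";")
print(entry, version)  # the v12.1 entry wins because it comes first
```

Run the same check against `os.environ["PATH"]` on your own machine; if the wrong toolkit comes first, reorder the entries in the Windows environment-variables dialog.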
Install Ninja, or be prepared for the build to take all night.
Try: python -m pip install flash-attn --no-build-isolation
The --no-build-isolation flag is important.
Expect it to take many hours unless your machine is insanely fast.
I have built it for my machine probably hundreds of times (God help me) and also used wheels. Just use a wheel if you can; there is little, if any, difference.
hope it helps
It's very helpful!