---
sdk: gradio
sdk_version: 5.36.2
---
# Whisper-WebUI

A Gradio-based browser interface for Whisper.
## Features
- Select the Whisper implementation you want to use between (see the sketch just below):
  - openai/whisper
  - SYSTRAN/faster-whisper (used by default)
  - Vaibhavs10/insanely-fast-whisper
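For context, here is a minimal sketch of what calling the default backend (SYSTRAN/faster-whisper) directly looks like; the model size, file name, and options are illustrative choices, not this WebUI's actual code:

```python
# Minimal faster-whisper sketch (illustrative values, not this WebUI's code)
from faster_whisper import WhisperModel

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# vad_filter=True enables the bundled Silero VAD to skip non-speech audio
segments, info = model.transcribe("audio.mp3", beam_size=5, vad_filter=True)

print(f"Detected language: {info.language} (probability {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```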
 
- Generate transcriptions from various sources, including files & microphone
- Currently supported output formats: CSV, SRT & TXT
- Speech to Text Translation (see the NLLB sketch below):
  - From other languages to English (this is Whisper's end-to-end speech-to-text translation feature)
  - Translate transcription files using Facebook NLLB models
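As a hedged illustration of the NLLB translation step via Hugging Face transformers (the model size and language codes below are example choices, not necessarily what this WebUI uses):

```python
# Illustrative NLLB translation sketch via Hugging Face transformers
# (model size and language codes are example choices)
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="fra_Latn")  # source: French
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("Bonjour tout le monde", return_tensors="pt")
tokens = model.generate(
    **inputs,
    # force the decoder to start in the target language (English)
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
)
print(tokenizer.batch_decode(tokens, skip_special_tokens=True)[0])
```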
 
- Pre-processing audio input with Silero VAD
- Post-processing with speaker diarization using the pyannote model (see the sketch below)
  - To download the pyannote model, you need a Hugging Face token and you must manually accept the model's terms on its Hugging Face model page
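A minimal sketch of the diarization step, assuming the pyannote pipeline API; the model id, token placeholder, and file name are examples:

```python
# Illustrative pyannote diarization sketch; requires a Hugging Face token and
# prior acceptance of the model's terms on its Hugging Face page.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",  # example model id
    use_auth_token="YOUR_HF_TOKEN",      # placeholder token
)

diarization = pipeline("audio.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.1f}s - {turn.end:.1f}s: {speaker}")
```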
 
## Installation and Running
- Run Locally
  - Prerequisite

    To run this WebUI, you need to have `git`, `python` version 3.8 ~ 3.10 & `FFmpeg`.

    If you're not using an Nvidia GPU, or using a `CUDA` version other than 12.4, edit the file `requirements.txt` to match your environment (see the example below).

    Please follow the links below to install the necessary software:
    - git: https://git-scm.com/downloads
    - python: https://www.python.org/downloads/
    - FFmpeg: https://ffmpeg.org/download.html
    - CUDA: https://developer.nvidia.com/cuda-downloads

    After installing `FFmpeg`, make sure to add the `FFmpeg/bin` folder to your system `PATH`.
  - Installation using the script files
    - Download the repository and extract its contents
    - Run `install.bat` or `install.sh` to install dependencies (it will create a `venv` directory and install dependencies there)
    - Start the WebUI with `start-webui.bat` or `start-webui.sh` (it will run `python app.py` after activating the venv)
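As an example of the `requirements.txt` edit mentioned in the prerequisites, a CPU-only machine might swap the CUDA 12.4 wheel index for PyTorch's CPU one. This is an illustrative fragment; the actual file in this repository may list different packages:

```
# Illustrative requirements.txt fragment for a CPU-only machine:
# point pip at PyTorch's CPU wheel index instead of the CUDA 12.4 one.
--extra-index-url https://download.pytorch.org/whl/cpu
torch
torchaudio
```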
 
- Running with Docker
  - Install and launch Docker-Desktop
  - Get the repository
  - If needed, update `docker-compose.yaml` to match your environment
  - Docker commands (a consolidated example follows this list):
    - Build the image (the image is about ~7GB): `docker compose build`
    - Run the container: `docker compose up`
  - Connect to the WebUI with your browser at http://localhost:7860
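Putting those commands together, a typical Docker session might look like the following; the clone URL is an assumption for illustration, so use whatever URL you got the repository from:

```
git clone https://github.com/jhj0517/Whisper-WebUI.git   # assumed repository URL
cd Whisper-WebUI
docker compose build   # builds the ~7GB image
docker compose up      # starts the container; open http://localhost:7860
```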
 
## VRAM Usages
- This project is integrated with faster-whisper by default for better VRAM usage and transcription speed.
  According to faster-whisper, the efficiency of the optimized Whisper model is as follows:

| Implementation | Precision | Beam size | Time  | Max. GPU memory | Max. CPU memory |
|----------------|-----------|-----------|-------|-----------------|-----------------|
| openai/whisper | fp16      | 5         | 4m30s | 11325MB         | 9439MB          |
| faster-whisper | fp16      | 5         | 54s   | 4755MB          | 3244MB          |
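Much of that gap comes from faster-whisper's optimized runtime, which also lets you trade precision for memory via its `compute_type` option. A hedged sketch (the option names are real faster-whisper values; the specific choices are examples):

```python
from faster_whisper import WhisperModel

# float16 on GPU: comparable to the fp16 faster-whisper row measured above
gpu_model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# int8 quantization on CPU: needs no VRAM at all, at some speed/accuracy cost
cpu_model = WhisperModel("large-v2", device="cpu", compute_type="int8")
```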
- Whisper's original VRAM usage table for the available models:

| Size   | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|--------|------------|--------------------|--------------------|---------------|----------------|
| tiny   | 39 M       | `tiny.en`          | `tiny`             | ~1 GB         | ~32x           |
| base   | 74 M       | `base.en`          | `base`             | ~1 GB         | ~16x           |
| small  | 244 M      | `small.en`         | `small`            | ~2 GB         | ~6x            |
| medium | 769 M      | `medium.en`        | `medium`           | ~5 GB         | ~2x            |
| large  | 1550 M     | N/A                | `large`            | ~10 GB        | 1x             |

Note: `.en` models are for English only, and you can use the `Translate to English` option from the other models.