{ "cells": [ { "cell_type": "markdown", "id": "13dc05a3-de12-4d7a-a926-e99d6d97826e", "metadata": {}, "source": [ "## Using Stable-ts with any ASR" ] }, { "cell_type": "code", "execution_count": null, "id": "5cfee322-ebca-4c23-87a4-a109a2f85203", "metadata": {}, "outputs": [], "source": [ "import stable_whisper\n", "assert int(stable_whisper.__version__.replace('.', '')) >= 270, f\"Requires Stable-ts 2.7.0+. Current version is {stable_whisper.__version__}.\"" ] }, { "cell_type": "markdown", "id": "e6c2dab2-f4df-46f9-b2e8-94dd88522c7d", "metadata": {}, "source": [ "
\n", "\n", "Stable-ts can be used for other ASR models or web APIs by wrapping them as a function then passing it as the first argument to `non_whisper.transcribe_any()`." ] }, { "cell_type": "code", "execution_count": 2, "id": "7d32fa9f-a54c-4996-97c3-3b360230d029", "metadata": { "tags": [] }, "outputs": [], "source": [ "def inference(audio, **kwargs) -> dict:\n", " # run model/API \n", " # return data as a dictionary\n", " data = {}\n", " return data" ] }, { "cell_type": "markdown", "id": "856ef1fd-f489-42af-a90c-97323fd05a6b", "metadata": {}, "source": [ "The data returned by the function must be one of the following:\n", "- an instance of `WhisperResult` containing the data\n", "- a dictionary in an appropriate mapping\n", "- a path of JSON file containing data in an appropriate mapping" ] }, { "cell_type": "markdown", "id": "bbdebdad-af1d-4077-8e99-20e767a0fd91", "metadata": {}, "source": [ "Here are the 3 types of mappings:" ] }, { "cell_type": "code", "execution_count": 3, "id": "06bc4ce7-5117-4674-8eb9-c343c13c18bc", "metadata": {}, "outputs": [], "source": [ "#1:\n", "essential_mapping = [\n", " [ # 1st Segment\n", " {'word': ' And', 'start': 0.0, 'end': 1.28}, \n", " {'word': ' when', 'start': 1.28, 'end': 1.52}, \n", " {'word': ' no', 'start': 1.52, 'end': 2.26}, \n", " {'word': ' ocean,', 'start': 2.26, 'end': 2.68},\n", " {'word': ' mountain,', 'start': 3.28, 'end': 3.58}\n", " ], \n", " [ # 2nd Segment\n", " {'word': ' or', 'start': 4.0, 'end': 4.08}, \n", " {'word': ' sky', 'start': 4.08, 'end': 4.56}, \n", " {'word': ' could', 'start': 4.56, 'end': 4.84}, \n", " {'word': ' contain', 'start': 4.84, 'end': 5.26}, \n", " {'word': ' us,', 'start': 5.26, 'end': 6.27},\n", " {'word': ' our', 'start': 6.27, 'end': 6.58}, \n", " {'word': ' gaze', 'start': 6.58, 'end': 6.98}, \n", " {'word': ' hungered', 'start': 6.98, 'end': 7.88}, \n", " {'word': ' starward.', 'start': 7.88, 'end': 8.64}\n", " ]\n", "]" ] }, { "cell_type": "markdown", "id": "b53bd812-2838-4f47-ab5f-5e729801aaee", "metadata": {}, "source": [ "
\n", "\n", "If word timings are not available they can be omitted, but operations that can be performed on this data will be limited." ] }, { "cell_type": "code", "execution_count": 4, "id": "8c6bf720-5bfd-4e79-90e7-7049a2ca1d3a", "metadata": {}, "outputs": [], "source": [ "#2:\n", "no_word_mapping = [\n", " {\n", " 'start': 0.0, \n", " 'end': 3.58, \n", " 'text': ' And when no ocean, mountain,',\n", " }, \n", " {\n", " 'start': 4.0, \n", " 'end': 8.64, \n", " 'text': ' or sky could contain us, our gaze hungered starward.', \n", " }\n", "]" ] }, { "cell_type": "markdown", "id": "108e960f-8bd1-4d2a-92bf-cc8cb56f4615", "metadata": {}, "source": [ "
\n", "\n", "Below is the full mapping for normal Stable-ts results. `None` takes the place of any omitted values except for `start`, `end`, and `text`/`word` which are required." ] }, { "cell_type": "code", "execution_count": 5, "id": "2969aad2-c8bf-4043-8015-669a3102e158", "metadata": {}, "outputs": [], "source": [ "#3:\n", "full_mapping = {\n", " 'language': 'en',\n", " 'text': ' And when no ocean, mountain, or sky could contain us, our gaze hungered starward.', \n", " 'segments': [\n", " {\n", " 'seek': 0.0, \n", " 'start': 0.0, \n", " 'end': 3.58, \n", " 'text': ' And when no ocean, mountain,', \n", " 'tokens': [400, 562, 572, 7810, 11, 6937, 11], \n", " 'temperature': 0.0, \n", " 'avg_logprob': -0.48702024376910663, \n", " 'compression_ratio': 1.0657894736842106, \n", " 'no_speech_prob': 0.3386174440383911, \n", " 'id': 0, \n", " 'words': [\n", " {'word': ' And', 'start': 0.04, 'end': 1.28, 'probability': 0.6481522917747498, 'tokens': [400]}, \n", " {'word': ' when', 'start': 1.28, 'end': 1.52, 'probability': 0.9869539141654968, 'tokens': [562]}, \n", " {'word': ' no', 'start': 1.52, 'end': 2.26, 'probability': 0.57384192943573, 'tokens': [572]}, \n", " {'word': ' ocean,', 'start': 2.26, 'end': 2.68, 'probability': 0.9484889507293701, 'tokens': [7810, 11]},\n", " {'word': ' mountain,', 'start': 3.28, 'end': 3.58, 'probability': 0.9581122398376465, 'tokens': [6937, 11]}\n", " ]\n", " }, \n", " {\n", " 'seek': 0.0, \n", " 'start': 4.0, \n", " 'end': 8.64, \n", " 'text': ' or sky could contain us, our gaze hungered starward.', \n", " 'tokens': [420, 5443, 727, 5304, 505, 11, 527, 24294, 5753, 4073, 3543, 1007, 13], \n", " 'temperature': 0.0, \n", " 'avg_logprob': -0.48702024376910663, \n", " 'compression_ratio': 1.0657894736842106, \n", " 'no_speech_prob': 0.3386174440383911, \n", " 'id': 1, \n", " 'words': [\n", " {'word': ' or', 'start': 4.0, 'end': 4.08, 'probability': 0.9937937259674072, 'tokens': [420]}, \n", " {'word': ' sky', 'start': 4.08, 'end': 4.56, 'probability': 0.9950089454650879, 'tokens': [5443]}, \n", " {'word': ' could', 'start': 4.56, 'end': 4.84, 'probability': 0.9915681481361389, 'tokens': [727]}, \n", " {'word': ' contain', 'start': 4.84, 'end': 5.26, 'probability': 0.898974597454071, 'tokens': [5304]}, \n", " {'word': ' us,', 'start': 5.26, 'end': 6.27, 'probability': 0.999351441860199, 'tokens': [505, 11]},\n", " {'word': ' our', 'start': 6.27, 'end': 6.58, 'probability': 0.9634224772453308, 'tokens': [527]}, \n", " {'word': ' gaze', 'start': 6.58, 'end': 6.98, 'probability': 0.8934874534606934, 'tokens': [24294]}, \n", " {'word': ' hungered', 'start': 6.98, 'end': 7.88, 'probability': 0.7424876093864441, 'tokens': [5753, 4073]}, \n", " {'word': ' starward.', 'start': 7.88, 'end': 8.64, 'probability': 0.464096799492836, 'tokens': [3543, 1007, 13]}\n", " ]\n", " }\n", " ]\n", "}" ] }, { "cell_type": "markdown", "id": "49d136e4-0f7d-4dcf-84f9-efb6f0eda491", "metadata": {}, "source": [ "
\n", "\n", "The function must also have `audio` as a parameter." ] }, { "cell_type": "code", "execution_count": 6, "id": "33f03286-69f9-4ae1-aec0-250fd92a8cb6", "metadata": { "tags": [] }, "outputs": [], "source": [ "def inference(audio, **kwargs) -> dict:\n", " # run model/API on the audio\n", " # return data in a proper format\n", " return essential_mapping" ] }, { "cell_type": "code", "execution_count": 7, "id": "d6710eb5-5386-42cf-b6e7-02a84b5fad40", "metadata": { "tags": [] }, "outputs": [], "source": [ "result = stable_whisper.transcribe_any(inference, './demo.wav', vad=True)" ] }, { "cell_type": "code", "execution_count": 8, "id": "6d7f9de6-5c9b-4c73-808d-640b13efb051", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0\n", "00:00:01,122 --> 00:00:02,680\n", "And when no ocean,\n", "\n", "1\n", "00:00:03,280 --> 00:00:03,580\n", "mountain,\n", "\n", "2\n", "00:00:04,000 --> 00:00:06,046\n", "or sky could contain us,\n", "\n", "3\n", "00:00:06,402 --> 00:00:08,640\n", "our gaze hungered starward.\n" ] } ], "source": [ "print(result.to_srt_vtt(word_level=False))" ] }, { "cell_type": "code", "execution_count": 9, "id": "be5a45e8-1b25-4a70-9af6-94bc5379fc7d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", " Transcribe an audio file using any ASR system.\n", "\n", " Parameters\n", " ----------\n", " inference_func: Callable\n", " Function that runs ASR when provided the [audio] and return data in the appropriate format.\n", " For format examples: https://github.com/jianfch/stable-ts/blob/main/examples/non-whisper.ipynb\n", "\n", " audio: Union[str, np.ndarray, torch.Tensor, bytes]\n", " The path/URL to the audio file, the audio waveform, or bytes of audio file.\n", "\n", " audio_type: str\n", " The type that [audio] needs to be for [inference_func]. (Default: Same type as [audio])\n", "\n", " Types:\n", " None (default)\n", " same type as [audio]\n", "\n", " 'str'\n", " a path to the file\n", " -if [audio] is a file and not audio preprocessing is done,\n", " [audio] will be directly passed into [inference_func]\n", " -if audio preprocessing is performed (from [demucs] and/or [only_voice_freq]),\n", " the processed audio will be encoded into [temp_file] and then passed into [inference_func]\n", "\n", " 'byte'\n", " bytes (used for APIs or to avoid writing any data to hard drive)\n", " -if [audio] is file, the bytes of file is used\n", " -if [audio] PyTorch tensor or NumPy array, the bytes of the [audio] encoded into WAV format is used\n", "\n", " 'torch'\n", " a PyTorch tensor containing the audio waveform, in float32 dtype, on CPU\n", "\n", " 'numpy'\n", " a NumPy array containing the audio waveform, in float32 dtype\n", "\n", " input_sr: int\n", " The sample rate of [audio]. (Default: Auto-detected if [audio] is str/bytes)\n", "\n", " model_sr: int\n", " The sample rate to resample the audio into for [inference_func]. (Default: Same as [input_sr])\n", " Resampling is only performed when [model_sr] do not match the sample rate of the final audio due to:\n", " -[input_sr] not matching\n", " -sample rate changed due to audio preprocessing from [demucs]=True\n", "\n", " inference_kwargs: dict\n", " Dictionary of arguments provided to [inference_func]. (Default: None)\n", "\n", " temp_file: str\n", " Temporary path for the preprocessed audio when [audio_type]='str'. (Default: './_temp_stable-ts_audio_.wav')\n", "\n", " verbose: bool\n", " Whether to display the text being decoded to the console. 
If True, displays all the details.\n", " If False, displays a progress bar. If None, does not display anything. (Default: False)\n", "\n", " regroup: Union[bool, str]\n", " Whether to regroup all words into segments with more natural boundaries. (Default: True)\n", " Specify string for customizing the regrouping algorithm.\n", " Ignored if [word_timestamps]=False.\n", "\n", " suppress_silence: bool\n", " Whether to suppress timestamps where audio is silent at segment-level\n", " and word-level if [suppress_word_ts]=True. (Default: True)\n", "\n", " suppress_word_ts: bool\n", " Whether to suppress timestamps, if [suppress_silence]=True, where audio is silent at word-level. (Default: True)\n", "\n", " q_levels: int\n", " Quantization levels for generating timestamp suppression mask; ignored if [vad]=True. (Default: 20)\n", " Acts as a threshold for marking sound as silent.\n", " Fewer levels will increase the threshold of volume at which to mark a sound as silent.\n", "\n", " k_size: int\n", " Kernel size for avg-pooling waveform to generate timestamp suppression mask; ignored if [vad]=True. (Default: 5)\n", " Recommend 5 or 3; higher sizes will reduce detection of silence.\n", "\n", " demucs: bool\n", " Whether to preprocess the audio track with Demucs to isolate vocals/remove noise. (Default: False)\n", " Demucs must be installed to use. Official repo: https://github.com/facebookresearch/demucs\n", "\n", " demucs_device: str\n", " Device to use for demucs: 'cuda' or 'cpu'. (Default: 'cuda' if torch.cuda.is_available() else 'cpu')\n", "\n", " demucs_output: str\n", " Path to save the vocals isolated by Demucs as a WAV file. Ignored if [demucs]=False.\n", " Demucs must be installed to use. Official repo: https://github.com/facebookresearch/demucs\n", "\n", " vad: bool\n", " Whether to use Silero VAD to generate timestamp suppression mask. (Default: False)\n", " Silero VAD requires PyTorch 1.12.0+. Official repo: https://github.com/snakers4/silero-vad\n", "\n", " vad_threshold: float\n", " Threshold for detecting speech with Silero VAD. (Default: 0.35)\n", " A low threshold reduces false positives for silence detection.\n", "\n", " vad_onnx: bool\n", " Whether to use ONNX for Silero VAD. (Default: False)\n", "\n", " min_word_dur: float\n", " Only allow suppressing timestamps that result in word durations greater than this value. (Default: 0.1)\n", "\n", " only_voice_freq: bool\n", " Whether to only use sound between 200 and 5000 Hz, where the majority of human speech is. (Default: False)\n", "\n", " only_ffmpeg: bool\n", " Whether to use only FFmpeg (and not yt-dlp) for URLs. (Default: False)\n", "\n", " Returns\n", " -------\n", " An instance of WhisperResult.\n", " \n" ] } ], "source": [ "print(stable_whisper.transcribe_any.__doc__)" ] }, { "cell_type": "code", "execution_count": null, "id": "a99ee627-6ab4-411d-ba27-d372d3647593", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.15" } }, "nbformat": 4, "nbformat_minor": 5 }