{
"cells": [
{
"cell_type": "markdown",
"id": "13dc05a3-de12-4d7a-a926-e99d6d97826e",
"metadata": {},
"source": [
"## Using Stable-ts with any ASR"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5cfee322-ebca-4c23-87a4-a109a2f85203",
"metadata": {},
"outputs": [],
"source": [
"import stable_whisper\n",
"assert int(stable_whisper.__version__.replace('.', '')) >= 270, f\"Requires Stable-ts 2.7.0+. Current version is {stable_whisper.__version__}.\""
]
},
{
"cell_type": "markdown",
"id": "e6c2dab2-f4df-46f9-b2e8-94dd88522c7d",
"metadata": {},
"source": [
"
\n",
"\n",
"Stable-ts can be used for other ASR models or web APIs by wrapping them as a function then passing it as the first argument to `non_whisper.transcribe_any()`."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "7d32fa9f-a54c-4996-97c3-3b360230d029",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def inference(audio, **kwargs) -> dict:\n",
" # run model/API \n",
" # return data as a dictionary\n",
" data = {}\n",
" return data"
]
},
{
"cell_type": "markdown",
"id": "856ef1fd-f489-42af-a90c-97323fd05a6b",
"metadata": {},
"source": [
"The data returned by the function must be one of the following:\n",
"- an instance of `WhisperResult` containing the data\n",
"- a dictionary in an appropriate mapping\n",
"- a path of JSON file containing data in an appropriate mapping"
]
},
{
"cell_type": "markdown",
"id": "bbdebdad-af1d-4077-8e99-20e767a0fd91",
"metadata": {},
"source": [
"Here are the 3 types of mappings:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "06bc4ce7-5117-4674-8eb9-c343c13c18bc",
"metadata": {},
"outputs": [],
"source": [
"#1:\n",
"essential_mapping = [\n",
" [ # 1st Segment\n",
" {'word': ' And', 'start': 0.0, 'end': 1.28}, \n",
" {'word': ' when', 'start': 1.28, 'end': 1.52}, \n",
" {'word': ' no', 'start': 1.52, 'end': 2.26}, \n",
" {'word': ' ocean,', 'start': 2.26, 'end': 2.68},\n",
" {'word': ' mountain,', 'start': 3.28, 'end': 3.58}\n",
" ], \n",
" [ # 2nd Segment\n",
" {'word': ' or', 'start': 4.0, 'end': 4.08}, \n",
" {'word': ' sky', 'start': 4.08, 'end': 4.56}, \n",
" {'word': ' could', 'start': 4.56, 'end': 4.84}, \n",
" {'word': ' contain', 'start': 4.84, 'end': 5.26}, \n",
" {'word': ' us,', 'start': 5.26, 'end': 6.27},\n",
" {'word': ' our', 'start': 6.27, 'end': 6.58}, \n",
" {'word': ' gaze', 'start': 6.58, 'end': 6.98}, \n",
" {'word': ' hungered', 'start': 6.98, 'end': 7.88}, \n",
" {'word': ' starward.', 'start': 7.88, 'end': 8.64}\n",
" ]\n",
"]"
]
},
{
"cell_type": "markdown",
"id": "b53bd812-2838-4f47-ab5f-5e729801aaee",
"metadata": {},
"source": [
"
\n",
"\n",
"If word timings are not available they can be omitted, but operations that can be performed on this data will be limited."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "8c6bf720-5bfd-4e79-90e7-7049a2ca1d3a",
"metadata": {},
"outputs": [],
"source": [
"#2:\n",
"no_word_mapping = [\n",
" {\n",
" 'start': 0.0, \n",
" 'end': 3.58, \n",
" 'text': ' And when no ocean, mountain,',\n",
" }, \n",
" {\n",
" 'start': 4.0, \n",
" 'end': 8.64, \n",
" 'text': ' or sky could contain us, our gaze hungered starward.', \n",
" }\n",
"]"
]
},
{
"cell_type": "markdown",
"id": "108e960f-8bd1-4d2a-92bf-cc8cb56f4615",
"metadata": {},
"source": [
"
\n",
"\n",
"Below is the full mapping for normal Stable-ts results. `None` takes the place of any omitted values except for `start`, `end`, and `text`/`word` which are required."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "2969aad2-c8bf-4043-8015-669a3102e158",
"metadata": {},
"outputs": [],
"source": [
"#3:\n",
"full_mapping = {\n",
" 'language': 'en',\n",
" 'text': ' And when no ocean, mountain, or sky could contain us, our gaze hungered starward.', \n",
" 'segments': [\n",
" {\n",
" 'seek': 0.0, \n",
" 'start': 0.0, \n",
" 'end': 3.58, \n",
" 'text': ' And when no ocean, mountain,', \n",
" 'tokens': [400, 562, 572, 7810, 11, 6937, 11], \n",
" 'temperature': 0.0, \n",
" 'avg_logprob': -0.48702024376910663, \n",
" 'compression_ratio': 1.0657894736842106, \n",
" 'no_speech_prob': 0.3386174440383911, \n",
" 'id': 0, \n",
" 'words': [\n",
" {'word': ' And', 'start': 0.04, 'end': 1.28, 'probability': 0.6481522917747498, 'tokens': [400]}, \n",
" {'word': ' when', 'start': 1.28, 'end': 1.52, 'probability': 0.9869539141654968, 'tokens': [562]}, \n",
" {'word': ' no', 'start': 1.52, 'end': 2.26, 'probability': 0.57384192943573, 'tokens': [572]}, \n",
" {'word': ' ocean,', 'start': 2.26, 'end': 2.68, 'probability': 0.9484889507293701, 'tokens': [7810, 11]},\n",
" {'word': ' mountain,', 'start': 3.28, 'end': 3.58, 'probability': 0.9581122398376465, 'tokens': [6937, 11]}\n",
" ]\n",
" }, \n",
" {\n",
" 'seek': 0.0, \n",
" 'start': 4.0, \n",
" 'end': 8.64, \n",
" 'text': ' or sky could contain us, our gaze hungered starward.', \n",
" 'tokens': [420, 5443, 727, 5304, 505, 11, 527, 24294, 5753, 4073, 3543, 1007, 13], \n",
" 'temperature': 0.0, \n",
" 'avg_logprob': -0.48702024376910663, \n",
" 'compression_ratio': 1.0657894736842106, \n",
" 'no_speech_prob': 0.3386174440383911, \n",
" 'id': 1, \n",
" 'words': [\n",
" {'word': ' or', 'start': 4.0, 'end': 4.08, 'probability': 0.9937937259674072, 'tokens': [420]}, \n",
" {'word': ' sky', 'start': 4.08, 'end': 4.56, 'probability': 0.9950089454650879, 'tokens': [5443]}, \n",
" {'word': ' could', 'start': 4.56, 'end': 4.84, 'probability': 0.9915681481361389, 'tokens': [727]}, \n",
" {'word': ' contain', 'start': 4.84, 'end': 5.26, 'probability': 0.898974597454071, 'tokens': [5304]}, \n",
" {'word': ' us,', 'start': 5.26, 'end': 6.27, 'probability': 0.999351441860199, 'tokens': [505, 11]},\n",
" {'word': ' our', 'start': 6.27, 'end': 6.58, 'probability': 0.9634224772453308, 'tokens': [527]}, \n",
" {'word': ' gaze', 'start': 6.58, 'end': 6.98, 'probability': 0.8934874534606934, 'tokens': [24294]}, \n",
" {'word': ' hungered', 'start': 6.98, 'end': 7.88, 'probability': 0.7424876093864441, 'tokens': [5753, 4073]}, \n",
" {'word': ' starward.', 'start': 7.88, 'end': 8.64, 'probability': 0.464096799492836, 'tokens': [3543, 1007, 13]}\n",
" ]\n",
" }\n",
" ]\n",
"}"
]
},
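{
"cell_type": "markdown",
"id": "f0a1b2c3-d4e5-4f60-8a1b-2c3d4e5f6071",
"metadata": {},
"source": [
"The JSON-file option works the same way: write any of the above mappings to a file and return the file's path instead of the data itself. Below is a minimal sketch of this (the filename is arbitrary):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0b1c2d3e-4f50-4617-8293-a4b5c6d7e8f9",
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"\n",
"def inference_to_json(audio, **kwargs) -> str:\n",
"    # run model/API, then write the resulting mapping to a JSON file\n",
"    with open('result.json', 'w') as f:\n",
"        json.dump(essential_mapping, f)\n",
"    # return the path of the JSON file instead of the data\n",
"    return 'result.json'"
]
},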
{
"cell_type": "markdown",
"id": "49d136e4-0f7d-4dcf-84f9-efb6f0eda491",
"metadata": {},
"source": [
"
\n",
"\n",
"The function must also have `audio` as a parameter."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "33f03286-69f9-4ae1-aec0-250fd92a8cb6",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"def inference(audio, **kwargs) -> dict:\n",
" # run model/API on the audio\n",
" # return data in a proper format\n",
" return essential_mapping"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "d6710eb5-5386-42cf-b6e7-02a84b5fad40",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"result = stable_whisper.transcribe_any(inference, './demo.wav', vad=True)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "6d7f9de6-5c9b-4c73-808d-640b13efb051",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0\n",
"00:00:01,122 --> 00:00:02,680\n",
"And when no ocean,\n",
"\n",
"1\n",
"00:00:03,280 --> 00:00:03,580\n",
"mountain,\n",
"\n",
"2\n",
"00:00:04,000 --> 00:00:06,046\n",
"or sky could contain us,\n",
"\n",
"3\n",
"00:00:06,402 --> 00:00:08,640\n",
"our gaze hungered starward.\n"
]
}
],
"source": [
"print(result.to_srt_vtt(word_level=False))"
]
},
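{
"cell_type": "markdown",
"id": "1c2d3e4f-5061-4728-93a4-b5c6d7e8f901",
"metadata": {},
"source": [
"`to_srt_vtt()` can also write the subtitles directly to a file when given a path (the output filename here is arbitrary):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2d3e4f50-6172-4839-a4b5-c6d7e8f90123",
"metadata": {},
"outputs": [],
"source": [
"result.to_srt_vtt('demo.srt', word_level=False)"
]
},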
{
"cell_type": "code",
"execution_count": 9,
"id": "be5a45e8-1b25-4a70-9af6-94bc5379fc7d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
" Transcribe an audio file using any ASR system.\n",
"\n",
" Parameters\n",
" ----------\n",
" inference_func: Callable\n",
" Function that runs ASR when provided the [audio] and return data in the appropriate format.\n",
" For format examples: https://github.com/jianfch/stable-ts/blob/main/examples/non-whisper.ipynb\n",
"\n",
" audio: Union[str, np.ndarray, torch.Tensor, bytes]\n",
" The path/URL to the audio file, the audio waveform, or bytes of audio file.\n",
"\n",
" audio_type: str\n",
" The type that [audio] needs to be for [inference_func]. (Default: Same type as [audio])\n",
"\n",
" Types:\n",
" None (default)\n",
" same type as [audio]\n",
"\n",
" 'str'\n",
" a path to the file\n",
" -if [audio] is a file and not audio preprocessing is done,\n",
" [audio] will be directly passed into [inference_func]\n",
" -if audio preprocessing is performed (from [demucs] and/or [only_voice_freq]),\n",
" the processed audio will be encoded into [temp_file] and then passed into [inference_func]\n",
"\n",
" 'byte'\n",
" bytes (used for APIs or to avoid writing any data to hard drive)\n",
" -if [audio] is file, the bytes of file is used\n",
" -if [audio] PyTorch tensor or NumPy array, the bytes of the [audio] encoded into WAV format is used\n",
"\n",
" 'torch'\n",
" a PyTorch tensor containing the audio waveform, in float32 dtype, on CPU\n",
"\n",
" 'numpy'\n",
" a NumPy array containing the audio waveform, in float32 dtype\n",
"\n",
" input_sr: int\n",
" The sample rate of [audio]. (Default: Auto-detected if [audio] is str/bytes)\n",
"\n",
" model_sr: int\n",
" The sample rate to resample the audio into for [inference_func]. (Default: Same as [input_sr])\n",
" Resampling is only performed when [model_sr] do not match the sample rate of the final audio due to:\n",
" -[input_sr] not matching\n",
" -sample rate changed due to audio preprocessing from [demucs]=True\n",
"\n",
" inference_kwargs: dict\n",
" Dictionary of arguments provided to [inference_func]. (Default: None)\n",
"\n",
" temp_file: str\n",
" Temporary path for the preprocessed audio when [audio_type]='str'. (Default: './_temp_stable-ts_audio_.wav')\n",
"\n",
" verbose: bool\n",
" Whether to display the text being decoded to the console. If True, displays all the details,\n",
" If False, displays progressbar. If None, does not display anything (Default: False)\n",
"\n",
" regroup: Union[bool, str]\n",
" Whether to regroup all words into segments with more natural boundaries. (Default: True)\n",
" Specify string for customizing the regrouping algorithm.\n",
" Ignored if [word_timestamps]=False.\n",
"\n",
" suppress_silence: bool\n",
" Whether to suppress timestamp where audio is silent at segment-level\n",
" and word-level if [suppress_word_ts]=True. (Default: True)\n",
"\n",
" suppress_word_ts: bool\n",
" Whether to suppress timestamps, if [suppress_silence]=True, where audio is silent at word-level. (Default: True)\n",
"\n",
" q_levels: int\n",
" Quantization levels for generating timestamp suppression mask; ignored if [vad]=true. (Default: 20)\n",
" Acts as a threshold to marking sound as silent.\n",
" Fewer levels will increase the threshold of volume at which to mark a sound as silent.\n",
"\n",
" k_size: int\n",
" Kernel size for avg-pooling waveform to generate timestamp suppression mask; ignored if [vad]=true. (Default: 5)\n",
" Recommend 5 or 3; higher sizes will reduce detection of silence.\n",
"\n",
" demucs: bool\n",
" Whether to preprocess the audio track with Demucs to isolate vocals/remove noise. (Default: False)\n",
" Demucs must be installed to use. Official repo: https://github.com/facebookresearch/demucs\n",
"\n",
" demucs_device: str\n",
" Device to use for demucs: 'cuda' or 'cpu'. (Default. 'cuda' if torch.cuda.is_available() else 'cpu')\n",
"\n",
" demucs_output: str\n",
" Path to save the vocals isolated by Demucs as WAV file. Ignored if [demucs]=False.\n",
" Demucs must be installed to use. Official repo: https://github.com/facebookresearch/demucs\n",
"\n",
" vad: bool\n",
" Whether to use Silero VAD to generate timestamp suppression mask. (Default: False)\n",
" Silero VAD requires PyTorch 1.12.0+. Official repo: https://github.com/snakers4/silero-vad\n",
"\n",
" vad_threshold: float\n",
" Threshold for detecting speech with Silero VAD. (Default: 0.35)\n",
" Low threshold reduces false positives for silence detection.\n",
"\n",
" vad_onnx: bool\n",
" Whether to use ONNX for Silero VAD. (Default: False)\n",
"\n",
" min_word_dur: float\n",
" Only allow suppressing timestamps that result in word durations greater than this value. (default: 0.1)\n",
"\n",
" only_voice_freq: bool\n",
" Whether to only use sound between 200 - 5000 Hz, where majority of human speech are. (Default: False)\n",
"\n",
" only_ffmpeg: bool\n",
" Whether to use only FFmpeg (and not yt-dlp) for URls. (Default: False)\n",
"\n",
" Returns\n",
" -------\n",
" An instance of WhisperResult.\n",
" \n"
]
}
],
"source": [
"print(stable_whisper.transcribe_any.__doc__)"
]
},
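{
"cell_type": "markdown",
"id": "3e4f5061-7283-494a-b5c6-d7e8f9012345",
"metadata": {},
"source": [
"Putting it together: below is a sketch of wrapping an ASR web API. `my_asr_api` is a hypothetical stand-in for whatever service is being called, so the final call is commented out. Per the docstring above, `audio_type='byte'` makes `transcribe_any()` pass the audio to the function as bytes, and `inference_kwargs` forwards extra arguments to it."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4f506172-8394-4a5b-8cd7-e8f901234567",
"metadata": {},
"outputs": [],
"source": [
"def api_inference(audio: bytes, language: str = 'en', **kwargs) -> list:\n",
"    # `my_asr_api` is hypothetical; replace with an actual ASR call\n",
"    response = my_asr_api.transcribe(audio, language=language)\n",
"    # convert the API's output into the essential mapping:\n",
"    # a list of segments, each a list of word dictionaries\n",
"    return [\n",
"        [{'word': w['text'], 'start': w['start'], 'end': w['end']} for w in seg['words']]\n",
"        for seg in response['segments']\n",
"    ]\n",
"\n",
"# result = stable_whisper.transcribe_any(\n",
"#     api_inference,\n",
"#     './demo.wav',\n",
"#     audio_type='byte',\n",
"#     inference_kwargs={'language': 'en'},\n",
"#     vad=True\n",
"# )"
]
},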
{
"cell_type": "code",
"execution_count": null,
"id": "a99ee627-6ab4-411d-ba27-d372d3647593",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}