{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "13dc05a3-de12-4d7a-a926-e99d6d97826e",
   "metadata": {},
   "source": [
    "## Using Stable-ts with any ASR"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5cfee322-ebca-4c23-87a4-a109a2f85203",
   "metadata": {},
   "outputs": [],
   "source": [
    "import stable_whisper\n",
    "assert int(stable_whisper.__version__.replace('.', '')) >= 270, f\"Requires Stable-ts 2.7.0+. Current version is {stable_whisper.__version__}.\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e6c2dab2-f4df-46f9-b2e8-94dd88522c7d",
   "metadata": {},
   "source": [
    "<br />\n",
    "\n",
    "Stable-ts can be used for other ASR models or web APIs by wrapping them as a function then passing it as the first argument to `non_whisper.transcribe_any()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "7d32fa9f-a54c-4996-97c3-3b360230d029",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def inference(audio, **kwargs) -> dict:\n",
    "    # run model/API \n",
    "    # return data as a dictionary\n",
    "    data = {}\n",
    "    return data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "856ef1fd-f489-42af-a90c-97323fd05a6b",
   "metadata": {},
   "source": [
    "The data returned by the function must be one of the following:\n",
    "- an instance of `WhisperResult` containing the data\n",
    "- a dictionary in an appropriate mapping\n",
    "- a path of JSON file containing data in an appropriate mapping"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bbdebdad-af1d-4077-8e99-20e767a0fd91",
   "metadata": {},
   "source": [
    "Here are the 3 types of mappings:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "06bc4ce7-5117-4674-8eb9-c343c13c18bc",
   "metadata": {},
   "outputs": [],
   "source": [
    "#1:\n",
    "essential_mapping = [\n",
    "    [   # 1st Segment\n",
    "        {'word': ' And', 'start': 0.0, 'end': 1.28}, \n",
    "        {'word': ' when', 'start': 1.28, 'end': 1.52}, \n",
    "        {'word': ' no', 'start': 1.52, 'end': 2.26}, \n",
    "        {'word': ' ocean,', 'start': 2.26, 'end': 2.68},\n",
    "        {'word': ' mountain,', 'start': 3.28, 'end': 3.58}\n",
    "    ], \n",
    "    [   # 2nd Segment\n",
    "        {'word': ' or', 'start': 4.0, 'end': 4.08}, \n",
    "        {'word': ' sky', 'start': 4.08, 'end': 4.56}, \n",
    "        {'word': ' could', 'start': 4.56, 'end': 4.84}, \n",
    "        {'word': ' contain', 'start': 4.84, 'end': 5.26}, \n",
    "        {'word': ' us,', 'start': 5.26, 'end': 6.27},\n",
    "        {'word': ' our', 'start': 6.27, 'end': 6.58}, \n",
    "        {'word': ' gaze', 'start': 6.58, 'end': 6.98}, \n",
    "        {'word': ' hungered', 'start': 6.98, 'end': 7.88}, \n",
    "        {'word': ' starward.', 'start': 7.88, 'end': 8.64}\n",
    "    ]\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b53bd812-2838-4f47-ab5f-5e729801aaee",
   "metadata": {},
   "source": [
    "<br />\n",
    "\n",
    "If word timings are not available they can be omitted, but operations that can be performed on this data will be limited."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "8c6bf720-5bfd-4e79-90e7-7049a2ca1d3a",
   "metadata": {},
   "outputs": [],
   "source": [
    "#2:\n",
    "no_word_mapping = [\n",
    "    {\n",
    "        'start': 0.0, \n",
    "        'end': 3.58, \n",
    "        'text': ' And when no ocean, mountain,',\n",
    "    }, \n",
    "    {\n",
    "        'start': 4.0, \n",
    "        'end': 8.64, \n",
    "        'text': ' or sky could contain us, our gaze hungered starward.', \n",
    "    }\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "108e960f-8bd1-4d2a-92bf-cc8cb56f4615",
   "metadata": {},
   "source": [
    "<br />\n",
    "\n",
    "Below is the full mapping for normal Stable-ts results. `None` takes the place of any omitted values except for `start`, `end`, and `text`/`word` which are required."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "2969aad2-c8bf-4043-8015-669a3102e158",
   "metadata": {},
   "outputs": [],
   "source": [
    "#3:\n",
    "full_mapping = {\n",
    "    'language': 'en',\n",
    "    'text': ' And when no ocean, mountain, or sky could contain us, our gaze hungered starward.', \n",
    "    'segments': [\n",
    "        {\n",
    "            'seek': 0.0, \n",
    "            'start': 0.0, \n",
    "            'end': 3.58, \n",
    "            'text': ' And when no ocean, mountain,', \n",
    "            'tokens': [400, 562, 572, 7810, 11, 6937, 11], \n",
    "            'temperature': 0.0, \n",
    "            'avg_logprob': -0.48702024376910663, \n",
    "            'compression_ratio': 1.0657894736842106, \n",
    "            'no_speech_prob': 0.3386174440383911, \n",
    "            'id': 0, \n",
    "            'words': [\n",
    "                {'word': ' And', 'start': 0.04, 'end': 1.28, 'probability': 0.6481522917747498, 'tokens': [400]}, \n",
    "                {'word': ' when', 'start': 1.28, 'end': 1.52, 'probability': 0.9869539141654968, 'tokens': [562]}, \n",
    "                {'word': ' no', 'start': 1.52, 'end': 2.26, 'probability': 0.57384192943573, 'tokens': [572]}, \n",
    "                {'word': ' ocean,', 'start': 2.26, 'end': 2.68, 'probability': 0.9484889507293701, 'tokens': [7810, 11]},\n",
    "                {'word': ' mountain,', 'start': 3.28, 'end': 3.58, 'probability': 0.9581122398376465, 'tokens': [6937, 11]}\n",
    "            ]\n",
    "        }, \n",
    "        {\n",
    "            'seek': 0.0, \n",
    "            'start': 4.0, \n",
    "            'end': 8.64, \n",
    "            'text': ' or sky could contain us, our gaze hungered starward.', \n",
    "            'tokens': [420, 5443, 727, 5304, 505, 11, 527, 24294, 5753, 4073, 3543, 1007, 13], \n",
    "            'temperature': 0.0, \n",
    "            'avg_logprob': -0.48702024376910663, \n",
    "            'compression_ratio': 1.0657894736842106, \n",
    "            'no_speech_prob': 0.3386174440383911, \n",
    "            'id': 1, \n",
    "            'words': [\n",
    "                {'word': ' or', 'start': 4.0, 'end': 4.08, 'probability': 0.9937937259674072, 'tokens': [420]}, \n",
    "                {'word': ' sky', 'start': 4.08, 'end': 4.56, 'probability': 0.9950089454650879, 'tokens': [5443]}, \n",
    "                {'word': ' could', 'start': 4.56, 'end': 4.84, 'probability': 0.9915681481361389, 'tokens': [727]}, \n",
    "                {'word': ' contain', 'start': 4.84, 'end': 5.26, 'probability': 0.898974597454071, 'tokens': [5304]}, \n",
    "                {'word': ' us,', 'start': 5.26, 'end': 6.27, 'probability': 0.999351441860199, 'tokens': [505, 11]},\n",
    "                {'word': ' our', 'start': 6.27, 'end': 6.58, 'probability': 0.9634224772453308, 'tokens': [527]}, \n",
    "                {'word': ' gaze', 'start': 6.58, 'end': 6.98, 'probability': 0.8934874534606934, 'tokens': [24294]}, \n",
    "                {'word': ' hungered', 'start': 6.98, 'end': 7.88, 'probability': 0.7424876093864441, 'tokens': [5753, 4073]}, \n",
    "                {'word': ' starward.', 'start': 7.88, 'end': 8.64, 'probability': 0.464096799492836, 'tokens': [3543, 1007, 13]}\n",
    "            ]\n",
    "        }\n",
    "    ]\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49d136e4-0f7d-4dcf-84f9-efb6f0eda491",
   "metadata": {},
   "source": [
    "<br />\n",
    "\n",
    "The function must also have `audio` as a parameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "33f03286-69f9-4ae1-aec0-250fd92a8cb6",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def inference(audio, **kwargs) -> dict:\n",
    "    # run model/API on the audio\n",
    "    # return data in a proper format\n",
    "    return essential_mapping"
   ]
  },
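  {
   "cell_type": "markdown",
   "id": "a7f1c9e2-1a2b-4c3d-8e4f-5a6b7c8d9e0f",
   "metadata": {},
   "source": [
    "Alternatively, the function can return the path to a JSON file containing the data. Below is a minimal sketch of that option; the dummy word timing and the temporary file path are placeholders, not part of Stable-ts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a7f1c9e2-2b3c-4d4e-9f5a-6b7c8d9e0f1a",
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import os\n",
    "import tempfile\n",
    "\n",
    "def inference_to_json(audio, **kwargs) -> str:\n",
    "    # pretend the ASR produced this word timing\n",
    "    data = [[{'word': ' Hello', 'start': 0.0, 'end': 0.5}]]\n",
    "    # write the mapping to a JSON file and return its path;\n",
    "    # transcribe_any() will load the data from the file\n",
    "    path = os.path.join(tempfile.gettempdir(), 'asr_result.json')\n",
    "    with open(path, 'w') as f:\n",
    "        json.dump(data, f)\n",
    "    return path"
   ]
  },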
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "d6710eb5-5386-42cf-b6e7-02a84b5fad40",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "result = stable_whisper.transcribe_any(inference, './demo.wav', vad=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "6d7f9de6-5c9b-4c73-808d-640b13efb051",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0\n",
      "00:00:01,122 --> 00:00:02,680\n",
      "And when no ocean,\n",
      "\n",
      "1\n",
      "00:00:03,280 --> 00:00:03,580\n",
      "mountain,\n",
      "\n",
      "2\n",
      "00:00:04,000 --> 00:00:06,046\n",
      "or sky could contain us,\n",
      "\n",
      "3\n",
      "00:00:06,402 --> 00:00:08,640\n",
      "our gaze hungered starward.\n"
     ]
    }
   ],
   "source": [
    "print(result.to_srt_vtt(word_level=False))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "be5a45e8-1b25-4a70-9af6-94bc5379fc7d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "    Transcribe an audio file using any ASR system.\n",
      "\n",
      "    Parameters\n",
      "    ----------\n",
      "    inference_func: Callable\n",
      "        Function that runs ASR when provided the [audio] and return data in the appropriate format.\n",
      "        For format examples: https://github.com/jianfch/stable-ts/blob/main/examples/non-whisper.ipynb\n",
      "\n",
      "    audio: Union[str, np.ndarray, torch.Tensor, bytes]\n",
      "        The path/URL to the audio file, the audio waveform, or bytes of audio file.\n",
      "\n",
      "    audio_type: str\n",
      "        The type that [audio] needs to be for [inference_func]. (Default: Same type as [audio])\n",
      "\n",
      "        Types:\n",
      "            None (default)\n",
      "                same type as [audio]\n",
      "\n",
      "            'str'\n",
      "                a path to the file\n",
      "                -if [audio] is a file and not audio preprocessing is done,\n",
      "                    [audio] will be directly passed into [inference_func]\n",
      "                -if audio preprocessing is performed (from [demucs] and/or [only_voice_freq]),\n",
      "                    the processed audio will be encoded into [temp_file] and then passed into [inference_func]\n",
      "\n",
      "            'byte'\n",
      "                bytes (used for APIs or to avoid writing any data to hard drive)\n",
      "                -if [audio] is file, the bytes of file is used\n",
      "                -if [audio] PyTorch tensor or NumPy array, the bytes of the [audio] encoded into WAV format is used\n",
      "\n",
      "            'torch'\n",
      "                a PyTorch tensor containing the audio waveform, in float32 dtype, on CPU\n",
      "\n",
      "            'numpy'\n",
      "                a NumPy array containing the audio waveform, in float32 dtype\n",
      "\n",
      "    input_sr: int\n",
      "        The sample rate of [audio]. (Default: Auto-detected if [audio] is str/bytes)\n",
      "\n",
      "    model_sr: int\n",
      "        The sample rate to resample the audio into for [inference_func]. (Default: Same as [input_sr])\n",
      "        Resampling is only performed when [model_sr] do not match the sample rate of the final audio due to:\n",
      "         -[input_sr] not matching\n",
      "         -sample rate changed due to audio preprocessing from [demucs]=True\n",
      "\n",
      "    inference_kwargs: dict\n",
      "        Dictionary of arguments provided to [inference_func]. (Default: None)\n",
      "\n",
      "    temp_file: str\n",
      "        Temporary path for the preprocessed audio when [audio_type]='str'. (Default: './_temp_stable-ts_audio_.wav')\n",
      "\n",
      "    verbose: bool\n",
      "        Whether to display the text being decoded to the console. If True, displays all the details,\n",
      "        If False, displays progressbar. If None, does not display anything (Default: False)\n",
      "\n",
      "    regroup: Union[bool, str]\n",
      "        Whether to regroup all words into segments with more natural boundaries. (Default: True)\n",
      "        Specify string for customizing the regrouping algorithm.\n",
      "        Ignored if [word_timestamps]=False.\n",
      "\n",
      "    suppress_silence: bool\n",
      "        Whether to suppress timestamp where audio is silent at segment-level\n",
      "        and word-level if [suppress_word_ts]=True. (Default: True)\n",
      "\n",
      "    suppress_word_ts: bool\n",
      "        Whether to suppress timestamps, if [suppress_silence]=True, where audio is silent at word-level. (Default: True)\n",
      "\n",
      "    q_levels: int\n",
      "        Quantization levels for generating timestamp suppression mask; ignored if [vad]=true. (Default: 20)\n",
      "        Acts as a threshold to marking sound as silent.\n",
      "        Fewer levels will increase the threshold of volume at which to mark a sound as silent.\n",
      "\n",
      "    k_size: int\n",
      "        Kernel size for avg-pooling waveform to generate timestamp suppression mask; ignored if [vad]=true. (Default: 5)\n",
      "        Recommend 5 or 3; higher sizes will reduce detection of silence.\n",
      "\n",
      "    demucs: bool\n",
      "        Whether to preprocess the audio track with Demucs to isolate vocals/remove noise. (Default: False)\n",
      "        Demucs must be installed to use. Official repo: https://github.com/facebookresearch/demucs\n",
      "\n",
      "    demucs_device: str\n",
      "        Device to use for demucs: 'cuda' or 'cpu'. (Default. 'cuda' if torch.cuda.is_available() else 'cpu')\n",
      "\n",
      "    demucs_output: str\n",
      "        Path to save the vocals isolated by Demucs as WAV file. Ignored if [demucs]=False.\n",
      "        Demucs must be installed to use. Official repo: https://github.com/facebookresearch/demucs\n",
      "\n",
      "    vad: bool\n",
      "        Whether to use Silero VAD to generate timestamp suppression mask. (Default: False)\n",
      "        Silero VAD requires PyTorch 1.12.0+. Official repo: https://github.com/snakers4/silero-vad\n",
      "\n",
      "    vad_threshold: float\n",
      "        Threshold for detecting speech with Silero VAD. (Default: 0.35)\n",
      "        Low threshold reduces false positives for silence detection.\n",
      "\n",
      "    vad_onnx: bool\n",
      "        Whether to use ONNX for Silero VAD. (Default: False)\n",
      "\n",
      "    min_word_dur: float\n",
      "        Only allow suppressing timestamps that result in word durations greater than this value. (default: 0.1)\n",
      "\n",
      "    only_voice_freq: bool\n",
      "        Whether to only use sound between 200 - 5000 Hz, where majority of human speech are. (Default: False)\n",
      "\n",
      "    only_ffmpeg: bool\n",
      "        Whether to use only FFmpeg (and not yt-dlp) for URls. (Default: False)\n",
      "\n",
      "    Returns\n",
      "    -------\n",
      "    An instance of WhisperResult.\n",
      "    \n"
     ]
    }
   ],
   "source": [
    "print(stable_whisper.transcribe_any.__doc__)"
   ]
  },
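  {
   "cell_type": "markdown",
   "id": "b3d2e4f6-5a6b-4c7d-8e9f-0a1b2c3d4e5f",
   "metadata": {},
   "source": [
    "<br />\n",
    "\n",
    "A sketch of a few of the parameters above: `audio_type='numpy'` hands the waveform to the function as a NumPy array, `model_sr` resamples it for the model, and `inference_kwargs` forwards extra arguments to the function. The model call itself is a placeholder returning fixed data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b3d2e4f6-6b7c-4d8e-9f0a-1b2c3d4e5f6a",
   "metadata": {},
   "outputs": [],
   "source": [
    "def inference_numpy(audio, language='en', **kwargs) -> dict:\n",
    "    # with audio_type='numpy', [audio] arrives as a float32 NumPy waveform\n",
    "    # resampled to [model_sr]; a real implementation would run the model here\n",
    "    return {'language': language,\n",
    "            'segments': [{'start': 0.0, 'end': 1.0, 'text': ' Hello.'}]}\n",
    "\n",
    "result = stable_whisper.transcribe_any(\n",
    "    inference_numpy,\n",
    "    './demo.wav',\n",
    "    audio_type='numpy',\n",
    "    model_sr=16000,\n",
    "    inference_kwargs=dict(language='en'),\n",
    ")"
   ]
  },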
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a99ee627-6ab4-411d-ba27-d372d3647593",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}