rpayanm
/

stable-ts

Model card Files Files and versions Community
File size: 16,127 Bytes
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "13dc05a3-de12-4d7a-a926-e99d6d97826e",
   "metadata": {},
   "source": [
    "## Using Stable-ts with any ASR"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5cfee322-ebca-4c23-87a4-a109a2f85203",
   "metadata": {},
   "outputs": [],
   "source": [
    "import stable_whisper\n",
    "assert int(stable_whisper.__version__.replace('.', '')) >= 270, f\"Requires Stable-ts 2.7.0+. Current version is {stable_whisper.__version__}.\""
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e6c2dab2-f4df-46f9-b2e8-94dd88522c7d",
   "metadata": {},
   "source": [
    "<br />\n",
    "\n",
    "Stable-ts can be used for other ASR models or web APIs by wrapping them as a function then passing it as the first argument to `non_whisper.transcribe_any()`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "7d32fa9f-a54c-4996-97c3-3b360230d029",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def inference(audio, **kwargs) -> dict:\n",
    "    # run model/API \n",
    "    # return data as a dictionary\n",
    "    data = {}\n",
    "    return data"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "856ef1fd-f489-42af-a90c-97323fd05a6b",
   "metadata": {},
   "source": [
    "The data returned by the function must be one of the following:\n",
    "- an instance of `WhisperResult` containing the data\n",
    "- a dictionary in an appropriate mapping\n",
    "- a path of JSON file containing data in an appropriate mapping"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bbdebdad-af1d-4077-8e99-20e767a0fd91",
   "metadata": {},
   "source": [
    "Here are the 3 types of mappings:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "06bc4ce7-5117-4674-8eb9-c343c13c18bc",
   "metadata": {},
   "outputs": [],
   "source": [
    "#1:\n",
    "essential_mapping = [\n",
    "    [   # 1st Segment\n",
    "        {'word': ' And', 'start': 0.0, 'end': 1.28}, \n",
    "        {'word': ' when', 'start': 1.28, 'end': 1.52}, \n",
    "        {'word': ' no', 'start': 1.52, 'end': 2.26}, \n",
    "        {'word': ' ocean,', 'start': 2.26, 'end': 2.68},\n",
    "        {'word': ' mountain,', 'start': 3.28, 'end': 3.58}\n",
    "    ], \n",
    "    [   # 2nd Segment\n",
    "        {'word': ' or', 'start': 4.0, 'end': 4.08}, \n",
    "        {'word': ' sky', 'start': 4.08, 'end': 4.56}, \n",
    "        {'word': ' could', 'start': 4.56, 'end': 4.84}, \n",
    "        {'word': ' contain', 'start': 4.84, 'end': 5.26}, \n",
    "        {'word': ' us,', 'start': 5.26, 'end': 6.27},\n",
    "        {'word': ' our', 'start': 6.27, 'end': 6.58}, \n",
    "        {'word': ' gaze', 'start': 6.58, 'end': 6.98}, \n",
    "        {'word': ' hungered', 'start': 6.98, 'end': 7.88}, \n",
    "        {'word': ' starward.', 'start': 7.88, 'end': 8.64}\n",
    "    ]\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b53bd812-2838-4f47-ab5f-5e729801aaee",
   "metadata": {},
   "source": [
    "<br />\n",
    "\n",
    "If word timings are not available they can be omitted, but operations that can be performed on this data will be limited."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "8c6bf720-5bfd-4e79-90e7-7049a2ca1d3a",
   "metadata": {},
   "outputs": [],
   "source": [
    "#2:\n",
    "no_word_mapping = [\n",
    "    {\n",
    "        'start': 0.0, \n",
    "        'end': 3.58, \n",
    "        'text': ' And when no ocean, mountain,',\n",
    "    }, \n",
    "    {\n",
    "        'start': 4.0, \n",
    "        'end': 8.64, \n",
    "        'text': ' or sky could contain us, our gaze hungered starward.', \n",
    "    }\n",
    "]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "108e960f-8bd1-4d2a-92bf-cc8cb56f4615",
   "metadata": {},
   "source": [
    "<br />\n",
    "\n",
    "Below is the full mapping for normal Stable-ts results. `None` takes the place of any omitted values except for `start`, `end`, and `text`/`word` which are required."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "2969aad2-c8bf-4043-8015-669a3102e158",
   "metadata": {},
   "outputs": [],
   "source": [
    "#3:\n",
    "full_mapping = {\n",
    "    'language': 'en',\n",
    "    'text': ' And when no ocean, mountain, or sky could contain us, our gaze hungered starward.', \n",
    "    'segments': [\n",
    "        {\n",
    "            'seek': 0.0, \n",
    "            'start': 0.0, \n",
    "            'end': 3.58, \n",
    "            'text': ' And when no ocean, mountain,', \n",
    "            'tokens': [400, 562, 572, 7810, 11, 6937, 11], \n",
    "            'temperature': 0.0, \n",
    "            'avg_logprob': -0.48702024376910663, \n",
    "            'compression_ratio': 1.0657894736842106, \n",
    "            'no_speech_prob': 0.3386174440383911, \n",
    "            'id': 0, \n",
    "            'words': [\n",
    "                {'word': ' And', 'start': 0.04, 'end': 1.28, 'probability': 0.6481522917747498, 'tokens': [400]}, \n",
    "                {'word': ' when', 'start': 1.28, 'end': 1.52, 'probability': 0.9869539141654968, 'tokens': [562]}, \n",
    "                {'word': ' no', 'start': 1.52, 'end': 2.26, 'probability': 0.57384192943573, 'tokens': [572]}, \n",
    "                {'word': ' ocean,', 'start': 2.26, 'end': 2.68, 'probability': 0.9484889507293701, 'tokens': [7810, 11]},\n",
    "                {'word': ' mountain,', 'start': 3.28, 'end': 3.58, 'probability': 0.9581122398376465, 'tokens': [6937, 11]}\n",
    "            ]\n",
    "        }, \n",
    "        {\n",
    "            'seek': 0.0, \n",
    "            'start': 4.0, \n",
    "            'end': 8.64, \n",
    "            'text': ' or sky could contain us, our gaze hungered starward.', \n",
    "            'tokens': [420, 5443, 727, 5304, 505, 11, 527, 24294, 5753, 4073, 3543, 1007, 13], \n",
    "            'temperature': 0.0, \n",
    "            'avg_logprob': -0.48702024376910663, \n",
    "            'compression_ratio': 1.0657894736842106, \n",
    "            'no_speech_prob': 0.3386174440383911, \n",
    "            'id': 1, \n",
    "            'words': [\n",
    "                {'word': ' or', 'start': 4.0, 'end': 4.08, 'probability': 0.9937937259674072, 'tokens': [420]}, \n",
    "                {'word': ' sky', 'start': 4.08, 'end': 4.56, 'probability': 0.9950089454650879, 'tokens': [5443]}, \n",
    "                {'word': ' could', 'start': 4.56, 'end': 4.84, 'probability': 0.9915681481361389, 'tokens': [727]}, \n",
    "                {'word': ' contain', 'start': 4.84, 'end': 5.26, 'probability': 0.898974597454071, 'tokens': [5304]}, \n",
    "                {'word': ' us,', 'start': 5.26, 'end': 6.27, 'probability': 0.999351441860199, 'tokens': [505, 11]},\n",
    "                {'word': ' our', 'start': 6.27, 'end': 6.58, 'probability': 0.9634224772453308, 'tokens': [527]}, \n",
    "                {'word': ' gaze', 'start': 6.58, 'end': 6.98, 'probability': 0.8934874534606934, 'tokens': [24294]}, \n",
    "                {'word': ' hungered', 'start': 6.98, 'end': 7.88, 'probability': 0.7424876093864441, 'tokens': [5753, 4073]}, \n",
    "                {'word': ' starward.', 'start': 7.88, 'end': 8.64, 'probability': 0.464096799492836, 'tokens': [3543, 1007, 13]}\n",
    "            ]\n",
    "        }\n",
    "    ]\n",
    "}"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49d136e4-0f7d-4dcf-84f9-efb6f0eda491",
   "metadata": {},
   "source": [
    "<br />\n",
    "\n",
    "The function must also have `audio` as a parameter."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "33f03286-69f9-4ae1-aec0-250fd92a8cb6",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "def inference(audio, **kwargs) -> dict:\n",
    "    # run model/API on the audio\n",
    "    # return data in a proper format\n",
    "    return essential_mapping"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "d6710eb5-5386-42cf-b6e7-02a84b5fad40",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "result = stable_whisper.transcribe_any(inference, './demo.wav', vad=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "6d7f9de6-5c9b-4c73-808d-640b13efb051",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0\n",
      "00:00:01,122 --> 00:00:02,680\n",
      "And when no ocean,\n",
      "\n",
      "1\n",
      "00:00:03,280 --> 00:00:03,580\n",
      "mountain,\n",
      "\n",
      "2\n",
      "00:00:04,000 --> 00:00:06,046\n",
      "or sky could contain us,\n",
      "\n",
      "3\n",
      "00:00:06,402 --> 00:00:08,640\n",
      "our gaze hungered starward.\n"
     ]
    }
   ],
   "source": [
    "print(result.to_srt_vtt(word_level=False))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "be5a45e8-1b25-4a70-9af6-94bc5379fc7d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "    Transcribe an audio file using any ASR system.\n",
      "\n",
      "    Parameters\n",
      "    ----------\n",
      "    inference_func: Callable\n",
      "        Function that runs ASR when provided the [audio] and return data in the appropriate format.\n",
      "        For format examples: https://github.com/jianfch/stable-ts/blob/main/examples/non-whisper.ipynb\n",
      "\n",
      "    audio: Union[str, np.ndarray, torch.Tensor, bytes]\n",
      "        The path/URL to the audio file, the audio waveform, or bytes of audio file.\n",
      "\n",
      "    audio_type: str\n",
      "        The type that [audio] needs to be for [inference_func]. (Default: Same type as [audio])\n",
      "\n",
      "        Types:\n",
      "            None (default)\n",
      "                same type as [audio]\n",
      "\n",
      "            'str'\n",
      "                a path to the file\n",
      "                -if [audio] is a file and not audio preprocessing is done,\n",
      "                    [audio] will be directly passed into [inference_func]\n",
      "                -if audio preprocessing is performed (from [demucs] and/or [only_voice_freq]),\n",
      "                    the processed audio will be encoded into [temp_file] and then passed into [inference_func]\n",
      "\n",
      "            'byte'\n",
      "                bytes (used for APIs or to avoid writing any data to hard drive)\n",
      "                -if [audio] is file, the bytes of file is used\n",
      "                -if [audio] PyTorch tensor or NumPy array, the bytes of the [audio] encoded into WAV format is used\n",
      "\n",
      "            'torch'\n",
      "                a PyTorch tensor containing the audio waveform, in float32 dtype, on CPU\n",
      "\n",
      "            'numpy'\n",
      "                a NumPy array containing the audio waveform, in float32 dtype\n",
      "\n",
      "    input_sr: int\n",
      "        The sample rate of [audio]. (Default: Auto-detected if [audio] is str/bytes)\n",
      "\n",
      "    model_sr: int\n",
      "        The sample rate to resample the audio into for [inference_func]. (Default: Same as [input_sr])\n",
      "        Resampling is only performed when [model_sr] do not match the sample rate of the final audio due to:\n",
      "         -[input_sr] not matching\n",
      "         -sample rate changed due to audio preprocessing from [demucs]=True\n",
      "\n",
      "    inference_kwargs: dict\n",
      "        Dictionary of arguments provided to [inference_func]. (Default: None)\n",
      "\n",
      "    temp_file: str\n",
      "        Temporary path for the preprocessed audio when [audio_type]='str'. (Default: './_temp_stable-ts_audio_.wav')\n",
      "\n",
      "    verbose: bool\n",
      "        Whether to display the text being decoded to the console. If True, displays all the details,\n",
      "        If False, displays progressbar. If None, does not display anything (Default: False)\n",
      "\n",
      "    regroup: Union[bool, str]\n",
      "        Whether to regroup all words into segments with more natural boundaries. (Default: True)\n",
      "        Specify string for customizing the regrouping algorithm.\n",
      "        Ignored if [word_timestamps]=False.\n",
      "\n",
      "    suppress_silence: bool\n",
      "        Whether to suppress timestamp where audio is silent at segment-level\n",
      "        and word-level if [suppress_word_ts]=True. (Default: True)\n",
      "\n",
      "    suppress_word_ts: bool\n",
      "        Whether to suppress timestamps, if [suppress_silence]=True, where audio is silent at word-level. (Default: True)\n",
      "\n",
      "    q_levels: int\n",
      "        Quantization levels for generating timestamp suppression mask; ignored if [vad]=true. (Default: 20)\n",
      "        Acts as a threshold to marking sound as silent.\n",
      "        Fewer levels will increase the threshold of volume at which to mark a sound as silent.\n",
      "\n",
      "    k_size: int\n",
      "        Kernel size for avg-pooling waveform to generate timestamp suppression mask; ignored if [vad]=true. (Default: 5)\n",
      "        Recommend 5 or 3; higher sizes will reduce detection of silence.\n",
      "\n",
      "    demucs: bool\n",
      "        Whether to preprocess the audio track with Demucs to isolate vocals/remove noise. (Default: False)\n",
      "        Demucs must be installed to use. Official repo: https://github.com/facebookresearch/demucs\n",
      "\n",
      "    demucs_device: str\n",
      "        Device to use for demucs: 'cuda' or 'cpu'. (Default. 'cuda' if torch.cuda.is_available() else 'cpu')\n",
      "\n",
      "    demucs_output: str\n",
      "        Path to save the vocals isolated by Demucs as WAV file. Ignored if [demucs]=False.\n",
      "        Demucs must be installed to use. Official repo: https://github.com/facebookresearch/demucs\n",
      "\n",
      "    vad: bool\n",
      "        Whether to use Silero VAD to generate timestamp suppression mask. (Default: False)\n",
      "        Silero VAD requires PyTorch 1.12.0+. Official repo: https://github.com/snakers4/silero-vad\n",
      "\n",
      "    vad_threshold: float\n",
      "        Threshold for detecting speech with Silero VAD. (Default: 0.35)\n",
      "        Low threshold reduces false positives for silence detection.\n",
      "\n",
      "    vad_onnx: bool\n",
      "        Whether to use ONNX for Silero VAD. (Default: False)\n",
      "\n",
      "    min_word_dur: float\n",
      "        Only allow suppressing timestamps that result in word durations greater than this value. (default: 0.1)\n",
      "\n",
      "    only_voice_freq: bool\n",
      "        Whether to only use sound between 200 - 5000 Hz, where majority of human speech are. (Default: False)\n",
      "\n",
      "    only_ffmpeg: bool\n",
      "        Whether to use only FFmpeg (and not yt-dlp) for URls. (Default: False)\n",
      "\n",
      "    Returns\n",
      "    -------\n",
      "    An instance of WhisperResult.\n",
      "    \n"
     ]
    }
   ],
   "source": [
    "print(stable_whisper.transcribe_any.__doc__)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a99ee627-6ab4-411d-ba27-d372d3647593",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.15"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}