## Using Stable-ts with any ASR

In [None]:
import stable_whisper
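# Compare version parts numerically; joining the digits would read "2.6.10" as 2610.
version = tuple(int(part) for part in stable_whisper.__version__.split('.')[:3])
assert version >= (2, 7, 0), f"Requires Stable-ts 2.7.0+. Current version is {stable_whisper.__version__}."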

Stable-ts can be used with other ASR models or web APIs by wrapping them in a function and passing that function as the first argument to `stable_whisper.transcribe_any()`.

In [2]:
def inference(audio, **kwargs) -> dict:
    # run model/API 
    # return data as a dictionary
    data = {}
    return data

The data returned by the function must be one of the following:
- an instance of `WhisperResult` containing the data
- a dictionary or list in one of the mappings shown below
- a path to a JSON file containing data in one of those mappings
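The JSON-path option can be sketched as follows. This is a minimal illustration, not the library's own code: `inference_to_json` is a hypothetical name, the model/API call is replaced by segment data copied from this notebook's example, and writing to a temporary file is just one possible choice.

```python
import json
import os
import tempfile

def inference_to_json(audio, **kwargs) -> str:
    # Placeholder for a real model/API call on `audio`.
    # The segment data below is copied from this notebook's example.
    segments = [
        {'start': 0.0, 'end': 3.58, 'text': ' And when no ocean, mountain,'},
        {'start': 4.0, 'end': 8.64, 'text': ' or sky could contain us, our gaze hungered starward.'},
    ]
    # Write the mapping to a temporary JSON file and return its path.
    fd, path = tempfile.mkstemp(suffix='.json')
    with os.fdopen(fd, 'w', encoding='utf-8') as f:
        json.dump(segments, f)
    return path
```

The caller is responsible for deleting the temporary file once the result has been loaded.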

Here are the 3 types of mappings:

In [3]:
#1:
essential_mapping = [
    [   # 1st Segment
        {'word': ' And', 'start': 0.0, 'end': 1.28}, 
        {'word': ' when', 'start': 1.28, 'end': 1.52}, 
        {'word': ' no', 'start': 1.52, 'end': 2.26}, 
        {'word': ' ocean,', 'start': 2.26, 'end': 2.68},
        {'word': ' mountain,', 'start': 3.28, 'end': 3.58}
    ], 
    [   # 2nd Segment
        {'word': ' or', 'start': 4.0, 'end': 4.08}, 
        {'word': ' sky', 'start': 4.08, 'end': 4.56}, 
        {'word': ' could', 'start': 4.56, 'end': 4.84}, 
        {'word': ' contain', 'start': 4.84, 'end': 5.26}, 
        {'word': ' us,', 'start': 5.26, 'end': 6.27},
        {'word': ' our', 'start': 6.27, 'end': 6.58}, 
        {'word': ' gaze', 'start': 6.58, 'end': 6.98}, 
        {'word': ' hungered', 'start': 6.98, 'end': 7.88}, 
        {'word': ' starward.', 'start': 7.88, 'end': 8.64}
    ]
]
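Many ASR services return word timings in their own layout, so a small adapter is usually needed to reach mapping #1. The sketch below assumes a hypothetical response shape (`utterances`, `text`, `start_ms`, `end_ms` are invented for illustration) and converts it to the nested list above. The leading space on each word simply mirrors the examples in this notebook.

```python
# Hypothetical response shape: utterances holding words with millisecond timings.
response = {
    'utterances': [
        {'words': [
            {'text': 'And', 'start_ms': 0, 'end_ms': 1280},
            {'text': 'when', 'start_ms': 1280, 'end_ms': 1520},
        ]},
        {'words': [
            {'text': 'or', 'start_ms': 4000, 'end_ms': 4080},
        ]},
    ]
}

def to_essential_mapping(response: dict) -> list:
    # One inner list per utterance (segment), one dict per word.
    return [
        [
            {
                'word': ' ' + w['text'],          # leading space, as in the examples above
                'start': w['start_ms'] / 1000,    # milliseconds -> seconds
                'end': w['end_ms'] / 1000,
            }
            for w in utterance['words']
        ]
        for utterance in response['utterances']
    ]

mapping = to_essential_mapping(response)
```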

If word timings are not available, they can be omitted, but fewer operations can then be performed on the data.

In [4]:
#2:
no_word_mapping = [
    {
        'start': 0.0, 
        'end': 3.58, 
        'text': ' And when no ocean, mountain,',
    }, 
    {
        'start': 4.0, 
        'end': 8.64, 
        'text': ' or sky could contain us, our gaze hungered starward.', 
    }
]
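Mapping #2 carries the same segment boundaries and text as mapping #1, only without the per-word timings. The relationship can be sketched with a small helper; `words_to_segments` is a hypothetical name and not part of Stable-ts.

```python
def words_to_segments(word_segments: list) -> list:
    # Collapse each list of word dicts into a single segment dict:
    # start of the first word, end of the last word, concatenated text.
    return [
        {
            'start': words[0]['start'],
            'end': words[-1]['end'],
            'text': ''.join(w['word'] for w in words),
        }
        for words in word_segments
        if words  # skip empty segments
    ]

# First segment of the word-level example in this notebook.
first_segment = [
    {'word': ' And', 'start': 0.0, 'end': 1.28},
    {'word': ' when', 'start': 1.28, 'end': 1.52},
    {'word': ' no', 'start': 1.52, 'end': 2.26},
    {'word': ' ocean,', 'start': 2.26, 'end': 2.68},
    {'word': ' mountain,', 'start': 3.28, 'end': 3.58},
]
segments = words_to_segments([first_segment])
```

Going the other way is not possible without a model, which is why the word-level mapping is preferable when the ASR provides it.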

Below is the full mapping for normal Stable-ts results. `None` takes the place of any omitted value, except for `start`, `end`, and `text`/`word`, which are required.

In [5]:
#3:
full_mapping = {
    'language': 'en',
    'text': ' And when no ocean, mountain, or sky could contain us, our gaze hungered starward.', 
    'segments': [
        {
            'seek': 0.0, 
            'start': 0.0, 
            'end': 3.58, 
            'text': ' And when no ocean, mountain,', 
            'tokens': [400, 562, 572, 7810, 11, 6937, 11], 
            'temperature': 0.0, 
            'avg_logprob': -0.48702024376910663, 
            'compression_ratio': 1.0657894736842106, 
            'no_speech_prob': 0.3386174440383911, 
            'id': 0, 
            'words': [
                {'word': ' And', 'start': 0.04, 'end': 1.28, 'probability': 0.6481522917747498, 'tokens': [400]}, 
                {'word': ' when', 'start': 1.28, 'end': 1.52, 'probability': 0.9869539141654968, 'tokens': [562]}, 
                {'word': ' no', 'start': 1.52, 'end': 2.26, 'probability': 0.57384192943573, 'tokens': [572]}, 
                {'word': ' ocean,', 'start': 2.26, 'end': 2.68, 'probability': 0.9484889507293701, 'tokens': [7810, 11]},
                {'word': ' mountain,', 'start': 3.28, 'end': 3.58, 'probability': 0.9581122398376465, 'tokens': [6937, 11]}
            ]
        }, 
        {
            'seek': 0.0, 
            'start': 4.0, 
            'end': 8.64, 
            'text': ' or sky could contain us, our gaze hungered starward.', 
            'tokens': [420, 5443, 727, 5304, 505, 11, 527, 24294, 5753, 4073, 3543, 1007, 13], 
            'temperature': 0.0, 
            'avg_logprob': -0.48702024376910663, 
            'compression_ratio': 1.0657894736842106, 
            'no_speech_prob': 0.3386174440383911, 
            'id': 1, 
            'words': [
                {'word': ' or', 'start': 4.0, 'end': 4.08, 'probability': 0.9937937259674072, 'tokens': [420]}, 
                {'word': ' sky', 'start': 4.08, 'end': 4.56, 'probability': 0.9950089454650879, 'tokens': [5443]}, 
                {'word': ' could', 'start': 4.56, 'end': 4.84, 'probability': 0.9915681481361389, 'tokens': [727]}, 
                {'word': ' contain', 'start': 4.84, 'end': 5.26, 'probability': 0.898974597454071, 'tokens': [5304]}, 
                {'word': ' us,', 'start': 5.26, 'end': 6.27, 'probability': 0.999351441860199, 'tokens': [505, 11]},
                {'word': ' our', 'start': 6.27, 'end': 6.58, 'probability': 0.9634224772453308, 'tokens': [527]}, 
                {'word': ' gaze', 'start': 6.58, 'end': 6.98, 'probability': 0.8934874534606934, 'tokens': [24294]}, 
                {'word': ' hungered', 'start': 6.98, 'end': 7.88, 'probability': 0.7424876093864441, 'tokens': [5753, 4073]}, 
                {'word': ' starward.', 'start': 7.88, 'end': 8.64, 'probability': 0.464096799492836, 'tokens': [3543, 1007, 13]}
            ]
        }
    ]
}
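Before handing data to Stable-ts, it can help to check that the required keys are present. The helper below is a rough sketch under one reading of the rule above: it assumes `text` is required on segment dicts and `word` on word dicts, and it accepts any of the three mapping shapes. `find_missing_keys` is a hypothetical name and not part of the library.

```python
def find_missing_keys(data) -> list:
    # Mapping #3 is a dict with a 'segments' list; #1 and #2 are plain lists.
    segments = data['segments'] if isinstance(data, dict) else data
    problems = []
    for i, seg in enumerate(segments):
        if isinstance(seg, list):
            # Mapping #1: the segment is just a list of word dicts.
            words = seg
        else:
            # Mappings #2 and #3: the segment is a dict, optionally with 'words'.
            words = seg.get('words') or []
            for key in ('start', 'end', 'text'):
                if seg.get(key) is None:  # 0.0 is a valid value, None is not
                    problems.append(f'segment {i}: missing {key!r}')
        for j, word in enumerate(words):
            for key in ('start', 'end', 'word'):
                if word.get(key) is None:
                    problems.append(f'segment {i}, word {j}: missing {key!r}')
    return problems
```

An empty list means nothing required is missing under these assumptions.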

The function must also have `audio` as a parameter.

In [6]:
def inference(audio, **kwargs) -> list:
    # run model/API on the audio
    # return data in a proper format
    return essential_mapping
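The `audio` requirement can be checked up front with the standard library. This is a minimal sketch: `accepts_audio` is a hypothetical helper, and it only tests for a parameter explicitly named `audio`, which may be stricter or looser than whatever check Stable-ts itself performs.

```python
import inspect

def accepts_audio(func) -> bool:
    # True if the callable declares a parameter literally named 'audio'.
    return 'audio' in inspect.signature(func).parameters

def good_inference(audio, **kwargs):
    return []

def bad_inference(path, **kwargs):
    return []

ok = accepts_audio(good_inference)
not_ok = accepts_audio(bad_inference)
```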

In [7]:
result = stable_whisper.transcribe_any(inference, './demo.wav', vad=True)

In [8]:
print(result.to_srt_vtt(word_level=False))

0
00:00:01,122 --> 00:00:02,680
And when no ocean,

1
00:00:03,280 --> 00:00:03,580
mountain,

2
00:00:04,000 --> 00:00:06,046
or sky could contain us,

3
00:00:06,402 --> 00:00:08,640
our gaze hungered starward.


In [9]:
print(stable_whisper.transcribe_any.__doc__)


    Transcribe an audio file using any ASR system.

    Parameters
    ----------
    inference_func: Callable
        Function that runs ASR when provided the [audio] and return data in the appropriate format.
        For format examples: https://github.com/jianfch/stable-ts/blob/main/examples/non-whisper.ipynb

    audio: Union[str, np.ndarray, torch.Tensor, bytes]
        The path/URL to the audio file, the audio waveform, or bytes of audio file.

    audio_type: str
        The type that [audio] needs to be for [inference_func]. (Default: Same type as [audio])

        Types:
            None (default)
                same type as [audio]

            'str'
                a path to the file
                -if [audio] is a file and not audio preprocessing is done,
                    [audio] will be directly passed into [inference_func]
                -if audio preprocessing is performed (from [demucs] and/or [only_voice_freq]),
                    the processed audio will be en