# NeMo ASR+VAD Inference
This example provides the ASR+VAD inference pipeline, with the option to perform only ASR or VAD alone.
## Input
There are two types of input:
- a manifest passed to `manifest_filepath`, or
- a directory containing audio files passed to `audio_dir`, together with `audio_type` (default `wav`).
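For example, a directory of audio files can be passed instead of a manifest. A minimal sketch, reusing the default models from the Usage section below (the directory path is a placeholder):

```bash
python speech_to_text_with_vad.py \
    audio_dir=/PATH/TO/AUDIO_DIR \
    audio_type=wav \
    vad_model=vad_multilingual_frame_marblenet \
    asr_model=stt_en_conformer_ctc_large \
    vad_config=../conf/vad/frame_vad_infer_postprocess.yaml
```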
The input manifest must be a JSON manifest file, where each line is a JSON dictionary. The fields `"audio_filepath"`, `"offset"`, and `"duration"` are required. An example of a manifest file is:

```
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000}
```
If you want to calculate WER, provide the ground-truth transcription in the `text` field of the manifest. An example of such a manifest file is:

```
{"audio_filepath": "/path/to/audio_file1", "offset": 0, "duration": 10000, "text": "hello world"}
{"audio_filepath": "/path/to/audio_file2", "offset": 0, "duration": 10000, "text": "hello world"}
```
## Output

The output is a folder storing the VAD predictions and/or a manifest containing the audio transcriptions. Some temporary data will also be stored.
## Usage
To run the code with ASR+VAD default settings:
```bash
python speech_to_text_with_vad.py \
    manifest_filepath=/PATH/TO/MANIFEST.json \
    vad_model=vad_multilingual_frame_marblenet \
    asr_model=stt_en_conformer_ctc_large \
    vad_config=../conf/vad/frame_vad_infer_postprocess.yaml
```
To use only ASR and disable VAD, set `vad_model=None` and `use_rttm=False`.

To use only VAD, set `asr_model=None` and specify both `vad_model` and `vad_config`. Both reduced pipelines are sketched below.

To enable profiling, set `profiling=True`, but this will significantly slow down the program.
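For instance, the two reduced pipelines could be invoked as follows (a sketch combining the documented flags with the placeholder paths above):

```bash
# ASR only: disable VAD and RTTM-based feature masking.
python speech_to_text_with_vad.py \
    manifest_filepath=/PATH/TO/MANIFEST.json \
    asr_model=stt_en_conformer_ctc_large \
    vad_model=None \
    use_rttm=False

# VAD only: produce speech/non-speech predictions without transcription.
python speech_to_text_with_vad.py \
    manifest_filepath=/PATH/TO/MANIFEST.json \
    asr_model=None \
    vad_model=vad_multilingual_frame_marblenet \
    vad_config=../conf/vad/frame_vad_infer_postprocess.yaml
```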
## Using RTTM to handle non-speech audio segments
To enable or disable RTTM usage, set `use_rttm` to `True` or `False`.

There are two ways to use RTTM files, specified by the parameter `rttm_mode`, which must be one of `mask` or `drop`. For `mask`, the RTTM file is used to mask the non-speech features; for `drop`, it is used to drop the non-speech features. For `rttm_mode='drop'`, it is recommended to use larger `pad_onset` and `pad_offset` values to avoid dropping speech features.

To use a specific value for feature masking, set `feat_mask_val` to the desired value. The default is `feat_mask_val=None`, in which case -16.530 (the log mel-spectrogram value of zero input) is used for `post_norm` and 0 (the same as SpecAugment) is used for `pre_norm`.

To normalize features before masking, set `normalize=pre_norm`; for masking before normalization, set `normalize=post_norm`.
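Putting these parameters together, a masking run might look like the sketch below (this particular combination of values is illustrative, not prescriptive):

```bash
# Normalize features first, then mask non-speech regions; with feat_mask_val
# unset, the pre_norm default mask value of 0 is used.
python speech_to_text_with_vad.py \
    manifest_filepath=/PATH/TO/MANIFEST.json \
    vad_model=vad_multilingual_frame_marblenet \
    asr_model=stt_en_conformer_ctc_large \
    vad_config=../conf/vad/frame_vad_infer_postprocess.yaml \
    use_rttm=True \
    rttm_mode=mask \
    normalize=pre_norm
```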
## Frame-VAD and Segment-VAD
- By default, `speech_to_text_with_vad.py` with `vad_config=../conf/vad/frame_vad_infer_postprocess.yaml` uses a frame-VAD model, which generates a speech/non-speech prediction for each 20 ms audio frame.
- To use segment-VAD, run `speech_to_text_with_vad.py vad_type='segment' vad_config=../conf/vad/vad_inference_postprocessing.yaml` instead (a full command is sketched after this list). In segment-VAD, the audio is split into segments, VAD is performed on each segment, and the segments are then stitched together to form the final output. The segment size and stride can be specified by `window_length_in_sec` and `shift_length_in_sec` in the VAD config (e.g., `../conf/vad/vad_inference_postprocessing.yaml`), with default values of 0.63 seconds and 0.08 seconds respectively.
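A complete segment-VAD command might look like this (a sketch; `vad_multilingual_marblenet` is assumed here as the segment-level counterpart of the default frame-VAD checkpoint):

```bash
python speech_to_text_with_vad.py \
    manifest_filepath=/PATH/TO/MANIFEST.json \
    vad_type='segment' \
    vad_model=vad_multilingual_marblenet \
    asr_model=stt_en_conformer_ctc_large \
    vad_config=../conf/vad/vad_inference_postprocessing.yaml
```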
## More options
- See more options in the `InferenceConfig` data class.