Upload ggml-silero-v6.2.0.bin
I generated ggml-silero-v6.2.0.bin and verified that it works with whisper-cli on the jfk.wav sample:
% whisper-cli ./samples/jfk.wav --vad --vad-model ./models/ggml-silero-v6.2.0.bin
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_init_with_params_no_state: use gpu = 1
whisper_init_with_params_no_state: flash attn = 1
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw = 0
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.022 sec
ggml_metal_device_init: GPU name: Apple M2
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 19069.67 MB
whisper_init_with_params_no_state: devices = 3
whisper_init_with_params_no_state: backends = 3
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs = 99
whisper_model_load: Metal total size = 147.37 MB
whisper_model_load: model size = 147.37 MB
whisper_backend_init_gpu: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: use bfloat = true
ggml_metal_init: use fusion = true
ggml_metal_init: use concurrency = true
ggml_metal_init: use graph optimize = true
whisper_backend_init: using BLAS backend
whisper_init_state: kv self size = 6.29 MB
whisper_init_state: kv cross size = 18.87 MB
whisper_init_state: kv pad size = 3.15 MB
whisper_init_state: compute buffer (conv) = 17.24 MB
whisper_init_state: compute buffer (encode) = 23.09 MB
whisper_init_state: compute buffer (cross) = 10.81 MB
whisper_init_state: compute buffer (decode) = 97.29 MB

system_info: n_threads = 4 / 8 | WHISPER : COREML = 0 | OPENVINO = 0 | Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | ACCELERATE = 1 | REPACK = 1 |
main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...
whisper_full: VAD is enabled, processing speech segments only
whisper_vad: VAD is enabled, processing speech segments only
whisper_vad_init_from_file_with_params: loading VAD model from './models/ggml-silero-v6.2.0.bin'
whisper_vad_init_with_params: model type: silero-16k
whisper_vad_init_with_params: model version: 6.2.0
whisper_vad_init_with_params: n_encoder_layers = 4
whisper_vad_init_with_params: encoder_in_channels[0] = 129
whisper_vad_init_with_params: encoder_in_channels[1] = 128
whisper_vad_init_with_params: encoder_in_channels[2] = 64
whisper_vad_init_with_params: encoder_in_channels[3] = 64
whisper_vad_init_with_params: encoder_out_channels[0] = 128
whisper_vad_init_with_params: encoder_out_channels[1] = 64
whisper_vad_init_with_params: encoder_out_channels[2] = 64
whisper_vad_init_with_params: encoder_out_channels[3] = 128
whisper_vad_init_with_params: lstm_input_size = 128
whisper_vad_init_with_params: lstm_hidden_size = 128
whisper_vad_init_with_params: final_conv_in = 128
whisper_vad_init_with_params: final_conv_out = 1
whisper_vad_init_with_params: CPU total size = 0.88 MB
whisper_vad_init_with_params: model size = 0.88 MB
whisper_backend_init_gpu: no GPU found
whisper_backend_init: using BLAS backend
whisper_vad_init_context: compute buffer (VAD) = 1.60 MB
whisper_vad_segments_from_samples: detecting speech timestamps in 176000 samples
whisper_vad_detect_speech: detecting speech in 176000 samples
whisper_vad_detect_speech: n_chunks: 344
whisper_vad_detect_speech: props size: 344
whisper_vad_detect_speech: chunk_len: 384 < n_window: 512
whisper_vad_detect_speech: vad time = 33.02 ms processing 176000 samples
whisper_vad_segments_from_probs: detecting speech timestamps using 344 probabilities
whisper_vad_segments_from_probs: Merged 1 adjacent segments, now have 4 segments
whisper_vad_segments_from_probs: Final speech segments after filtering: 4
whisper_vad_segments_from_probs: VAD segment 0: start = 0.32, end = 2.27 (duration: 1.95)
whisper_vad_segments_from_probs: VAD segment 1: start = 3.27, end = 4.41 (duration: 1.14)
whisper_vad_segments_from_probs: VAD segment 2: start = 5.38, end = 7.68 (duration: 2.30)
whisper_vad_segments_from_probs: VAD segment 3: start = 8.16, end = 10.62 (duration: 2.46)
whisper_vad: detected 4 speech segments
whisper_vad: Including segment 0: 0.32 - 2.37 (duration: 2.05)
whisper_vad: Including segment 1: 3.27 - 4.51 (duration: 1.24)
whisper_vad: Including segment 2: 5.38 - 7.78 (duration: 2.40)
whisper_vad: Including segment 3: 8.16 - 10.62 (duration: 2.46)
whisper_vad: total duration of speech segments: 8.15 seconds
whisper_vad: vad_segment_info: orig_start: 0.32, orig_end: 2.27, vad_start: 0.00, vad_end: 2.05
whisper_vad: vad_segment_info: orig_start: 3.27, orig_end: 4.41, vad_start: 2.15, vad_end: 3.39
whisper_vad: vad_segment_info: orig_start: 5.38, orig_end: 7.68, vad_start: 3.49, vad_end: 5.89
whisper_vad: vad_segment_info: orig_start: 8.16, orig_end: 10.62, vad_start: 5.99, vad_end: 8.45
whisper_vad: Created time mapping table with 44 points
whisper_vad: Reduced audio from 176000 to 135200 samples (23.2% reduction)

[00:00:00.320 --> 00:00:10.510] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 110.60 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 3.54 ms
whisper_print_timings: sample time = 38.23 ms / 145 runs ( 0.26 ms per run)
whisper_print_timings: encode time = 71.46 ms / 1 runs ( 71.46 ms per run)
whisper_print_timings: decode time = 19.76 ms / 2 runs ( 9.88 ms per run)
whisper_print_timings: batchd time = 78.24 ms / 139 runs ( 0.56 ms per run)
whisper_print_timings: prompt time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 392.64 ms
ggml_metal_free: deallocating
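In case it is useful, here is a minimal sketch of driving the same VAD path from the C API instead of whisper-cli. It is not part of this change and assumes the VAD fields exposed in whisper_full_params (vad, vad_model_path, vad_params) plus whisper_vad_default_params() from whisper.h; names may differ on older revisions.

```cpp
// Minimal sketch (not part of this PR): enabling the uploaded Silero VAD model
// through the whisper.cpp C API instead of the whisper-cli flags.
// Assumes the VAD fields in whisper_full_params (vad, vad_model_path, vad_params)
// and whisper_vad_default_params() from whisper.h.
#include "whisper.h"

#include <cstdio>
#include <vector>

int main() {
    struct whisper_context_params cparams = whisper_context_default_params();
    struct whisper_context * ctx =
        whisper_init_from_file_with_params("models/ggml-base.en.bin", cparams);
    if (ctx == nullptr) {
        return 1;
    }

    struct whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);
    wparams.vad            = true;                              // pre-filter audio with VAD
    wparams.vad_model_path = "models/ggml-silero-v6.2.0.bin";   // model uploaded in this PR
    wparams.vad_params     = whisper_vad_default_params();      // default thresholds/padding

    // pcm must hold 16 kHz mono float samples (e.g. decoded from jfk.wav)
    std::vector<float> pcm = /* load audio here */ {};

    if (whisper_full(ctx, wparams, pcm.data(), (int) pcm.size()) == 0) {
        for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
            printf("%s\n", whisper_full_get_segment_text(ctx, i));
        }
    }

    whisper_free(ctx);
    return 0;
}
```

This mirrors what whisper-cli does with --vad and --vad-model: the VAD model filters the audio down to speech segments before the Whisper model runs, which is where the sample-count reduction in the log above comes from.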