Upload ggml-silero-v6.2.0.bin
I generated ggml-silero-v6.2.0.bin and verified that it works with whisper-cli on the jfk.wav sample:
% whisper-cli ./samples/jfk.wav --vad --vad-model ./models/ggml-silero-v6.2.0.bin
whisper_init_from_file_with_params_no_state: loading model from 'models/ggml-base.en.bin'
whisper_init_with_params_no_state: use gpu = 1
whisper_init_with_params_no_state: flash attn = 1
whisper_init_with_params_no_state: gpu_device = 0
whisper_init_with_params_no_state: dtw = 0
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.022 sec
ggml_metal_device_init: GPU name: Apple M2
ggml_metal_device_init: GPU family: MTLGPUFamilyApple8 (1008)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal4 (5002)
ggml_metal_device_init: simdgroup reduction = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory = true
ggml_metal_device_init: has bfloat = true
ggml_metal_device_init: use residency sets = true
ggml_metal_device_init: use shared buffers = true
ggml_metal_device_init: recommendedMaxWorkingSetSize = 19069.67 MB
whisper_init_with_params_no_state: devices = 3
whisper_init_with_params_no_state: backends = 3
whisper_model_load: loading model
whisper_model_load: n_vocab = 51864
whisper_model_load: n_audio_ctx = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx = 448
whisper_model_load: n_text_state = 512
whisper_model_load: n_text_head = 8
whisper_model_load: n_text_layer = 6
whisper_model_load: n_mels = 80
whisper_model_load: ftype = 1
whisper_model_load: qntvr = 0
whisper_model_load: type = 2 (base)
whisper_model_load: adding 1607 extra tokens
whisper_model_load: n_langs = 99
whisper_model_load: Metal total size = 147.37 MB
whisper_model_load: model size = 147.37 MB
whisper_backend_init_gpu: using Metal backend
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M2
ggml_metal_init: picking default device: Apple M2
ggml_metal_init: use bfloat = true
ggml_metal_init: use fusion = true
ggml_metal_init: use concurrency = true
ggml_metal_init: use graph optimize = true
whisper_backend_init: using BLAS backend
whisper_init_state: kv self size = 6.29 MB
whisper_init_state: kv cross size = 18.87 MB
whisper_init_state: kv pad size = 3.15 MB
whisper_init_state: compute buffer (conv) = 17.24 MB
whisper_init_state: compute buffer (encode) = 23.09 MB
whisper_init_state: compute buffer (cross) = 10.81 MB
whisper_init_state: compute buffer (decode) = 97.29 MB

system_info: n_threads = 4 / 8 | WHISPER : COREML = 0 | OPENVINO = 0 | Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | ACCELERATE = 1 | REPACK = 1 |
main: processing './samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...
whisper_full: VAD is enabled, processing speech segments only
whisper_vad: VAD is enabled, processing speech segments only
whisper_vad_init_from_file_with_params: loading VAD model from './models/ggml-silero-v6.2.0.bin'
whisper_vad_init_with_params: model type: silero-16k
whisper_vad_init_with_params: model version: 6.2.0
whisper_vad_init_with_params: n_encoder_layers = 4
whisper_vad_init_with_params: encoder_in_channels[0] = 129
whisper_vad_init_with_params: encoder_in_channels[1] = 128
whisper_vad_init_with_params: encoder_in_channels[2] = 64
whisper_vad_init_with_params: encoder_in_channels[3] = 64
whisper_vad_init_with_params: encoder_out_channels[0] = 128
whisper_vad_init_with_params: encoder_out_channels[1] = 64
whisper_vad_init_with_params: encoder_out_channels[2] = 64
whisper_vad_init_with_params: encoder_out_channels[3] = 128
whisper_vad_init_with_params: lstm_input_size = 128
whisper_vad_init_with_params: lstm_hidden_size = 128
whisper_vad_init_with_params: final_conv_in = 128
whisper_vad_init_with_params: final_conv_out = 1
whisper_vad_init_with_params: CPU total size = 0.88 MB
whisper_vad_init_with_params: model size = 0.88 MB
whisper_backend_init_gpu: no GPU found
whisper_backend_init: using BLAS backend
whisper_vad_init_context: compute buffer (VAD) = 1.60 MB
whisper_vad_segments_from_samples: detecting speech timestamps in 176000 samples
whisper_vad_detect_speech: detecting speech in 176000 samples
whisper_vad_detect_speech: n_chunks: 344
whisper_vad_detect_speech: props size: 344
whisper_vad_detect_speech: chunk_len: 384 < n_window: 512
whisper_vad_detect_speech: vad time = 33.02 ms processing 176000 samples
whisper_vad_segments_from_probs: detecting speech timestamps using 344 probabilities
whisper_vad_segments_from_probs: Merged 1 adjacent segments, now have 4 segments
whisper_vad_segments_from_probs: Final speech segments after filtering: 4
whisper_vad_segments_from_probs: VAD segment 0: start = 0.32, end = 2.27 (duration: 1.95)
whisper_vad_segments_from_probs: VAD segment 1: start = 3.27, end = 4.41 (duration: 1.14)
whisper_vad_segments_from_probs: VAD segment 2: start = 5.38, end = 7.68 (duration: 2.30)
whisper_vad_segments_from_probs: VAD segment 3: start = 8.16, end = 10.62 (duration: 2.46)
whisper_vad: detected 4 speech segments
whisper_vad: Including segment 0: 0.32 - 2.37 (duration: 2.05)
whisper_vad: Including segment 1: 3.27 - 4.51 (duration: 1.24)
whisper_vad: Including segment 2: 5.38 - 7.78 (duration: 2.40)
whisper_vad: Including segment 3: 8.16 - 10.62 (duration: 2.46)
whisper_vad: total duration of speech segments: 8.15 seconds
whisper_vad: vad_segment_info: orig_start: 0.32, orig_end: 2.27, vad_start: 0.00, vad_end: 2.05
whisper_vad: vad_segment_info: orig_start: 3.27, orig_end: 4.41, vad_start: 2.15, vad_end: 3.39
whisper_vad: vad_segment_info: orig_start: 5.38, orig_end: 7.68, vad_start: 3.49, vad_end: 5.89
whisper_vad: vad_segment_info: orig_start: 8.16, orig_end: 10.62, vad_start: 5.99, vad_end: 8.45
whisper_vad: Created time mapping table with 44 points
whisper_vad: Reduced audio from 176000 to 135200 samples (23.2% reduction)

[00:00:00.320 --> 00:00:10.510] And so my fellow Americans, ask not what your country can do for you, ask what you can do for your country.
whisper_print_timings: load time = 110.60 ms
whisper_print_timings: fallbacks = 0 p / 0 h
whisper_print_timings: mel time = 3.54 ms
whisper_print_timings: sample time = 38.23 ms / 145 runs ( 0.26 ms per run)
whisper_print_timings: encode time = 71.46 ms / 1 runs ( 71.46 ms per run)
whisper_print_timings: decode time = 19.76 ms / 2 runs ( 9.88 ms per run)
whisper_print_timings: batchd time = 78.24 ms / 139 runs ( 0.56 ms per run)
whisper_print_timings: prompt time = 0.00 ms / 1 runs ( 0.00 ms per run)
whisper_print_timings: total time = 392.64 ms
ggml_metal_free: deallocating
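In case it is useful, here is a minimal sketch of driving the same VAD path from the C API instead of whisper-cli. It is not part of this change and assumes the VAD fields exposed in whisper_full_params (vad, vad_model_path, vad_params) plus whisper_vad_default_params() from whisper.h; names may differ on older revisions.

```cpp
// Minimal sketch (not part of this PR): enabling the uploaded Silero VAD model
// through the whisper.cpp C API instead of the whisper-cli flags.
// Assumes the VAD fields in whisper_full_params (vad, vad_model_path, vad_params)
// and whisper_vad_default_params() from whisper.h.
#include "whisper.h"

#include <cstdio>
#include <vector>

int main() {
    struct whisper_context_params cparams = whisper_context_default_params();
    struct whisper_context * ctx =
        whisper_init_from_file_with_params("models/ggml-base.en.bin", cparams);
    if (ctx == nullptr) {
        return 1;
    }

    struct whisper_full_params wparams = whisper_full_default_params(WHISPER_SAMPLING_BEAM_SEARCH);
    wparams.vad            = true;                              // pre-filter audio with VAD
    wparams.vad_model_path = "models/ggml-silero-v6.2.0.bin";   // model uploaded in this PR
    wparams.vad_params     = whisper_vad_default_params();      // default thresholds/padding

    // pcm must hold 16 kHz mono float samples (e.g. decoded from jfk.wav)
    std::vector<float> pcm = /* load audio here */ {};

    if (whisper_full(ctx, wparams, pcm.data(), (int) pcm.size()) == 0) {
        for (int i = 0; i < whisper_full_n_segments(ctx); ++i) {
            printf("%s\n", whisper_full_get_segment_text(ctx, i));
        }
    }

    whisper_free(ctx);
    return 0;
}
```

This mirrors what whisper-cli does with --vad and --vad-model: the VAD model filters the audio down to speech segments before the Whisper model runs, which is where the sample-count reduction in the log above comes from.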