TEN-framework / ten-vad

@@ -11,12 +11,11 @@
 [![GitHub forks](https://img.shields.io/github/forks/TEN-framework/ten-vad?style=social&label=Fork)](https://GitHub.com/TEN-framework/ten-vad/network/?WT.mc_id=academic-105485-koreyst)
 [![GitHub stars](https://img.shields.io/github/stars/TEN-framework/ten-vad?style=social&label=Star)](https://GitHub.com/TEN-framework/ten-vad/stargazers/?WT.mc_id=academic-105485-koreyst)
-<br>
 *Latest News* 🔥
 - [2025/06] We **finally** released and **open-sourced** the **ONNX** model and the corresponding **preprocessing code**! Now you can deploy **TEN VAD** on **any platform** and **any hardware architecture**!
 - [2025/06] We are excited to announce the release of **WASM+JS** for Web WASM Support.
-<br>
 ## Table of Contents
@@ -46,13 +45,11 @@
 - [Citations](#citations)
 - [License](#license)
-<br>
 ## Welcome to TEN
 TEN is a collection of open-source projects for building real-time, multimodal conversational voice agents. It includes [ TEN Framework ](https://github.com/ten-framework/ten-framework), [ TEN VAD ](https://github.com/ten-framework/ten-vad), [ TEN Turn Detection ](https://github.com/ten-framework/ten-turn-detection), TEN Agent, TMAN Designer, and [ TEN Portal ](https://github.com/ten-framework/portal), all fully open-source.
-<br>
 | Community Channel | Purpose |
 | ---------------- | ------- |
@@ -62,7 +59,6 @@ TEN is a collection of open-source projects for building real-time, multimodal c
 | [![Hugging Face Space](https://img.shields.io/badge/Hugging%20Face-TEN%20Framework-yellow?style=flat&logo=huggingface)](https://huggingface.co/TEN-framework) | Join our Hugging Face community to explore our spaces and models |
 | [![WeChat](https://img.shields.io/badge/TEN_Framework-WeChat_Group-%2307C160?logo=wechat&labelColor=darkgreen&color=gray)](https://github.com/TEN-framework/ten-agent/discussions/170) | Join our WeChat group for Chinese community discussions |
-<br>
 > \[!IMPORTANT]
 >
@@ -70,11 +66,9 @@ TEN is a collection of open-source projects for building real-time, multimodal c
 >
 > Get instant notifications for new releases and updates. Your support helps us grow and improve TEN!
-<br>
 ![TEN star us gif](https://github.com/user-attachments/assets/eeebe996-8c14-4bf7-82ae-f1a1f7e30705)
-<br>
 ## TEN Hugging Face Space
@@ -82,13 +76,11 @@ TEN is a collection of open-source projects for building real-time, multimodal c
 You are welcome to [visit the TEN Hugging Face Space](https://huggingface.co/spaces/TEN-framework/ten-agent-demo) to try VAD and Turn Detection together.
-<br>
 ## **Introduction**
 **TEN VAD** is a real-time voice activity detection system designed for enterprise use, providing accurate frame-level speech activity detection. It shows superior precision compared to both WebRTC VAD and Silero VAD, which are commonly used in the industry. Additionally, TEN VAD offers lower computational complexity and reduced memory usage compared to Silero VAD. The architecture's temporal efficiency also enables rapid voice activity detection, significantly reducing end-to-end response and turn detection latency in conversational AI systems.
-<br>
 ## **Key Features**
@@ -96,7 +88,6 @@ You are more than welcome to [Visit TEN Hugging Face Space](https://huggingface.
 The precision-recall curves comparing the performance of WebRTC VAD (pitch-based), Silero VAD, and TEN VAD are shown below. The evaluation was conducted on a precisely, manually annotated test set. The audio files come from LibriSpeech, GigaSpeech, DNS Challenge, etc. As demonstrated, TEN VAD achieves the best performance. Additionally, cross-validation experiments conducted on large internal real-world datasets demonstrate the reproducibility of these findings. The **test set with annotated labels** is released in the "testset" directory of this repository.
- <br>
 <div style="text-align:">
   <img src="./examples/images/PR_Curves_testset.png" width="800">
@@ -108,14 +99,12 @@ Note that the default threshold of 0.5 is used to generate binary speech indicat
 cd ./examples
 python plot_pr_curves.py
 ```
-<br>
 ### **2. Agent-Friendly:**
 As illustrated in the figure below, TEN VAD rapidly detects speech-to-non-speech transitions, whereas Silero VAD suffers from a delay of several hundred milliseconds, resulting in increased end-to-end latency in human-agent interaction systems. In addition, as demonstrated in the 6.5s-7.0s audio segment, Silero VAD fails to identify short silent durations between adjacent speech segments.
 <div style="text-align:">
   <img src="./examples/images/Agent-Friendly-image.png" width="800">
 </div>
-<br>
 ### **3. Lightweight:**
 We evaluated the RTF (Real-Time Factor) on five distinct platforms, each with a different CPU. TEN VAD shows much lower computational complexity and a smaller library size than Silero VAD.
@@ -126,7 +115,6 @@ We evaluated the RTF (Real-Time Factor) across five distinct platforms, each equ
     <th align="center" rowspan="2" valign="middle"> CPU </th>
     <th align="center" colspan="2"> RTF </th>
     <th align="center" colspan="2"> Lib Size </th>
   </tr>
   <tr>
     <th align="center" style="white-space: nowrap;"> TEN VAD </th>
@@ -138,16 +126,16 @@ We evaluated the RTF (Real-Time Factor) across five distinct platforms, each equ
     <th align="center" rowspan="3"> Linux </th>
     <td style="white-space: nowrap;"> AMD Ryzen 9 5900X 12-Core </td>
     <td align="center"> 0.0150 </td>
-    <td align="center" rowspan="2" valign="middle"> / </td>
-    <td align="center" rowspan="3" valign="middle"> 306KB </td>
-    <td align="center" rowspan="10" style="white-space: nowrap;" valign="middle"> 2.16MB(JIT) / 2.22MB(ONNX) </td>
   </tr>
   <tr>
-    <td style="white-space: nowrap;"> Intel(R) Xeon(R) Platinum 8253 </td>
     <td align="center"> 0.0136 </td>
   </tr>
   <tr>
-    <td style="white-space: nowrap;"> Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz </td>
     <td align="center"> 0.0086 </td>
     <td align="center"> 0.0127 </td>
   </tr>
@@ -155,7 +143,7 @@ We evaluated the RTF (Real-Time Factor) across five distinct platforms, each equ
     <th align="center"> Windows </th>
     <td> Intel i7-10710U </td>
     <td align="center"> 0.0150 </td>
-    <td align="center" rowspan="7" valign="middle"> / </td>
     <td align="center" style="white-space: nowrap;"> 464KB(x86) / 508KB(x64) </td>
   </tr>
   <tr>
@@ -164,17 +152,11 @@ We evaluated the RTF (Real-Time Factor) across five distinct platforms, each equ
     <td align="center"> 0.0160 </td>
     <td align="center"> 731KB </td>
   </tr>
-  <tr>
-    <th align="center"> Web </th>
-    <td> macOS(M1) </td>
-    <td align="center"> 0.010 </td>
-    <td align="center"> 277KB </td>
-  </tr>
   <tr>
     <th align="center" rowspan="2"> Android </th>
     <td> Galaxy J6+ (32bit, 425) </td>
     <td align="center"> 0.0570 </td>
-    <td align="center" rowspan="2" style="white-space: nowrap;"> 373KB(v7a) / 532KB(v8a)</td>
   </tr>
   <tr>
     <td> Oppo A3s (450) </td>
@@ -184,31 +166,31 @@ We evaluated the RTF (Real-Time Factor) across five distinct platforms, each equ
     <th align="center" rowspan="2"> iOS </th>
     <td> iPhone6 (A8) </td>
     <td align="center"> 0.0210 </td>
-    <td align="center" rowspan="2"> 320KB</td>
   </tr>
   <tr>
     <td> iPhone8 (A11) </td>
     <td align="center"> 0.0050 </td>
   </tr>
 </table>
-<br>
 ### **4. Multiple programming languages and platforms:**
 TEN VAD provides cross-platform C compatibility across five operating systems (Linux x64, Windows, macOS, Android, iOS), Python bindings optimized for Linux x64, and a WASM build for the Web.
-<br>
-<br>
 ### **5. Supported sampling rate and hop size:**
 TEN VAD operates on 16kHz audio input with configurable hop sizes (optimized frame configurations: 160/256 samples=10/16ms). Audio at other sampling rates must be resampled to 16kHz first.
-<br>
-<br>
 ## **Installation**
 ```
-git clone https://github.com/TEN-framework/ten-vad.git
 ```
-<br>
 ## **Quick Start**
 The project supports five major platforms with dynamic library linking.
@@ -226,7 +208,7 @@ The project supports five major platforms with dynamic library linking.
     <td align="center"> libten_vad.so </td>
     <td align="center"> x64 </td>
     <td align="center"> Python, C </td>
-    <td rowspan="6">ten_vad.h <br> ten_vad.py <br> ten_vad.js</td>
     <td>  </td>
   </tr>
   <tr>
@@ -243,13 +225,6 @@ The project supports five major platforms with dynamic library linking.
     <td align="center"> C </td>
     <td>  </td>
   </tr>
-  <tr>
-    <th align="center"> Web </th>
-    <td align="center"> ten_vad.wasm </td>
-    <td align="center"> / </td>
-    <td align="center"> JS </td>
-    <td>  </td>
-  </tr>
   <tr>
     <th align="center"> Android </th>
     <td align="center"> libten_vad.so </td>
@@ -259,14 +234,12 @@ The project supports five major platforms with dynamic library linking.
   </tr>
   <tr>
     <th align="center"> iOS </th>
-    <td align="center"> ten_vad.framework </td>
-    <td align="center"> arm64 </td>
     <td align="center"> C </td>
     <td> 1. not simulator <br> 2. not iPad </td>
   </tr>
 </table>
-<br>
 ### **Python Usage**
 #### **1. Linux**
@@ -321,7 +294,7 @@ cd ./examples
 ```
 python test.py s0724-s0730.wav out.txt
 ```
-<br>
 ##### **By using pip:**
@@ -336,7 +309,7 @@ pip install -U --force-reinstall -v git+https://github.com/TEN-framework/ten-vad
 ```
 from ten_vad import TenVad
 ```
-<br>
 ### **JS Usage**
@@ -350,7 +323,7 @@ from ten_vad import TenVad
 1) cd ./examples
 2) node test_node.js s0724-s0730.wav out.txt
 ```
-<br>
 ### **C Usage**
 #### **Build Scripts**
@@ -380,7 +353,7 @@ Runtime library path configuration:
 - Run demo with sample audio s0724-s0730.wav
 - Processed results saved to out.txt
-<br>
 The detailed usage methods of each platform are as follows <br>
@@ -410,7 +383,6 @@ You have to download the **onnxruntime** packages from the [official website](ht
 ```
 Note: If you run the ONNX demo from a directory other than the one used when running build-and-deploy-linux.sh, be sure to create a symbolic link to src/onnx_model/ so that the ONNX model file can be loaded.
-<br>
 ####  **2. Windows**
 ##### **Requirements**
@@ -426,7 +398,7 @@ Note: If executing the onnx demo from a different directory than the one used wh
     - Visual Studio version (default: 2019)
 3) ./build-and-deploy-windows.bat
 ```
-<br>
 ####  **3. macOS**
 ##### **Requirements**
@@ -441,7 +413,7 @@ Note: If executing the onnx demo from a different directory than the one used wh
   - Alternative: x86_64 (Intel)
 3) ./build-and-deploy-mac.sh
 ```
-<br>
 ####  **4. Android**
 ##### **Requirements**
@@ -458,7 +430,7 @@ Note: If executing the onnx demo from a different directory than the one used wh
   - Toolchain: aarch64-linux-android-clang (default) or custom NDK toolchain
 4) ./build-and-deploy-android.sh
 ```
-<br>
 ####  **5. iOS**
 ##### **Requirements**
@@ -510,7 +482,7 @@ cd ./examples
     3.5. Build in Xcode and run demo on your device.
-<br>
 ## TEN Ecosystem
@@ -531,7 +503,6 @@ cd ./examples
 Most questions can be answered with DeepWiki; it is fast, intuitive to use, and supports multiple languages.
-<br>
 ## **Citations**
 ```
@@ -545,14 +516,13 @@ Most questions can be answered by using DeepWiki, it is fast, intutive to use an
   email = {[email protected]}
 }
 ```
-<br>
 ## License
 This project is licensed under Apache 2.0 with additional conditions. Refer to the "LICENSE" file in the root directory for detailed information. Note that `pitch_est.cc` contains modified code derived from [LPCNet](https://github.com/xiph/LPCNet), which is [BSD-2-Clause](https://spdx.org/licenses/BSD-2-Clause.html) and [BSD-3-Clause](https://spdx.org/licenses/BSD-3-Clause.html) licensed; refer to the NOTICES file in the root directory for detailed information.
-<br>
 [back-to-top]: https://img.shields.io/badge/-Back_to_top-gray?style=flat-square

 [![GitHub forks](https://img.shields.io/github/forks/TEN-framework/ten-vad?style=social&label=Fork)](https://GitHub.com/TEN-framework/ten-vad/network/?WT.mc_id=academic-105485-koreyst)
 [![GitHub stars](https://img.shields.io/github/stars/TEN-framework/ten-vad?style=social&label=Star)](https://GitHub.com/TEN-framework/ten-vad/stargazers/?WT.mc_id=academic-105485-koreyst)
 *Latest News* 🔥
 - [2025/06] We **finally** released and **open-sourced** the **ONNX** model and the corresponding **preprocessing code**! Now you can deploy **TEN VAD** on **any platform** and **any hardware architecture**!
 - [2025/06] We are excited to announce the release of **WASM+JS** for Web WASM Support.
 ## Table of Contents
 - [Citations](#citations)
 - [License](#license)
 ## Welcome to TEN
 TEN is a collection of open-source projects for building real-time, multimodal conversational voice agents. It includes [ TEN Framework ](https://github.com/ten-framework/ten-framework), [ TEN VAD ](https://github.com/ten-framework/ten-vad), [ TEN Turn Detection ](https://github.com/ten-framework/ten-turn-detection), TEN Agent, TMAN Designer, and [ TEN Portal ](https://github.com/ten-framework/portal), all fully open-source.
 | Community Channel | Purpose |
 | ---------------- | ------- |
 | [![Hugging Face Space](https://img.shields.io/badge/Hugging%20Face-TEN%20Framework-yellow?style=flat&logo=huggingface)](https://huggingface.co/TEN-framework) | Join our Hugging Face community to explore our spaces and models |
 | [![WeChat](https://img.shields.io/badge/TEN_Framework-WeChat_Group-%2307C160?logo=wechat&labelColor=darkgreen&color=gray)](https://github.com/TEN-framework/ten-agent/discussions/170) | Join our WeChat group for Chinese community discussions |
 > \[!IMPORTANT]
 >
 >
 > Get instant notifications for new releases and updates. Your support helps us grow and improve TEN!
 ![TEN star us gif](https://github.com/user-attachments/assets/eeebe996-8c14-4bf7-82ae-f1a1f7e30705)
 ## TEN Hugging Face Space
 You are welcome to [visit the TEN Hugging Face Space](https://huggingface.co/spaces/TEN-framework/ten-agent-demo) to try VAD and Turn Detection together.
 ## **Introduction**
 **TEN VAD** is a real-time voice activity detection system designed for enterprise use, providing accurate frame-level speech activity detection. It shows superior precision compared to both WebRTC VAD and Silero VAD, which are commonly used in the industry. Additionally, TEN VAD offers lower computational complexity and reduced memory usage compared to Silero VAD. The architecture's temporal efficiency also enables rapid voice activity detection, significantly reducing end-to-end response and turn detection latency in conversational AI systems.
 ## **Key Features**
 The precision-recall curves comparing the performance of WebRTC VAD (pitch-based), Silero VAD, and TEN VAD are shown below. The evaluation was conducted on a precisely, manually annotated test set. The audio files come from LibriSpeech, GigaSpeech, DNS Challenge, etc. As demonstrated, TEN VAD achieves the best performance. Additionally, cross-validation experiments conducted on large internal real-world datasets demonstrate the reproducibility of these findings. The **test set with annotated labels** is released in the "testset" directory of this repository.
 <div style="text-align:">
   <img src="./examples/images/PR_Curves_testset.png" width="800">
 cd ./examples
 python plot_pr_curves.py
 ```
 ### **2. Agent-Friendly:**
 As illustrated in the figure below, TEN VAD rapidly detects speech-to-non-speech transitions, whereas Silero VAD suffers from a delay of several hundred milliseconds, resulting in increased end-to-end latency in human-agent interaction systems. In addition, as demonstrated in the 6.5s-7.0s audio segment, Silero VAD fails to identify short silent durations between adjacent speech segments.
 <div style="text-align:">
   <img src="./examples/images/Agent-Friendly-image.png" width="800">
 </div>
 ### **3. Lightweight:**
 We evaluated the RTF (Real-Time Factor) on five distinct platforms, each with a different CPU. TEN VAD shows much lower computational complexity and a smaller library size than Silero VAD.
     <th align="center" rowspan="2" valign="middle"> CPU </th>
     <th align="center" colspan="2"> RTF </th>
     <th align="center" colspan="2"> Lib Size </th>
   </tr>
   <tr>
     <th align="center" style="white-space: nowrap;"> TEN VAD </th>
     <th align="center" rowspan="3"> Linux </th>
     <td style="white-space: nowrap;"> AMD Ryzen 9 5900X 12-Core </td>
     <td align="center"> 0.0150 </td>
+    <td rowspan="2" style="text-align: center; vertical-align: middle;"> / </td>
+    <td rowspan="3" style="text-align: center; vertical-align: middle;"> 306KB </td>
+    <td rowspan="9" style="text-align: center; vertical-align: middle;"> 2.16MB(JIT) / 2.22MB(ONNX) </td>
   </tr>
   <tr>
+    <td > Intel(R) Xeon(R) Platinum 8253 </td>
     <td align="center"> 0.0136 </td>
   </tr>
   <tr>
+    <td > Intel(R) Xeon(R) Gold 6348 CPU @ 2.60GHz </td>
     <td align="center"> 0.0086 </td>
     <td align="center"> 0.0127 </td>
   </tr>
     <th align="center"> Windows </th>
     <td> Intel i7-10710U </td>
     <td align="center"> 0.0150 </td>
+    <td rowspan="6" style="text-align: center; vertical-align: middle;"> / </td>
     <td align="center" style="white-space: nowrap;"> 464KB(x86) / 508KB(x64) </td>
   </tr>
   <tr>
     <td align="center"> 0.0160 </td>
     <td align="center"> 731KB </td>
   </tr>
   <tr>
     <th align="center" rowspan="2"> Android </th>
     <td> Galaxy J6+ (32bit, 425) </td>
     <td align="center"> 0.0570 </td>
+    <td rowspan="2" style="text-align: center; vertical-align: middle;"> 373KB(v7a) / 532KB(v8a)</td>
   </tr>
   <tr>
     <td> Oppo A3s (450) </td>
     <th align="center" rowspan="2"> iOS </th>
     <td> iPhone6 (A8) </td>
     <td align="center"> 0.0210 </td>
+    <td rowspan="2" style="text-align: center; vertical-align: middle;"> 320KB</td>
   </tr>
   <tr>
     <td> iPhone8 (A11) </td>
     <td align="center"> 0.0050 </td>
   </tr>
 </table>
+<style>
+  th, td {
+    border: 1px solid #ddd;
+    padding: 8px;
+  }
+</style>
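
As a rough illustration of how RTF figures like those in the table above can be measured, the sketch below times a per-frame function over a buffer and divides the processing time by the audio duration. It is only a sketch under stated assumptions: `process_frame` is a hypothetical stand-in for a detector call, not the TEN VAD API, and the measured value depends entirely on the machine it runs on.

```python
import time

SAMPLE_RATE = 16000  # Hz, the input rate stated in this README
HOP_SIZE = 256       # samples per frame (16 ms at 16 kHz)


def process_frame(frame):
    """Hypothetical stand-in for a per-frame VAD call (NOT the TEN VAD API)."""
    return sum(abs(s) for s in frame) / len(frame)


def real_time_factor(samples, hop_size=HOP_SIZE, sample_rate=SAMPLE_RATE):
    """Return processing time divided by audio duration (lower is faster)."""
    n_frames = len(samples) // hop_size
    if n_frames == 0:
        raise ValueError("need at least one full frame of audio")
    start = time.perf_counter()
    for i in range(n_frames):
        process_frame(samples[i * hop_size:(i + 1) * hop_size])
    elapsed = time.perf_counter() - start
    # Only the frames that were actually processed count as audio time.
    audio_seconds = n_frames * hop_size / sample_rate
    return elapsed / audio_seconds


if __name__ == "__main__":
    placeholder = [0] * (SAMPLE_RATE * 2)  # two seconds of placeholder audio
    print(f"RTF: {real_time_factor(placeholder):.4f}")
```

An RTF of 0.015 means that one second of audio takes about 15 ms to process.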
 ### **4. Multiple programming languages and platforms:**
 TEN VAD provides cross-platform C compatibility across five operating systems (Linux x64, Windows, macOS, Android, iOS), Python bindings optimized for Linux x64, and a WASM build for the Web.
 ### **5. Supported sampling rate and hop size:**
 TEN VAD operates on 16kHz audio input with configurable hop sizes (optimized frame configurations: 160/256 samples=10/16ms). Audio at other sampling rates must be resampled to 16kHz first.
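
The relationship between hop size and frame duration can be checked with a few lines of Python. This is a minimal sketch, not part of the library: `hop_duration_ms` and `split_into_hops` are hypothetical helper names, and dropping a trailing partial hop is an assumption made here for simplicity.

```python
SAMPLE_RATE = 16000  # Hz, the input rate stated above


def hop_duration_ms(hop_size, sample_rate=SAMPLE_RATE):
    """Duration of one hop in milliseconds."""
    return 1000.0 * hop_size / sample_rate


def split_into_hops(samples, hop_size):
    """Split samples into consecutive full hops.

    Assumption: a trailing partial hop is dropped rather than padded.
    """
    n_full = len(samples) // hop_size
    return [samples[i * hop_size:(i + 1) * hop_size] for i in range(n_full)]


one_second = [0] * SAMPLE_RATE  # one second of placeholder 16 kHz audio
for hop in (160, 256):
    frames = split_into_hops(one_second, hop)
    # Prints "160 10.0 100" and then "256 16.0 62"
    print(hop, hop_duration_ms(hop), len(frames))
```

Each resulting frame would then be passed to the detector one at a time.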
 ## **Installation**
 ```
+git clone https://huggingface.co/TEN-framework/ten-vad
 ```
 ## **Quick Start**
 The project supports five major platforms with dynamic library linking.
     <td align="center"> libten_vad.so </td>
     <td align="center"> x64 </td>
     <td align="center"> Python, C </td>
+    <td rowspan="5" style="text-align: center; vertical-align: middle;">ten_vad.h <br> ten_vad.py</td>
     <td>  </td>
   </tr>
   <tr>
     <td align="center"> C </td>
     <td>  </td>
   </tr>
   <tr>
     <th align="center"> Android </th>
     <td align="center"> libten_vad.so </td>
   </tr>
   <tr>
     <th align="center"> iOS </th>
+    <td align="center" style="text-align: center; vertical-align: middle;"> ten_vad.framework </td>
+    <td align="center" style="text-align: center; vertical-align: middle;"> arm64 </td>
     <td align="center"> C </td>
     <td> 1. not simulator <br> 2. not iPad </td>
   </tr>
 </table>
 ### **Python Usage**
 #### **1. Linux**
 ```
 python test.py s0724-s0730.wav out.txt
 ```
 ##### **By using pip:**
 ```
 from ten_vad import TenVad
 ```
 ### **JS Usage**
 1) cd ./examples
 2) node test_node.js s0724-s0730.wav out.txt
 ```
 ### **C Usage**
 #### **Build Scripts**
 - Run demo with sample audio s0724-s0730.wav
 - Processed results saved to out.txt
 The detailed usage methods of each platform are as follows <br>
 ```
 Note: If you run the ONNX demo from a directory other than the one used when running build-and-deploy-linux.sh, be sure to create a symbolic link to src/onnx_model/ so that the ONNX model file can be loaded.
 ####  **2. Windows**
 ##### **Requirements**
     - Visual Studio version (default: 2019)
 3) ./build-and-deploy-windows.bat
 ```
 ####  **3. macOS**
 ##### **Requirements**
   - Alternative: x86_64 (Intel)
 3) ./build-and-deploy-mac.sh
 ```
 ####  **4. Android**
 ##### **Requirements**
   - Toolchain: aarch64-linux-android-clang (default) or custom NDK toolchain
 4) ./build-and-deploy-android.sh
 ```
 ####  **5. iOS**
 ##### **Requirements**
     3.5. Build in Xcode and run demo on your device.
 ## TEN Ecosystem
 Most questions can be answered with DeepWiki; it is fast, intuitive to use, and supports multiple languages.
 ## **Citations**
 ```
   email = {[email protected]}
 }
 ```
 ## License
 This project is licensed under Apache 2.0 with additional conditions. Refer to the "LICENSE" file in the root directory for detailed information. Note that `pitch_est.cc` contains modified code derived from [LPCNet](https://github.com/xiph/LPCNet), which is [BSD-2-Clause](https://spdx.org/licenses/BSD-2-Clause.html) and [BSD-3-Clause](https://spdx.org/licenses/BSD-3-Clause.html) licensed; refer to the NOTICES file in the root directory for detailed information.
 [back-to-top]: https://img.shields.io/badge/-Back_to_top-gray?style=flat-square