espnet/Turn_taking_prediction_SWBD


Siddhant committed 15 days ago

Commit 766378e (verified)

1 Parent(s): a9b08cf

Update README.md

Files changed (1)

README.md +22 -20


@@ -2,14 +2,13 @@
 tags:
 - espnet
 - audio
-- automatic-speech-recognition
 language: en
 datasets:
 - swbd
 license: cc-by-4.0
 ---
-## ESPnet2 ASR model
 ### `espnet/Turn_taking_prediction_SWBD`
@@ -28,6 +27,17 @@ cd egs2/swbd/asr1
 ./run.sh --skip_data_prep false --skip_train true --download_model espnet/Turn_taking_prediction_SWBD
 ```
 # RESULTS
 ## asr_train_asr_whisper_turn_taking_target_raw_en_word
@@ -259,6 +269,16 @@ distributed: true
 ### Citing ESPnet
 ```BibTex
 @inproceedings{watanabe2018espnet,
   author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
   title={{ESPnet}: End-to-End Speech Processing Toolkit},
@@ -269,22 +289,4 @@ distributed: true
   url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
 }
-```
-or arXiv:
-```bibtex
-@misc{watanabe2018espnet,
-  title={ESPnet: End-to-End Speech Processing Toolkit},
-  author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
-  year={2018},
-  eprint={1804.00015},
-  archivePrefix={arXiv},
-  primaryClass={cs.CL}
-}
 ```

 tags:
 - espnet
 - audio
 language: en
 datasets:
 - swbd
 license: cc-by-4.0
 ---
+## ESPnet2 Turn-taking model
 ### `espnet/Turn_taking_prediction_SWBD`
 ./run.sh --skip_data_prep false --skip_train true --download_model espnet/Turn_taking_prediction_SWBD
 ```
+Use the following Python code to run inference and obtain the probability of a turn-taking event every 40 milliseconds.
+```python
+import soundfile
+from espnet2.bin.asr_inference import Speech2Text
+# Load the model from its training config and averaged checkpoint.
+speech2text = Speech2Text(
+    "exp/asr_train_asr_whisper_turn_taking_raw_en_word/config.yaml",
+    "exp/asr_train_asr_whisper_turn_taking_raw_en_word/valid.loss.ave.pth",
+    device="cuda",
+    run_chunk=True,
+)
+# Placeholder path: replace with your own audio file.
+audio_path = "sample.wav"
+audio, rate = soundfile.read(audio_path)
+print(speech2text(audio)[0][0])
+```
 # RESULTS
 ## asr_train_asr_whisper_turn_taking_target_raw_en_word
 ### Citing ESPnet
 ```BibTex
+@inproceedings{
+arora2025talking,
+title={Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics},
+author={Siddhant Arora and Zhiyun Lu and Chung-Cheng Chiu and Ruoming Pang and Shinji Watanabe},
+booktitle={The Thirteenth International Conference on Learning Representations},
+year={2025},
+url={https://openreview.net/forum?id=2e4ECh0ikn}
+}
 @inproceedings{watanabe2018espnet,
   author={Shinji Watanabe and Takaaki Hori and Shigeki Karita and Tomoki Hayashi and Jiro Nishitoba and Yuya Unno and Nelson Yalta and Jahn Heymann and Matthew Wiesner and Nanxin Chen and Adithya Renduchintala and Tsubasa Ochiai},
   title={{ESPnet}: End-to-End Speech Processing Toolkit},
   url={http://dx.doi.org/10.21437/Interspeech.2018-1456}
 }
 ```
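
The README text added in this commit says the model emits a turn-taking probability every 40 milliseconds. As a minimal sketch of how such per-frame output could be mapped back to time, the helper below assumes the probabilities are available as a flat sequence of floats, one per 40 ms frame, and that frame `i` starts at `i * 40` ms. The README does not document the exact structure of `speech2text(audio)[0][0]`, so that shape, the function name `frames_above_threshold`, and the `0.5` threshold are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple

# 40 ms between predictions, as stated in the README.
FRAME_SHIFT_MS = 40


def frames_above_threshold(
    probs: Sequence[float],
    threshold: float = 0.5,  # hypothetical cutoff, not taken from the README
    frame_shift_ms: int = FRAME_SHIFT_MS,
) -> List[Tuple[int, float]]:
    """Return (start_time_ms, probability) for frames at or above the threshold.

    Assumes frame i covers the interval starting at i * frame_shift_ms.
    """
    return [
        (i * frame_shift_ms, p)
        for i, p in enumerate(probs)
        if p >= threshold
    ]


if __name__ == "__main__":
    # Made-up probabilities for four consecutive 40 ms frames.
    example = [0.1, 0.7, 0.3, 0.9]
    for start_ms, p in frames_above_threshold(example):
        print(f"{start_ms} ms: {p:.2f}")
```

Integer milliseconds are used for the timestamps to avoid floating-point rounding when multiplying by the frame shift.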