init

- README.md +71 -27
- pipeline/push_pipeline.py +2 -2

README.md CHANGED
@@ -15,8 +15,8 @@ widget:
 # Kotoba-Whisper-v2.2
 _Kotoba-Whisper-v2.2_ is a Japanese ASR model based on [kotoba-tech/kotoba-whisper-v2.0](https://huggingface.co/kotoba-tech/kotoba-whisper-v2.0), with
 additional postprocessing stacks integrated as [`pipeline`](https://huggingface.co/docs/transformers/en/main_classes/pipelines). The new features include
-(i)
-
+(i) speaker diarization with [diarizers](https://huggingface.co/diarizers-community/speaker-segmentation-fine-tuned-callhome-jpn)
+and (ii) adding punctuation with [punctuators](https://github.com/1-800-BAD-CODE/punctuators/tree/main).
 The pipeline has been developed through the collaboration between [Asahi Ushio](https://asahiushio.com) and [Kotoba Technologies](https://twitter.com/kotoba_tech).
 
 ## Transformers Usage

@@ -30,20 +30,33 @@ pip install "punctuators==0.0.5"
 pip install "pyannote.audio"
 pip install git+https://github.com/huggingface/diarizers.git
 ```
-Also,
-
-
+To load pre-trained diarization models from the Hub, you'll first need to accept the terms-of-use for the following two models:
+1. [pyannote/segmentation-3.0](https://hf.co/pyannote/segmentation-3.0)
+2. [pyannote/speaker-diarization-3.1](https://hf.co/pyannote/speaker-diarization-3.1)
+And subsequently use a Hugging Face authentication token to log in with:
+```
+huggingface-cli login
+```
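The same login can also be done programmatically, which is convenient in notebooks or CI jobs. A minimal sketch, assuming only that `huggingface_hub` is installed (it ships with `transformers`) and that a read-access token is available in an `HF_TOKEN` environment variable (the variable name is an assumption of this sketch):

```python
import os

from huggingface_hub import login

# Programmatic alternative to `huggingface-cli login`. Reading the token
# from an environment variable avoids hard-coding credentials; HF_TOKEN
# is a placeholder name assumed by this sketch.
login(token=os.environ["HF_TOKEN"])
```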
+
+### Transcription with Diarization
+The model can be used with the [`pipeline`](https://huggingface.co/docs/transformers/main_classes/pipelines#transformers.AutomaticSpeechRecognitionPipeline).
+
+- Download an audio sample.
+```shell
+wget https://huggingface.co/kotoba-tech/kotoba-whisper-v2.2/resolve/main/sample_audio/sample_diarization_japanese.mp3
+```
+
+- Run the model via pipeline.
 
 ```python
 import torch
 from transformers import pipeline
-from datasets import load_dataset
 
 # config
-model_id = "kotoba-tech/kotoba-whisper-v2.
+model_id = "kotoba-tech/kotoba-whisper-v2.2"
 torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
 device = "cuda:0" if torch.cuda.is_available() else "cpu"
 model_kwargs = {"attn_implementation": "sdpa"} if torch.cuda.is_available() else {}

@@ -58,35 +71,66 @@ pipe = pipeline(
     chunk_length_s=15,
     batch_size=16,
     trust_remote_code=True,
+    punctuator=False,
+    return_unique_speaker=True
 )
 
-# load sample audio
-dataset = load_dataset("japanese-asr/ja_asr.reazonspeech_test", split="test")
-sample = dataset[0]["audio"]
-
 # run inference
-result = pipe(
+result = pipe("sample_diarization_japanese.mp3", generate_kwargs=generate_kwargs)
 print(result)
+>>> {'chunks': [{'speaker': ['SPEAKER_02'],
+                 'text': 'そうですねこれも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども',
+                 'timestamp': (0.0, 5.0)},
+                {'speaker': ['SPEAKER_02'],
+                 'text': '今は屋外の気温',
+                 'timestamp': (5.0, 7.6)},
+                {'speaker': ['SPEAKER_02'],
+                 'text': '昼も夜も上がってますので空気の入れ替えだけでは',
+                 'timestamp': (7.6, 11.72)},
+                {'speaker': ['SPEAKER_02'],
+                 'text': 'かえって人が上がってきます',
+                 'timestamp': (11.72, 13.54)},
+                {'speaker': ['SPEAKER_02'],
+                 'text': 'やっぱり愚直にやっぱりその街の良さをアピールしていくっていう',
+                 'timestamp': (13.54, 17.24)},
+                {'speaker': ['SPEAKER_00'],
+                 'text': 'そういう姿勢が基本にあった上だのこういうPR作戦だと思うんです',
+                 'timestamp': (17.24, 23.84)}],
+    'chunks/SPEAKER_00': [{'speaker': ['SPEAKER_00'],
+                           'text': 'そういう姿勢が基本にあった上だのこういうPR作戦だと思うんです',
+                           'timestamp': (17.24, 23.84)}],
+    'chunks/SPEAKER_02': [{'speaker': ['SPEAKER_02'],
+                           'text': 'そうですねこれも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども',
+                           'timestamp': (0.0, 5.0)},
+                          {'speaker': ['SPEAKER_02'],
+                           'text': '今は屋外の気温',
+                           'timestamp': (5.0, 7.6)},
+                          {'speaker': ['SPEAKER_02'],
+                           'text': '昼も夜も上がってますので空気の入れ替えだけでは',
+                           'timestamp': (7.6, 11.72)},
+                          {'speaker': ['SPEAKER_02'],
+                           'text': 'かえって人が上がってきます',
+                           'timestamp': (11.72, 13.54)},
+                          {'speaker': ['SPEAKER_02'],
+                           'text': 'やっぱり愚直にやっぱりその街の良さをアピールしていくっていう',
+                           'timestamp': (13.54, 17.24)}],
+    'speakers': ['SPEAKER_00', 'SPEAKER_02'],
+    'text': 'そうですねこれも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども今は屋外の気温昼も夜も上がってますので空気の入れ替えだけではかえって人が上がってきますやっぱり愚直にやっぱりその街の良さをアピールしていくっていうそういう姿勢が基本にあった上だのこういうPR作戦だと思うんです',
+    'text/SPEAKER_00': 'そういう姿勢が基本にあった上だのこういうPR作戦だと思うんです',
+    'text/SPEAKER_02': 'そうですねこれも先ほどがずっと言っている自分の感覚的には大丈夫ですけれども今は屋外の気温昼も夜も上がってますので空気の入れ替えだけではかえって人が上がってきますやっぱり愚直にやっぱりその街の良さをアピールしていくっていう'}
 ```
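A minimal sketch of consuming this result, continuing from the example and assuming only the keys visible in the printed output above (`chunks`, `speakers`, and the per-speaker `text/<ID>` views):

```python
# Print a speaker-attributed transcript: each chunk carries a list of
# speakers, the transcribed text, and a (start, end) timestamp in seconds.
for chunk in result["chunks"]:
    start, end = chunk["timestamp"]
    print(f"[{start:6.2f}s-{end:6.2f}s] {chunk['speaker'][0]}: {chunk['text']}")

# Full per-speaker transcripts via the aggregated 'text/<ID>' keys.
for speaker in result["speakers"]:
    print(speaker, result[f"text/{speaker}"])
```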
 
-- To
+- To activate punctuator:
 ```diff
-```
-
-- To deactivate stable-ts:
-```diff
--    stable_ts=True,
-+    stable_ts=False,
+- punctuator=False,
++ punctuator=True,
 ```
 
-- To
+- To include more than one speaker per chunk:
 ```diff
+- return_unique_speaker=True
++ return_unique_speaker=False
 ```
 
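Putting the two switches together, a sketch of the pipeline construction with punctuation enabled and multiple speakers allowed per chunk; the keyword arguments not visible in the hunk (`model`, `torch_dtype`, `device`, `model_kwargs`) are assumed to match the config block in the usage example above:

```python
# Sketch: same pipeline as in the usage example, with both post-processing
# switches flipped. Arguments outside the visible diff are assumptions.
pipe = pipeline(
    model=model_id,
    torch_dtype=torch_dtype,
    device=device,
    model_kwargs=model_kwargs,
    chunk_length_s=15,
    batch_size=16,
    trust_remote_code=True,
    punctuator=True,              # add punctuation to each chunk
    return_unique_speaker=False,  # allow more than one speaker per chunk
)
```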
pipeline/push_pipeline.py CHANGED

@@ -14,8 +14,8 @@ PIPELINE_REGISTRY.register_pipeline(
     tf_model=TFWhisperForConditionalGeneration
 )
 pipe = pipeline(task="kotoba-whisper", model="kotoba-tech/kotoba-whisper-v2.0", chunk_length_s=15, batch_size=16)
-
-
+output = pipe(test_audio)
+pprint(output)
 pipe.push_to_hub(model_alias)
 
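For context, the script presumably reads as below after this change; only the tail is visible in the hunk, so the imports, the custom pipeline class, and the `test_audio` / `model_alias` values are assumptions reconstructed from the visible calls:

```python
# Sketch of pipeline/push_pipeline.py after the change. Names outside the
# visible hunk are assumptions: KotobaWhisperPipeline is taken to be the
# custom class this repo registers, and test_audio / model_alias are
# placeholders. TFWhisperForConditionalGeneration requires TensorFlow.
from pprint import pprint

from transformers import (TFWhisperForConditionalGeneration,
                          WhisperForConditionalGeneration, pipeline)
from transformers.pipelines import PIPELINE_REGISTRY

from kotoba_whisper import KotobaWhisperPipeline  # assumed local module

model_alias = "kotoba-tech/kotoba-whisper-v2.2"              # assumed target repo
test_audio = "sample_audio/sample_diarization_japanese.mp3"  # assumed sample file

PIPELINE_REGISTRY.register_pipeline(
    "kotoba-whisper",
    pipeline_class=KotobaWhisperPipeline,
    pt_model=WhisperForConditionalGeneration,
    tf_model=TFWhisperForConditionalGeneration,
)
pipe = pipeline(task="kotoba-whisper", model="kotoba-tech/kotoba-whisper-v2.0",
                chunk_length_s=15, batch_size=16)
output = pipe(test_audio)  # smoke-test the registered pipeline before pushing
pprint(output)
pipe.push_to_hub(model_alias)
```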