Update WER score + Publish Benchmark Results
README.md CHANGED
@@ -23,7 +23,7 @@ model-index:
     metrics:
     - name: Test WER
       type: wer
-      value: 23.
+      value: 23.80
 ---
 
 # Sinai Voice Arabic Speech Recognition Model
@@ -31,12 +31,136 @@ model-index:
 Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53)
 on Arabic using the [Common Voice](https://huggingface.co/datasets/common_voice)
 
-
-## Usage
+Most of the evaluation code in this documentation is inspired by [elgeish/wav2vec2-large-xlsr-53-arabic](https://huggingface.co/elgeish/wav2vec2-large-xlsr-53-arabic)
 
 Please install:
 - [PyTorch](https://pytorch.org/)
-- `$ pip3 install jiwer lang_trans torchaudio datasets transformers`
+- `$ pip3 install jiwer lang_trans torchaudio datasets transformers pandas tqdm`
+
+## Benchmark
+
+We evaluated the model against other Arabic-STT Wav2Vec models.
+
+|    | model                                 | using_transliteration |      WER |
+|---:|:--------------------------------------|:----------------------|---------:|
+|  0 | bakrianoo/sinai-voice-ar-stt          | True                  | 0.238001 |
+|  1 | elgeish/wav2vec2-large-xlsr-53-arabic | True                  | 0.266527 |
+|  2 | othrif/wav2vec2-large-xlsr-arabic     | True                  | 0.298122 |
+|  3 | bakrianoo/sinai-voice-ar-stt          | False                 | 0.448987 |
+|  4 | othrif/wav2vec2-large-xlsr-arabic     | False                 | 0.464004 |
+|  5 | anas/wav2vec2-large-xlsr-arabic       | True                  | 0.506191 |
+|  6 | anas/wav2vec2-large-xlsr-arabic       | False                 | 0.622288 |
+
+
+<details>
+<summary>We used the following <b>CODE</b> to generate the above results</summary>
+
+```python
+import jiwer
+import torch
+from tqdm.auto import tqdm
+import torchaudio
+from datasets import load_dataset
+from lang_trans.arabic import buckwalter
+from transformers import set_seed, Wav2Vec2ForCTC, Wav2Vec2Processor
+import pandas as pd
+
+# load test dataset
+set_seed(42)
+test_split = load_dataset("common_voice", "ar", split="test")
+
+# init sample rate resamplers
+resamplers = {  # all three sampling rates exist in test split
+    48000: torchaudio.transforms.Resample(48000, 16000),
+    44100: torchaudio.transforms.Resample(44100, 16000),
+    32000: torchaudio.transforms.Resample(32000, 16000),
+}
+
+# WER composer
+transformation = jiwer.Compose([
+    # normalize some diacritics, remove punctuation, and replace Persian letters with Arabic ones
+    jiwer.SubstituteRegexes({
+        r'[auiFNKo\~_،؟»\?;:\-,\.؛«!"]': "", "\u06D6": "",
+        r"[\|\{]": "A", "p": "h", "ک": "k", "ی": "y"}),
+    # default transformation below
+    jiwer.RemoveMultipleSpaces(),
+    jiwer.Strip(),
+    jiwer.SentencesToListOfWords(),
+    jiwer.RemoveEmptyStrings(),
+])
+
+def prepare_example(example):
+    speech, sampling_rate = torchaudio.load(example["path"])
+    if sampling_rate in resamplers:
+        example["speech"] = resamplers[sampling_rate](speech).squeeze().numpy()
+    else:
+        example["speech"] = resamplers[48000](speech).squeeze().numpy()  # fall back to the 48 kHz resampler
+    return example
+
+def predict(batch):
+    inputs = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding=True)
+    with torch.no_grad():
+        predicted = torch.argmax(model(inputs.input_values.to("cuda")).logits, dim=-1)
+    predicted[predicted == -100] = processor.tokenizer.pad_token_id  # see fine-tuning script
+    batch["predicted"] = processor.batch_decode(predicted)
+    return batch
+
+# prepare the test dataset
+test_split = test_split.map(prepare_example)
+
+stt_models = {
+    "elgeish/wav2vec2-large-xlsr-53-arabic",
+    "othrif/wav2vec2-large-xlsr-arabic",
+    "anas/wav2vec2-large-xlsr-arabic",
+    "bakrianoo/sinai-voice-ar-stt"
+}
+
+stt_results = []
+
+for model_path in tqdm(stt_models):
+    processor = Wav2Vec2Processor.from_pretrained(model_path)
+    model = Wav2Vec2ForCTC.from_pretrained(model_path).to("cuda").eval()
+
+    test_split_preds = test_split.map(predict, batched=True, batch_size=56, remove_columns=["speech"])
+
+    orig_metrics = jiwer.compute_measures(
+        truth=[s for s in test_split_preds["sentence"]],
+        hypothesis=[s for s in test_split_preds["predicted"]],
+        truth_transform=transformation,
+        hypothesis_transform=transformation,
+    )
+
+    trans_metrics = jiwer.compute_measures(
+        truth=[buckwalter.trans(s) for s in test_split_preds["sentence"]],  # Buckwalter transliteration
+        hypothesis=[buckwalter.trans(s) for s in test_split_preds["predicted"]],  # Buckwalter transliteration
+        truth_transform=transformation,
+        hypothesis_transform=transformation,
+    )
+
+    stt_results.append({
+        "model": model_path,
+        "using_transliteration": True,
+        "WER": trans_metrics["wer"]
+    })
+
+    stt_results.append({
+        "model": model_path,
+        "using_transliteration": False,
+        "WER": orig_metrics["wer"]
+    })
+
+    del model
+    del processor
+
+stt_results_df = pd.DataFrame(stt_results)
+stt_results_df = stt_results_df.sort_values('WER', axis=0, ascending=True)
+stt_results_df.head(n=50)
+
+```
+</details>
+
+
+## Usage
 
 The model can be used directly (without a language model) as follows:
 ```python
@@ -51,10 +175,15 @@ resamplers = { # all three sampling rates exist in test split
     44100: torchaudio.transforms.Resample(44100, 16000),
     32000: torchaudio.transforms.Resample(32000, 16000),
 }
+
 def prepare_example(example):
     speech, sampling_rate = torchaudio.load(example["path"])
-
+    if sampling_rate in resamplers:
+        example["speech"] = resamplers[sampling_rate](speech).squeeze().numpy()
+    else:
+        example["speech"] = resamplers[48000](speech).squeeze().numpy()  # fall back to the 48 kHz resampler
     return example
+
 dataset = dataset.map(prepare_example)
 processor = Wav2Vec2Processor.from_pretrained("bakrianoo/sinai-voice-ar-stt")
 model = Wav2Vec2ForCTC.from_pretrained("bakrianoo/sinai-voice-ar-stt").eval()
@@ -103,9 +232,8 @@ predicted: أين المشكل
 reference: وَلِلَّهِ يَسْجُدُ مَا فِي السَّمَاوَاتِ وَمَا فِي الْأَرْضِ مِنْ دَابَّةٍ وَالْمَلَائِكَةُ وَهُمْ لَا يَسْتَكْبِرُونَ
 predicted: ولله يسجد ما في السماوات وما في الأرض من دابة والملائكة وهم لا يستكبرون
 ```
-## Evaluation
 
-
+## Evaluation
 
 The model can be evaluated as follows on the Arabic test data of Common Voice:
 ```python
@@ -122,10 +250,15 @@ resamplers = { # all three sampling rates exist in test split
     44100: torchaudio.transforms.Resample(44100, 16000),
     32000: torchaudio.transforms.Resample(32000, 16000),
 }
+
 def prepare_example(example):
     speech, sampling_rate = torchaudio.load(example["path"])
-
+    if sampling_rate in resamplers:
+        example["speech"] = resamplers[sampling_rate](speech).squeeze().numpy()
+    else:
+        example["speech"] = resamplers[48000](speech).squeeze().numpy()  # fall back to the 48 kHz resampler
     return example
+
 test_split = test_split.map(prepare_example)
 processor = Wav2Vec2Processor.from_pretrained("bakrianoo/sinai-voice-ar-stt")
 model = Wav2Vec2ForCTC.from_pretrained("bakrianoo/sinai-voice-ar-stt").to("cuda").eval()
@@ -141,8 +274,8 @@ test_split = test_split.map(predict, batched=True, batch_size=16, remove_columns
 transformation = jiwer.Compose([
     # normalize some diacritics, remove punctuation, and replace Persian letters with Arabic ones
     jiwer.SubstituteRegexes({
-        r'[auiFNKo…
-        r"[…
+        r'[auiFNKo\~_،؟»\?;:\-,\.؛«!"]': "", "\u06D6": "",
+        r"[\|\{]": "A", "p": "h", "ک": "k", "ی": "y"}),
     # default transformation below
     jiwer.RemoveMultipleSpaces(),
     jiwer.Strip(),
@@ -158,7 +291,7 @@ metrics = jiwer.compute_measures(
 )
 print(f"WER: {metrics['wer']:.2%}")
 ```
-**Test Result**: 23.
+**Test Result**: 23.80%
 
 
 ## Other Arabic Voice recognition Models
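A note on the `using_transliteration` column in the benchmark added above: the `trans_metrics` rows compare references and hypotheses after Buckwalter transliteration, and the first `SubstituteRegexes` rule then strips the ASCII letters that Buckwalter uses for diacritics, so fully vocalized references (like the Quranic reference/prediction pair shown in the README) no longer count as errors against unvocalized predictions. Below is a minimal sketch of that effect; it only assumes `lang_trans` (already in the install line) and Python's `re`, and the sample word is illustrative rather than taken from the dataset.

```python
import re

from lang_trans.arabic import buckwalter


def strip_buckwalter_diacritics(text: str) -> str:
    """Mirror the first SubstituteRegexes rule: drop Buckwalter diacritic letters and punctuation."""
    return re.sub(r'[auiFNKo\~_،؟»\?;:\-,\.؛«!"]', "", text)


with_diacritics = "وَلَد"    # "boy", written with short-vowel diacritics
without_diacritics = "ولد"   # the same word as it is usually typed

# Compared as raw Arabic script, the two strings differ, so word-level WER counts a substitution.
print(with_diacritics == without_diacritics)  # False

# After Buckwalter transliteration the diacritics become ASCII letters ("walad"),
# and stripping them leaves the same consonant skeleton for both spellings.
print(strip_buckwalter_diacritics(buckwalter.trans(with_diacritics)))     # wld
print(strip_buckwalter_diacritics(buckwalter.trans(without_diacritics)))  # wld
```

This is why every model in the table scores noticeably better with `using_transliteration = True`: the transliterated comparison ignores diacritics and the punctuation covered by the regex, while the `False` rows measure WER on the raw transcripts.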