---
datasets:
- rsalshalan/QASR
- DynamicSuperb/DialectIdentification_ADI17
language:
- ar
- en
metrics:
- bleu
- wer
- accuracy
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
pipeline_tag: audio-text-to-text
---
# TinyOctopus: Bilingual Audio Language Model
## Overview
**TinyOctopus** is a **Bilingual Audio Language Model (Audio-LLM)** that generates text from audio input. It uses **Distil-Whisper (distil-large-v3)** for audio encoding, a **cross-attention projection layer** for alignment, and **DeepSeek 1.5B** (`deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`) for text generation. TinyOctopus is optimized for tasks such as:
- **Bilingual Automatic Speech Recognition (ASR)**
- **Arabic to English Speech Translation**
- **Spoken Arabic Dialect Identification**
TinyOctopus follows the architecture described in the next section.
## Model Architecture
### **TinyOctopus integrates:**
1. **Distil-Whisper (distil-large-v3)** for encoding audio inputs.
2. **Cross-Attention Projection Layer** (trainable) to align audio features with textual representations.
3. **DeepSeek 1.5B** as the core language model for text generation.
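As an illustration of what the cross-attention step in item 2 computes, here is a minimal pure-Python sketch of single-head scaled dot-product cross-attention. It is not the model's actual layer: the function names, the toy dimensions, and the choice to use the raw audio frames as keys and values (with no learned projections) are assumptions made for readability.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, audio_frames):
    """Single-head scaled dot-product cross-attention (illustrative only).

    queries:      T text-side vectors, each of length d
    audio_frames: S audio-encoder vectors, each of length d
    Returns (outputs, weights): T mixed vectors and the T x S attention weights.
    Assumption: keys and values are the raw audio frames; a trained layer
    would apply learned projections first.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every audio frame, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, frame)) / math.sqrt(d)
                  for frame in audio_frames]
        attn = softmax(scores)
        # Weighted average of the audio frames.
        mixed = [sum(w * frame[j] for w, frame in zip(attn, audio_frames))
                 for j in range(d)]
        outputs.append(mixed)
        weights.append(attn)
    return outputs, weights

# Toy example: 2 queries attend over 3 audio frames of width 2.
queries = [[1.0, 0.0], [0.0, 1.0]]
audio_frames = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
outputs, weights = cross_attention(queries, audio_frames)
```

Each output row is a weighted average of the audio frames, so the number of output vectors follows the number of queries rather than the length of the audio.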
## Dataset
The model has been trained on multiple datasets to optimize its performance across different tasks:
- **[QASR Dataset](https://arxiv.org/pdf/2106.13000)**: QASR is the largest transcribed Arabic speech corpus, collected from the broadcast domain. It contains **2,000 hours of multi-dialect speech** sampled at **16kHz** from **Al Jazeera News Channel**, with lightly supervised transcriptions aligned with the audio segments. Unlike previous datasets, QASR includes **linguistically motivated segmentation, punctuation, speaker information**, and more. The dataset is suitable for **ASR, Arabic dialect identification, punctuation restoration, speaker identification, and NLP applications**. Additionally, a **130M-word language model dataset** is available to aid language modeling. Speech recognition models trained on QASR achieve competitive **WER** compared to the MGB-2 corpus, and it has been used for downstream tasks like **Named Entity Recognition (NER)** and **punctuation restoration**.
- **[ADI17 Dataset](https://swshon.github.io/pdf/shon_2020_adi17.pdf)**: ADI17 is a **large-scale Arabic Dialect Identification (DID) dataset**, collected from **YouTube videos** across **17 Arabic-speaking countries in the Middle East and North Africa**. It contains **3,000 hours of speech** for training DID systems and an additional **57 hours** for development and testing. The dataset is categorized into **short (<5s), medium (5-20s), and long (>20s) speech segments** for detailed evaluation. ADI17 enables state-of-the-art **dialect identification** and provides a robust evaluation platform. It has been benchmarked on **domain-mismatched conditions** using the Multi-Genre Broadcast 3 (MGB-3) test set.
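The ADI17 length categories above can be applied when reporting dialect-identification results per segment length. The sketch below assumes the boundaries exactly as stated (short under 5 s, medium from 5 s to 20 s inclusive, long over 20 s); `duration_bucket` and `accuracy_by_bucket` are hypothetical helper names, and the records are made up for illustration rather than taken from ADI17.

```python
from collections import defaultdict

def duration_bucket(seconds):
    """Map a segment duration in seconds to an ADI17-style length category."""
    if seconds < 0:
        raise ValueError("duration must be non-negative")
    if seconds < 5:
        return "short"
    if seconds <= 20:
        return "medium"
    return "long"

def accuracy_by_bucket(records):
    """Compute accuracy per length category.

    records: iterable of (duration_seconds, gold_label, predicted_label).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for seconds, gold, predicted in records:
        bucket = duration_bucket(seconds)
        total[bucket] += 1
        correct[bucket] += int(gold == predicted)
    return {bucket: correct[bucket] / total[bucket] for bucket in total}

# Made-up records for illustration only; not real ADI17 data.
records = [
    (3.2, "KSA", "KSA"),
    (4.9, "EGY", "KSA"),
    (12.0, "MOR", "MOR"),
    (25.5, "KSA", "KSA"),
]
scores = accuracy_by_bucket(records)
```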
## Installation & Usage
### **Install Dependencies**
```bash
pip install -r requirements.txt
```
## Inference
```python
from inference import transcribe
audio_path = "path/to/audio.wav" # Replace with your actual audio file
output = transcribe(audio_path, task="asr") # Options: "dialect", "asr", "translation"
print("Generated Text:", output)
```
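To run all three tasks on the same file, a small wrapper can loop over the task names listed in the comment above. This is only a sketch: `run_all_tasks` and `fake_transcribe` are hypothetical names, and the stand-in function exists solely so the snippet runs without the repository. In practice you would pass the real `transcribe` imported from `inference`.

```python
TASKS = ("asr", "translation", "dialect")

def run_all_tasks(audio_path, transcribe_fn, tasks=TASKS):
    """Call transcribe_fn once per task and collect the outputs by task name."""
    results = {}
    for task in tasks:
        if task not in TASKS:
            raise ValueError(f"unknown task: {task!r}")
        results[task] = transcribe_fn(audio_path, task=task)
    return results

def fake_transcribe(audio_path, task="asr"):
    """Stand-in for inference.transcribe so this sketch runs on its own."""
    return f"<{task} output for {audio_path}>"

results = run_all_tasks("path/to/audio.wav", fake_transcribe)
for task, text in results.items():
    print(f"{task}: {text}")
```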
---
### How to Try It?
You can test the model by uploading or recording your own audio files using the **Gradio demo**:
➡️ [Try the Model](https://53f0821919c4d2aa02.gradio.live)
---
## Evaluation Results
## ASR Performance (WER & Error Breakdown)
| **Tasks** | **WER (%)** | **Substitution (%)** | **Deletion (%)** | **Insertion (%)** |
|:------------------------------------:|:----------:|:--------------------:|:----------------:|:----------------:|
| **ASR_QASR (Arabic)** | **16.00** | **9.5** | **2.7** | **3.8** |
| **ASR_librispeech&tedlium (English)** | **4.50** | **3.0** | **0.8** | **0.7** |
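
In each row, the substitution, deletion, and insertion rates add up to the WER (for example, 9.5 + 2.7 + 3.8 = 16.00). The sketch below shows one common way to obtain such a breakdown from a word-level edit-distance alignment. The card does not state which scoring tool or text normalization was used, so `wer_breakdown` is a hypothetical helper and not the evaluation code behind the table.

```python
def wer_breakdown(reference, hypothesis):
    """Return WER and its substitution/deletion/insertion parts, in percent."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dp[i][j] = (edits, substitutions, deletions, insertions)
    # for aligning ref[:i] with hyp[:j].
    dp = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)          # delete every reference word
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)          # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]   # match costs nothing
                continue
            e, s, d, n = dp[i - 1][j - 1]
            substitution = (e + 1, s + 1, d, n)
            e, s, d, n = dp[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = dp[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            dp[i][j] = min(substitution, deletion, insertion, key=lambda t: t[0])
    edits, subs, dels, inss = dp[-1][-1]
    total = len(ref)
    return {
        "wer": 100.0 * edits / total,
        "substitution": 100.0 * subs / total,
        "deletion": 100.0 * dels / total,
        "insertion": 100.0 * inss / total,
    }

# One substitution ("it's" -> "its") and one deletion ("too") over 5 reference words.
result = wer_breakdown("no it's not too soon", "no its not soon")
print(result)
```

When several alignments have the same minimum cost, the split between error types depends on tie-breaking, which is one reason different tools can report slightly different breakdowns for the same WER.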
---
## Translation Performance (BLEU Scores)
| **Tasks** | **BLEU (GPT-4o)** | **BLEU (Google)** |
|:--------------:|:----------------:|:----------------:|
| **Translation** | **55.05** | **43.23** |
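
For readers unfamiliar with the metric, the following is a minimal sketch of corpus-level BLEU with up to 4-grams and a brevity penalty, using whitespace tokenization and no smoothing. The card does not say which tool, tokenization, or reference translations produced the scores above, so this is an illustration of the metric under those assumptions and will not reproduce the table. `corpus_bleu` and `ngrams` are hypothetical helper names.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(references, hypotheses, max_n=4):
    """Corpus BLEU (0 to 100) with one reference per hypothesis, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        r, h = ref.split(), hyp.split()
        ref_len += len(r)
        hyp_len += len(h)
        for n in range(1, max_n + 1):
            h_counts = ngrams(h, n)
            r_counts = ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often as it
            # appears in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity * math.exp(log_precision)

reference = "I took a loan a certain amount of money to pay off the debt"
perfect = corpus_bleu([reference], [reference])
partial = corpus_bleu([reference], ["I took a loan of money to pay off the debt"])
print(perfect, partial)
```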
---
## Dialect Identification Accuracy
| **Tasks** | **Accuracy (%)** |
|:--------------------------:|:---------------:|
| **Dialect Identification** | **70.59** |

---
## Examples
### Example 1: Arabic Speech Recognition
**Audio Input (Arabic)**:
<audio controls>
<source src="https://huggingface.co/SaraAlthubaiti/TinyOctopus/resolve/main/examples/03BD00C0_2C0B_4C81_BA8C_018175D0B4E3_utt_1_align.wav" type="audio/wav">
</audio>
**User Prompt**:
> Transcribe the audio
or
> قم بتفريغ المقطع الصوتي
**System Response** (English gloss: "Welcome, dear viewers, to a new episode of the program Economy and People"):
> أهلا بكم مشاهدينا الكرام في حلقة جديدة من برنامج الاقتصاد والناس
**Audio Input (English)**:
<audio controls>
<source src="https://huggingface.co/SaraAlthubaiti/TinyOctopus/resolve/main/examples/4970-29093-0016.wav" type="audio/wav">
</audio>
**User Prompt**:
> Transcribe the audio
or
> قم بتفريغ المقطع الصوتي
**System Response**:
> NO IT'S NOT TOO SOON
---
### Example 2: Arabic to English Translation
**Audio Input**:
<audio controls>
<source src="https://huggingface.co/SaraAlthubaiti/TinyOctopus/resolve/main/examples/03BD00C0_2C0B_4C81_BA8C_018175D0B4E3_utt_21_align.wav" type="audio/wav">
</audio>
**User Prompt**:
> Translate the following Arabic speech into English
or
> قم بترجمة المقطع للإنجليزية
**System Response**:
> I took a loan a certain amount of money to pay off the debt
---
### Example 3: Dialect Identification
**Audio Input**:
<audio controls>
<source src="https://huggingface.co/SaraAlthubaiti/TinyOctopus/resolve/main/examples/tYBpZAOFpvk_071631-073831.wav" type="audio/wav">
</audio>
**User Prompt**:
> Identify the dialect of the given speech
or
> ماهي لهجة المتحدث؟
**System Response**:
> KSA
---