---
datasets:
- rsalshalan/QASR
- DynamicSuperb/DialectIdentification_ADI17
language:
- ar
- en
metrics:
- bleu
- wer
- accuracy
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
pipeline_tag: audio-text-to-text
---
# TinyOctopus: Bilingual Audio Language Model 🐙🔊

## 📢 Overview
**TinyOctopus** is a **Bilingual Audio Language Model (Audio-LLM)** designed to process audio inputs and generate text. The model leverages **Distil-Whisper (distil-large-v3)** for audio encoding, a **cross-attention projection layer** for alignment, and **DeepSeek 1.5B** for text generation. TinyOctopus is optimized for tasks such as:

- **Bilingual Automatic Speech Recognition (ASR)** 🗣️
- **Arabic-to-English Speech Translation** 🌍
- **Spoken Arabic Dialect Identification**

TinyOctopus follows the architectural structure described below:

## ๐Ÿ— Model Architecture
### **TinyOctopus integrates:**
1. **Distil-Whisper (distil-large-v3)** for encoding audio inputs.
2. **Cross-Attention Projection Layer** (trainable) to align audio features with textual representations.
3. **DeepSeek 1.5B** as the core language model for text generation.
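
The projection layer aligns the encoder's frame-level features with the LLM's embedding space via cross-attention. Below is a minimal single-head sketch of that idea; the use of learned queries, all dimensions, and all weight shapes are illustrative assumptions, not the model's actual configuration:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_project(audio_feats, queries, Wq, Wk, Wv, Wo):
    """Map frame-level audio features into the LLM embedding space.

    audio_feats: (T, d_audio) outputs of the Whisper-style encoder
    queries:     (N, d_q) learned query vectors ("audio tokens")
    Returns:     (N, d_llm) embeddings the language model can consume
    """
    Q = queries @ Wq                          # (N, d_k)
    K = audio_feats @ Wk                      # (T, d_k)
    V = audio_feats @ Wv                      # (T, d_k)
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    return (attn @ V) @ Wo                    # (N, d_llm)

# Toy shapes only (hypothetical, for illustration)
rng = np.random.default_rng(0)
T, d_audio, d_q, d_k, d_llm, N = 50, 1280, 256, 256, 1536, 8
out = cross_attention_project(
    rng.standard_normal((T, d_audio)),
    rng.standard_normal((N, d_q)),
    rng.standard_normal((d_q, d_k)),
    rng.standard_normal((d_audio, d_k)),
    rng.standard_normal((d_audio, d_k)),
    rng.standard_normal((d_k, d_llm)),
)
print(out.shape)  # (8, 1536)
```

Each of the N outputs behaves like a text-token embedding, so the LLM can treat the audio as a short prefix of "soft tokens" prepended to the prompt.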

## 📂 Dataset
The model has been trained on multiple datasets to optimize its performance across different tasks:

- **[QASR Dataset](https://arxiv.org/pdf/2106.13000)**: QASR is the largest transcribed Arabic speech corpus, collected from the broadcast domain. It contains **2,000 hours of multi-dialect speech** sampled at **16kHz** from **Al Jazeera News Channel**, with lightly supervised transcriptions aligned with the audio segments. Unlike previous datasets, QASR includes **linguistically motivated segmentation, punctuation, speaker information**, and more. The dataset is suitable for **ASR, Arabic dialect identification, punctuation restoration, speaker identification, and NLP applications**. Additionally, a **130M-word language model dataset** is available to aid language modeling. Speech recognition models trained on QASR achieve competitive **WER** compared to the MGB-2 corpus, and it has been used for downstream tasks like **Named Entity Recognition (NER)** and **punctuation restoration**.

- **[ADI17 Dataset](https://swshon.github.io/pdf/shon_2020_adi17.pdf)**: ADI17 is a **large-scale Arabic Dialect Identification (DID) dataset**, collected from **YouTube videos** across **17 Arabic-speaking countries in the Middle East and North Africa**. It contains **3,000 hours of speech** for training DID systems and an additional **57 hours** for development and testing. The dataset is categorized into **short (<5s), medium (5-20s), and long (>20s) speech segments** for detailed evaluation. ADI17 enables state-of-the-art **dialect identification** and provides a robust evaluation platform. It has been benchmarked on **domain-mismatched conditions** using the Multi-Genre Broadcast 3 (MGB-3) test set.

## ⚙️ Installation & Usage
### **💻 Install Dependencies**
```bash
pip install -r requirements.txt
```
### Inference

```python
from inference import transcribe

audio_path = "path/to/audio.wav"  # Replace with your actual audio file
output = transcribe(audio_path, task="asr")  # Options: "dialect", "asr", "translation"

print("Generated Text:", output)
```
---

### How to Try It?
You can test the model by uploading or recording your own audio files using the **Gradio demo**:  
➡️ [Try the Model](https://53f0821919c4d2aa02.gradio.live)

---

## Evaluation Results

### ASR Performance (WER & Error Breakdown)
| **Tasks**                           | **WER (%)** | **Substitution (%)** | **Deletion (%)** | **Insertion (%)** |
|:------------------------------------:|:----------:|:--------------------:|:----------------:|:----------------:|
| **ASR_QASR (Arabic)**                | **16.00**  | **9.5**             | **2.7**          | **3.8**          |
| **ASR_LibriSpeech & TED-LIUM (English)** | **4.50**   | **3.0**             | **0.8**          | **0.7**          |
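
The substitution, deletion, and insertion columns come from a word-level Levenshtein alignment between reference and hypothesis transcripts, with WER = (S + D + I) / N over reference words. A minimal pure-Python sketch of that breakdown (illustrative only; actual evaluations typically use a toolkit such as `jiwer` or SCTK):

```python
def wer_breakdown(reference: str, hypothesis: str):
    """Return (WER %, substitutions, deletions, insertions)."""
    r, h = reference.split(), hypothesis.split()
    n, m = len(r), len(h)
    # dp[i][j] = minimum edits to turn r[:i] into h[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if r[i - 1] == h[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1],  # substitution
                                   dp[i - 1][j],      # deletion
                                   dp[i][j - 1])      # insertion
    # Backtrace to count each error type
    subs = dels = ins = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and r[i - 1] == h[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1                       # correct word
        elif i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + 1:
            subs += 1; i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            dels += 1; i -= 1
        else:
            ins += 1; j -= 1
    wer = 100.0 * (subs + dels + ins) / max(n, 1)
    return wer, subs, dels, ins

print(wer_breakdown("a b c d", "a x c"))  # (50.0, 1, 1, 0)
```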

---

### Translation Performance (BLEU Scores)
| **Tasks**       | **BLEU (GPT-4o)** | **BLEU (Google)** |
|:--------------:|:----------------:|:----------------:|
| **Translation** | **55.05**        | **43.23**        |
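
BLEU combines clipped n-gram precisions with a brevity penalty. For intuition, here is a minimal sentence-level version with uniform weights; the scores reported above were presumably computed with a standard corpus-level toolkit such as sacreBLEU, so this sketch is illustrative only:

```python
import math
from collections import Counter

def bleu(reference: str, hypothesis: str, max_n: int = 4) -> float:
    """Sentence-level BLEU on a 0-100 scale with uniform n-gram weights."""
    ref, hyp = reference.split(), hypothesis.split()
    precisions = []
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        precisions.append(overlap / max(sum(hyp_ngrams.values()), 1))
    if min(precisions) == 0:
        return 0.0  # real implementations smooth instead of zeroing out
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish hypotheses shorter than the reference
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100.0 * bp * geo_mean

print(bleu("the cat sat on the mat", "the cat sat on the mat"))  # 100.0
```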

---

### Dialect Identification Accuracy
| **Tasks**                  | **Accuracy (%)** |
|:--------------------------:|:---------------:|
| **Dialect Identification** | **70.59**       |


![Confusion matrix of the ADI17 test set](https://huggingface.co/SaraAlthubaiti/TinyOctopus/resolve/main/images/CM_for_DI.png)

---

## Examples

### Example 1: Arabic Speech Recognition
🎵 **Audio Input (Arabic)**:  
<audio controls>
  <source src="https://huggingface.co/SaraAlthubaiti/TinyOctopus/resolve/main/examples/03BD00C0_2C0B_4C81_BA8C_018175D0B4E3_utt_1_align.wav" type="audio/wav">
</audio>

๐Ÿ“ **User Prompt**:  
> Transcribe the audio
or
> ู‚ู… ุจุชูุฑูŠุบ ุงู„ู…ู‚ุทุน ุงู„ุตูˆุชูŠ

๐Ÿ’ก **System Response**:
> ุฃู‡ู„ุง ุจูƒู… ู…ุดุงู‡ุฏูŠู†ุง ุงู„ูƒุฑุงู… ููŠ ุญู„ู‚ุฉ ุฌุฏูŠุฏุฉ ู…ู† ุจุฑู†ุงู…ุฌ ุงู„ุงู‚ุชุตุงุฏ ูˆุงู„ู†ุงุณ

🎵 **Audio Input (English)**:  
<audio controls>
  <source src="https://huggingface.co/SaraAlthubaiti/TinyOctopus/resolve/main/examples/4970-29093-0016.wav" type="audio/wav">
</audio>

๐Ÿ“ **User Prompt**:  
> Transcribe the audio
or
> ู‚ู… ุจุชูุฑูŠุบ ุงู„ู…ู‚ุทุน ุงู„ุตูˆุชูŠ

๐Ÿ’ก **System Response**:
> NO IT'S NOT TOO SOON

---

### Example 2: Arabic to English Translation
🎵 **Audio Input**:  
<audio controls>
  <source src="https://huggingface.co/SaraAlthubaiti/TinyOctopus/resolve/main/examples/03BD00C0_2C0B_4C81_BA8C_018175D0B4E3_utt_21_align.wav" type="audio/wav">
</audio>

๐Ÿ“ **User Prompt**:  
> Translate the following Arabic speech into English
or
> ู‚ู… ุจุชุฑุฌู…ุฉ ุงู„ู…ู‚ุทุน ู„ู„ุฅู†ุฌู„ูŠุฒูŠุฉ

๐Ÿ’ก **System Response**: 
> I took a loan a certain amount of money to pay off the debt

---

### Example 3: Dialect Identification
🎵 **Audio Input**:  
<audio controls>
  <source src="https://huggingface.co/SaraAlthubaiti/TinyOctopus/resolve/main/examples/tYBpZAOFpvk_071631-073831.wav" type="audio/wav">
</audio>

๐Ÿ“ **User Prompt**:  
> Identify the dialect of the given speech
or
> ู…ุงู‡ูŠ ู„ู‡ุฌุฉ ุงู„ู…ุชุญุฏุซุŸ

๐Ÿ’ก **System Response**:
> KSA

---