---
datasets:
- rsalshalan/QASR
- DynamicSuperb/DialectIdentification_ADI17
language:
- ar
- en
metrics:
- bleu
- wer
- accuracy
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
pipeline_tag: audio-text-to-text
---
# TinyOctopus: Bilingual Audio Language Model 🐙🔊

## 📢 Overview
**TinyOctopus** is a **Bilingual Audio Language Model (Audio-LLM)** that takes audio as input and generates text. It uses **Distil-Whisper (distil-large-v3)** for audio encoding, a **cross-attention projection layer** to align audio features with the language model, and **DeepSeek 1.5B** (`deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B`) for text generation. TinyOctopus targets the following tasks:

- **Bilingual Automatic Speech Recognition (ASR)** 🗣️
- **Arabic to English Speech Translation** 🌍
- **Spoken Arabic Dialect Identification**

TinyOctopus is built from the three components described below.

## 🏗 Model Architecture
### **TinyOctopus integrates:**
1. **Distil-Whisper (distil-large-v3)** for encoding audio inputs.
2. **Cross-Attention Projection Layer** (trainable) to align audio features with textual representations.
3. **DeepSeek 1.5B** as the core language model for text generation.
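
The card does not specify the internals of the projection layer, so the following is only a toy sketch of the general idea in plain Python: a small set of query vectors attends over the audio encoder's frame features, and the result lands in the language model's embedding dimension. The use of learned queries, a single attention head, all function names, and the tiny dimensions are assumptions made for illustration; the actual model presumably uses a tensor library and different shapes.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def cross_attention_project(audio_feats, queries, w_k, w_v):
    """Single-head cross-attention (hypothetical layout, not the released code).

    audio_feats: [num_frames][audio_dim]  encoder output
    queries:     [num_queries][llm_dim]   trainable query vectors (assumed)
    w_k, w_v:    [audio_dim][llm_dim]     trainable projections (assumed)
    Returns [num_queries][llm_dim]: audio summarized in the LLM's width.
    """
    keys = matmul(audio_feats, w_k)
    values = matmul(audio_feats, w_v)
    scale = math.sqrt(len(queries[0]))
    # scores[i][j]: how strongly query i attends to audio frame j
    scores = matmul(queries, [list(col) for col in zip(*keys)])
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, values)


rng = random.Random(0)
audio = random_matrix(6, 8, rng)    # 6 audio frames with 8-dim features
queries = random_matrix(3, 4, rng)  # 3 queries in a 4-dim "LLM" space
w_k = random_matrix(8, 4, rng)
w_v = random_matrix(8, 4, rng)
projected = cross_attention_project(audio, queries, w_k, w_v)
print(len(projected), len(projected[0]))  # number of queries, LLM width
```

In a setup like this, only the projection parameters would need training, which matches the card's note that the projection layer is the trainable part.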

## 📂 Dataset
The model has been trained on multiple datasets to optimize its performance across different tasks:

- **[QASR Dataset](https://arxiv.org/pdf/2106.13000)**: QASR is described by its authors as the largest transcribed Arabic speech corpus, collected from the broadcast domain. It contains **2,000 hours of multi-dialect speech** sampled at **16 kHz** from the **Al Jazeera News Channel**, with lightly supervised transcriptions aligned with the audio segments. Unlike previous datasets, QASR includes **linguistically motivated segmentation, punctuation, speaker information**, and more. The dataset is suitable for **ASR, Arabic dialect identification, punctuation restoration, speaker identification, and NLP applications**. A **130M-word language model dataset** is also available to aid language modeling. Speech recognition models trained on QASR achieve a competitive **WER** compared to models trained on the MGB-2 corpus, and the dataset has been used for downstream tasks such as **Named Entity Recognition (NER)** and **punctuation restoration**.

- **[ADI17 Dataset](https://swshon.github.io/pdf/shon_2020_adi17.pdf)**: ADI17 is a **large-scale Arabic Dialect Identification (DID) dataset**, collected from **YouTube videos** across **17 Arabic-speaking countries in the Middle East and North Africa**. It contains **3,000 hours of speech** for training DID systems and an additional **57 hours** for development and testing. The dataset is categorized into **short (<5s), medium (5-20s), and long (>20s) speech segments** for detailed evaluation. ADI17 enables state-of-the-art **dialect identification** and provides a robust evaluation platform. It has been benchmarked on **domain-mismatched conditions** using the Multi-Genre Broadcast 3 (MGB-3) test set.
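
The ADI17 duration categories above can be expressed as a small helper, for example to report dialect accuracy per bucket. This is a minimal sketch: the function name is hypothetical, and assigning exactly 5 s and exactly 20 s to the medium bucket is an assumption, since the description only gives the ranges.

```python
def duration_bucket(seconds):
    """Map a segment duration to the ADI17 short/medium/long categories."""
    if seconds < 0:
        raise ValueError("duration must be non-negative")
    if seconds < 5:
        return "short"    # < 5 s
    if seconds <= 20:
        return "medium"   # 5-20 s (boundary handling is an assumption)
    return "long"         # > 20 s


for length in (2.0, 12.5, 31.0):
    print(length, duration_bucket(length))
```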

## ⚙️ Installation & Usage
### **💻 Install Dependencies**
```bash
pip install -r requirements.txt
```
### Inference

```python
from inference import transcribe

audio_path = "path/to/audio.wav"  # Replace with your actual audio file
output = transcribe(audio_path, task="asr")  # Options: "dialect", "asr", "translation"

print("Generated Text:", output)
```
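To run all three tasks on the same file, a thin wrapper around `transcribe` is enough. The sketch below assumes only the call shape shown above (`transcribe(audio_path, task=...)`); the helper name `run_all_tasks` is hypothetical, and the function is passed in as an argument so the wrapper can be tried with a stand-in before the model is loaded.

```python
TASKS = ("asr", "translation", "dialect")


def run_all_tasks(audio_path, transcribe_fn):
    """Call transcribe_fn once per task and collect the outputs by task name."""
    return {task: transcribe_fn(audio_path, task=task) for task in TASKS}


# With the repository's function (requires the model and its dependencies):
# from inference import transcribe
# results = run_all_tasks("path/to/audio.wav", transcribe)
# for task, text in results.items():
#     print(task, "->", text)
```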
---

### How to Try It?
You can test the model by uploading or recording your own audio files using the **Gradio demo**:  
➡️ [Try the Model](https://53f0821919c4d2aa02.gradio.live)

---

## Evaluation Results

### ASR Performance (WER & Error Breakdown)
| **Tasks**                           | **WER (%)** | **Substitution (%)** | **Deletion (%)** | **Insertion (%)** |
|:------------------------------------:|:----------:|:--------------------:|:----------------:|:----------------:|
| **ASR_QASR (Arabic)**                | **16.00**  | **9.5**             | **2.7**          | **3.8**          |
| **ASR_librispeech&tedlium (English)** | **4.50**   | **3.0**             | **0.8**          | **0.7**          |
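
WER is the sum of the substitution, deletion, and insertion rates, each measured against the number of reference words, and the rows above are consistent with that (9.5 + 2.7 + 3.8 = 16.0 and 3.0 + 0.8 + 0.7 = 4.5). The following is a minimal sketch of how such a breakdown can be computed for one utterance with a word-level edit distance. The function name is hypothetical, and whitespace tokenization with no text normalization is an assumption; the card does not describe the scoring pipeline actually used.

```python
def wer_breakdown(reference, hypothesis):
    """Return WER and its substitution/deletion/insertion parts, in percent."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Each cell holds (total edits, substitutions, deletions, insertions)
    # for aligning ref[:i] with hyp[:j].
    table = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        table[i][0] = (i, 0, i, 0)  # hypothesis empty: delete everything
    for j in range(1, len(hyp) + 1):
        table[0][j] = (j, 0, 0, j)  # reference empty: insert everything
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                table[i][j] = table[i - 1][j - 1]  # match, no new edit
                continue
            t, s, d, n = table[i - 1][j - 1]
            substitute = (t + 1, s + 1, d, n)
            t, s, d, n = table[i - 1][j]
            delete = (t + 1, s, d + 1, n)
            t, s, d, n = table[i][j - 1]
            insert = (t + 1, s, d, n + 1)
            table[i][j] = min(substitute, delete, insert)
    total, subs, dels, ins = table[-1][-1]
    words = len(ref)
    return {
        "wer": 100.0 * total / words,
        "substitution": 100.0 * subs / words,
        "deletion": 100.0 * dels / words,
        "insertion": 100.0 * ins / words,
    }


print(wer_breakdown("a b c d", "a x c"))
```

Corpus-level figures like those in the table are normally obtained by summing the edit counts and reference lengths over all utterances before dividing, rather than by averaging per-utterance percentages.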

---

### Translation Performance (BLEU Scores)
| **Tasks**       | **BLEU (GPT-4o)** | **BLEU (Google)** |
|:--------------:|:----------------:|:----------------:|
| **Translation** | **55.05**        | **43.23**        |

---

### Dialect Identification Accuracy
| **Tasks**                  | **Accuracy (%)** |
|:--------------------------:|:---------------:|
| **Dialect Identification** | **70.59**       |


![Confusion matrix on the ADI17 test set](https://huggingface.co/SaraAlthubaiti/TinyOctopus/resolve/main/images/CM_for_DI.png)
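
An accuracy figure and a confusion matrix like the ones above can both be derived from paired lists of true and predicted dialect labels. The sketch below is illustrative only: the function name is hypothetical, and the labels in the usage lines are made-up examples, not results from the model.

```python
from collections import Counter


def accuracy_and_confusion(true_labels, predicted_labels):
    """Return accuracy in percent and a Counter keyed by (true, predicted)."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("need at least one example")
    confusion = Counter(zip(true_labels, predicted_labels))
    # Diagonal cells of the confusion matrix are the correct predictions.
    correct = sum(n for (truth, pred), n in confusion.items() if truth == pred)
    return 100.0 * correct / len(true_labels), confusion


# Made-up labels for illustration only.
truth = ["KSA", "KSA", "EGY", "EGY"]
preds = ["KSA", "EGY", "EGY", "EGY"]
accuracy, confusion = accuracy_and_confusion(truth, preds)
print(accuracy, dict(confusion))
```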

---

## Examples

### Example 1: Speech Recognition (Arabic and English)
🎵 **Audio Input (Arabic)**:  
<audio controls>
  <source src="https://huggingface.co/SaraAlthubaiti/TinyOctopus/resolve/main/examples/03BD00C0_2C0B_4C81_BA8C_018175D0B4E3_utt_1_align.wav" type="audio/wav">
</audio>

📝 **User Prompt**:  
> Transcribe the audio

or

> قم بتفريغ المقطع الصوتي

💡 **System Response**:
> أهلا بكم مشاهدينا الكرام في حلقة جديدة من برنامج الاقتصاد والناس

🎵 **Audio Input (English)**:  
<audio controls>
  <source src="https://huggingface.co/SaraAlthubaiti/TinyOctopus/resolve/main/examples/4970-29093-0016.wav" type="audio/wav">
</audio>

📝 **User Prompt**:  
> Transcribe the audio

or

> قم بتفريغ المقطع الصوتي

💡 **System Response**:
> NO IT'S NOT TOO SOON

---

### Example 2: Arabic to English Translation
🎵 **Audio Input**:  
<audio controls>
  <source src="https://huggingface.co/SaraAlthubaiti/TinyOctopus/resolve/main/examples/03BD00C0_2C0B_4C81_BA8C_018175D0B4E3_utt_21_align.wav" type="audio/wav">
</audio>

📝 **User Prompt**:  
> Translate the following Arabic speech into English

or

> قم بترجمة المقطع للإنجليزية

💡 **System Response**: 
> I took a loan a certain amount of money to pay off the debt

---

### Example 3: Dialect Identification
🎵 **Audio Input**:  
<audio controls>
  <source src="https://huggingface.co/SaraAlthubaiti/TinyOctopus/resolve/main/examples/tYBpZAOFpvk_071631-073831.wav" type="audio/wav">
</audio>

📝 **User Prompt**:  
> Identify the dialect of the given speech

or

> ماهي لهجة المتحدث؟

💡 **System Response**:
> KSA

---