---
datasets:
- rsalshalan/QASR
- DynamicSuperb/DialectIdentification_ADI17
language:
- ar
- en
metrics:
- bleu
- wer
- accuracy
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
pipeline_tag: audio-text-to-text
---

# TinyOctopus: Bilingual Audio Language Model

## Overview

**TinyOctopus** is a **Bilingual Audio Language Model (Audio-LLM)** for Arabic and English that takes audio as input and generates text. It uses **Distil-Whisper (distil-large-v3)** for audio encoding, a **cross-attention projection layer** for alignment, and **DeepSeek-R1-Distill-Qwen-1.5B** (referred to below as DeepSeek 1.5B) for text generation. TinyOctopus is optimized for the following tasks:

- **Bilingual Automatic Speech Recognition (ASR)**
- **Arabic to English Speech Translation**
- **Spoken Arabic Dialect Identification**

TinyOctopus follows the architecture described in the next section.

## Model Architecture

### **TinyOctopus integrates:**

1. **Distil-Whisper (distil-large-v3)** for encoding audio inputs.
2. **Cross-Attention Projection Layer** (trainable) to align audio features with textual representations.
3. **DeepSeek 1.5B** as the core language model for text generation.

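The card does not describe the projection layer beyond its name, so the following is only a minimal sketch of single-head cross-attention in plain Python. It assumes a small set of trainable query vectors that attend over the audio encoder's frame features; the function names, toy dimensions, and random weights are all illustrative assumptions, not TinyOctopus's actual implementation.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, audio_frames, w_q, w_k, w_v):
    """Single-head cross-attention: every query attends over all audio frames."""
    keys = [matvec(w_k, frame) for frame in audio_frames]
    values = [matvec(w_v, frame) for frame in audio_frames]
    scale = math.sqrt(len(keys[0]))
    outputs, attention = [], []
    for query in queries:
        q = matvec(w_q, query)
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors gives one output vector per query.
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
        attention.append(weights)
    return outputs, attention


def random_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
# Toy sizes chosen for readability; they are NOT the real model dimensions.
AUDIO_DIM, LLM_DIM, NUM_FRAMES, NUM_QUERIES = 8, 6, 5, 3
audio_frames = random_matrix(NUM_FRAMES, AUDIO_DIM, rng)  # stand-in for encoder output
queries = random_matrix(NUM_QUERIES, LLM_DIM, rng)        # assumed trainable queries
w_q = random_matrix(LLM_DIM, LLM_DIM, rng)
w_k = random_matrix(LLM_DIM, AUDIO_DIM, rng)
w_v = random_matrix(LLM_DIM, AUDIO_DIM, rng)

projected, attn = cross_attention(queries, audio_frames, w_q, w_k, w_v)
print(len(projected), len(projected[0]))
```

Under these assumptions, each of the `NUM_QUERIES` output vectors has the language model's embedding width, so the vectors could be placed in front of the text prompt embeddings.
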
## Dataset

The model was trained on the following datasets:

- **[QASR Dataset](https://arxiv.org/pdf/2106.13000)**: QASR is the largest transcribed Arabic speech corpus, collected from the broadcast domain. It contains **2,000 hours of multi-dialect speech** sampled at **16 kHz** from **Al Jazeera News Channel**, with lightly supervised transcriptions aligned with the audio segments. Unlike previous datasets, QASR includes **linguistically motivated segmentation, punctuation, and speaker information**. The dataset is suitable for **ASR, Arabic dialect identification, punctuation restoration, speaker identification, and NLP applications**. Additionally, a **130M-word language model dataset** is available to aid language modeling. Speech recognition models trained on QASR achieve competitive **WER** compared to the MGB-2 corpus, and it has been used for downstream tasks like **Named Entity Recognition (NER)** and **punctuation restoration**.

- **[ADI17 Dataset](https://swshon.github.io/pdf/shon_2020_adi17.pdf)**: ADI17 is a **large-scale Arabic Dialect Identification (DID) dataset**, collected from **YouTube videos** across **17 Arabic-speaking countries in the Middle East and North Africa**. It contains **3,000 hours of speech** for training DID systems and an additional **57 hours** for development and testing. The dataset is categorized into **short (<5s), medium (5-20s), and long (>20s) speech segments** for detailed evaluation. ADI17 enables state-of-the-art **dialect identification** and provides a robust evaluation platform. It has been benchmarked on **domain-mismatched conditions** using the Multi-Genre Broadcast 3 (MGB-3) test set.

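The three ADI17 duration categories can be expressed as a small helper, for example when reporting accuracy per segment length. This is a sketch: `duration_bucket` is a hypothetical name, the durations in the usage lines are made up, and placing exactly 5 s and 20 s in the medium bucket is an assumption based on the "5-20s" range above.

```python
from collections import Counter


def duration_bucket(seconds: float) -> str:
    """Map a segment duration in seconds to an ADI17-style length category."""
    if seconds < 0:
        raise ValueError("duration must be non-negative")
    if seconds < 5:
        return "short"
    if seconds <= 20:  # assumption: both endpoints of "5-20s" count as medium
        return "medium"
    return "long"


# Made-up durations, used only to show the grouping.
durations = [2.1, 4.9, 5.0, 12.5, 20.0, 37.8]
counts = Counter(duration_bucket(d) for d in durations)
print(dict(counts))
```
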
## Installation & Usage

### **Install Dependencies**

```bash
pip install -r requirements.txt
```

## Inference

```python
from inference import transcribe

audio_path = "path/to/audio.wav"  # Replace with your actual audio file
output = transcribe(audio_path, task="asr")  # Options: "dialect", "asr", "translation"

print("Generated Text:", output)
```

---

### How to Try It?

You can test the model by uploading or recording your own audio files using the **Gradio demo**:

[Try the Model](https://53f0821919c4d2aa02.gradio.live)

---

## Evaluation Results

## ASR Performance (WER & Error Breakdown)

| **Tasks** | **WER (%)** | **Substitution (%)** | **Deletion (%)** | **Insertion (%)** |
|:------------------------------------:|:----------:|:--------------------:|:----------------:|:----------------:|
| **ASR_QASR (Arabic)** | **16.00** | **9.5** | **2.7** | **3.8** |
| **ASR_LibriSpeech&TED-LIUM (English)** | **4.50** | **3.0** | **0.8** | **0.7** |

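In both rows the three error rates add up to the WER (9.5 + 2.7 + 3.8 = 16.0 and 3.0 + 0.8 + 0.7 = 4.5), which is what a word-level edit-distance alignment produces. The card does not say which scoring tool or text normalization was used, so the following is only a minimal sketch of how such a breakdown can be computed; `wer_breakdown` is a hypothetical helper and the sentences are made up.

```python
def wer_breakdown(reference: str, hypothesis: str) -> dict:
    """Word error rate (%) split into substitution, deletion, and insertion rates."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    def add(cell, step):
        return tuple(a + b for a, b in zip(cell, step))

    # Each cell holds (edits, substitutions, deletions, insertions).
    SUB, DEL, INS = (1, 1, 0, 0), (1, 0, 1, 0), (1, 0, 0, 1)
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = add(dp[i - 1][0], DEL)  # only deletions reach an empty hypothesis
    for j in range(1, len(hyp) + 1):
        dp[0][j] = add(dp[0][j - 1], INS)  # only insertions from an empty reference
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            diagonal = dp[i - 1][j - 1]
            if ref[i - 1] != hyp[j - 1]:
                diagonal = add(diagonal, SUB)
            # Tuples compare element by element, so fewest total edits wins first.
            dp[i][j] = min(diagonal,
                           add(dp[i - 1][j], DEL),
                           add(dp[i][j - 1], INS))

    edits, subs, dels, ins = dp[len(ref)][len(hyp)]
    n = len(ref)
    return {
        "wer": 100 * edits / n,
        "substitution": 100 * subs / n,
        "deletion": 100 * dels / n,
        "insertion": 100 * ins / n,
    }


# Made-up sentences: one substitution, one deletion, one insertion.
result = wer_breakdown("the cat sat on the mat", "the cat sit on mat today")
print(result)
```

When several alignments have the same number of edits, the split between substitutions, deletions, and insertions depends on the tie-breaking rule, so breakdowns from different tools may differ slightly even when the WER matches.
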

---

## Translation Performance (BLEU Scores)

| **Tasks** | **BLEU (GPT-4o)** | **BLEU (Google)** |
|:--------------:|:----------------:|:----------------:|
| **Translation** | **55.05** | **43.23** |

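The card does not state which BLEU implementation or tokenization produced these scores. As a rough illustration of what the metric measures, here is a minimal sketch of corpus-level BLEU with whitespace tokenization, a single reference per sentence, uniform weights over 1- to 4-grams, and the standard brevity penalty. `corpus_bleu` is a hypothetical helper and the reference sentence is invented for the example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU (0-100), one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clipped counts: an n-gram is credited at most as often as it
            # appears in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # no smoothing in this sketch
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


hypothesis = "I took a loan a certain amount of money to pay off the debt"
reference = "I borrowed a certain amount of money to pay off the debt"  # invented
score = corpus_bleu([hypothesis], [reference])
print(round(score, 2))
```

Scores from this sketch are not comparable to the table, since tokenization and smoothing choices change BLEU noticeably.
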

---

## Dialect Identification Accuracy

| **Tasks** | **Accuracy (%)** |
|:--------------------------:|:---------------:|
| **Dialect Identification** | **70.59** |



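Accuracy and the counts behind a confusion matrix like the one pictured can both be derived from paired gold and predicted labels. The sketch below assumes labels are plain strings such as the `KSA` output shown in the examples; `accuracy_and_confusion` is a hypothetical helper and the label lists are made up.

```python
from collections import Counter


def accuracy_and_confusion(gold, predicted):
    """Return accuracy (%) and a Counter keyed by (gold, predicted) label pairs."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    confusion = Counter(zip(gold, predicted))
    correct = sum(count for (g, p), count in confusion.items() if g == p)
    return 100 * correct / len(gold), confusion


# Made-up labels for illustration only.
gold_labels = ["KSA", "KSA", "EGY", "EGY"]
predicted_labels = ["KSA", "EGY", "EGY", "EGY"]
accuracy, confusion = accuracy_and_confusion(gold_labels, predicted_labels)
print(accuracy, dict(confusion))
```
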

---

## Examples

### Example 1: Speech Recognition (Arabic and English)

**Audio Input (Arabic)**:

<audio controls>
<source src="https://huggingface.co/SaraAlthubaiti/TinyOctopus/resolve/main/examples/03BD00C0_2C0B_4C81_BA8C_018175D0B4E3_utt_1_align.wav" type="audio/wav">
</audio>

**User Prompt**:

> Transcribe the audio

or

> قم بتفريغ المقطع الصوتي

**System Response**:

> أهلا بكم مشاهدينا الكرام في حلقة جديدة من برنامج الاقتصاد والناس

(English: "Welcome, dear viewers, to a new episode of the program 'The Economy and the People'.")

**Audio Input (English)**:

<audio controls>
<source src="https://huggingface.co/SaraAlthubaiti/TinyOctopus/resolve/main/examples/4970-29093-0016.wav" type="audio/wav">
</audio>

**User Prompt**:

> Transcribe the audio

or

> قم بتفريغ المقطع الصوتي

**System Response**:

> NO IT'S NOT TOO SOON


---

### Example 2: Arabic to English Translation

**Audio Input**:

<audio controls>
<source src="https://huggingface.co/SaraAlthubaiti/TinyOctopus/resolve/main/examples/03BD00C0_2C0B_4C81_BA8C_018175D0B4E3_utt_21_align.wav" type="audio/wav">
</audio>

**User Prompt**:

> Translate the following Arabic speech into English

or

> قم بترجمة المقطع للإنجليزية

**System Response**:

> I took a loan a certain amount of money to pay off the debt


---

### Example 3: Dialect Identification

**Audio Input**:

<audio controls>
<source src="https://huggingface.co/SaraAlthubaiti/TinyOctopus/resolve/main/examples/tYBpZAOFpvk_071631-073831.wav" type="audio/wav">
</audio>

**User Prompt**:

> Identify the dialect of the given speech

or

> ماهي لهجة المتحدث؟

**System Response**:

> KSA

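The prompts shown in the examples above can be collected into a small lookup keyed by the `task` values accepted by `transcribe`. This is only a sketch: `PROMPTS` and `build_prompt` are hypothetical names, and the card does not state whether `transcribe` builds its prompts this way.

```python
# Prompts copied from the examples; the structure around them is an assumption.
PROMPTS = {
    "asr": {
        "en": "Transcribe the audio",
        "ar": "قم بتفريغ المقطع الصوتي",
    },
    "translation": {
        "en": "Translate the following Arabic speech into English",
        "ar": "قم بترجمة المقطع للإنجليزية",
    },
    "dialect": {
        "en": "Identify the dialect of the given speech",
        "ar": "ماهي لهجة المتحدث؟",
    },
}


def build_prompt(task: str, language: str = "en") -> str:
    """Return the example prompt for a task in English ("en") or Arabic ("ar")."""
    if task not in PROMPTS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(PROMPTS)}")
    if language not in PROMPTS[task]:
        raise ValueError(f"unknown language {language!r}; expected 'en' or 'ar'")
    return PROMPTS[task][language]


for task_name in PROMPTS:
    print(task_name, "->", build_prompt(task_name))
```
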

---