---
datasets:
- rsalshalan/QASR
- DynamicSuperb/DialectIdentification_ADI17
language:
- ar
- en
metrics:
- bleu
- wer
- accuracy
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
pipeline_tag: audio-text-to-text
---
# TinyOctopus: Bilingual Audio Language Model 🐙🔊
## 📢 Overview
**TinyOctopus** is a **Bilingual Audio Language Model (Audio-LLM)** designed to process and generate text from audio inputs. The model leverages **Distil-Whisper (distil-large-v3)** for audio encoding, a **cross-attention projection layer** for alignment, and **DeepSeek 1.5B** for text generation. TinyOctopus is optimized for tasks such as:
- **Bilingual Automatic Speech Recognition (ASR)** 🗣️
- **Arabic to English Speech Translation** 🌍
- **Spoken Arabic Dialect Identification**
TinyOctopus follows the architectural structure outlined below:
## 🏗 Model Architecture
### **TinyOctopus integrates:**
1. **Distil-Whisper (distil-large-v3)** for encoding audio inputs.
2. **Cross-Attention Projection Layer** (trainable) to align audio features with textual representations (a code sketch follows this list).
3. **DeepSeek 1.5B** as the core language model for text generation.
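A minimal PyTorch sketch of how these three pieces might connect. Module names, the query count, and the dimensions are illustrative assumptions (Whisper large-family encoders emit 1280-dim states; Qwen-family 1.5B models use a 1536-dim hidden size), not the released implementation:
```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Illustrative cross-attention projection layer: learnable query tokens
    attend over the audio encoder states and emit embeddings in the LLM's space."""
    def __init__(self, audio_dim=1280, text_dim=1536, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, text_dim) * 0.02)
        self.audio_proj = nn.Linear(audio_dim, text_dim)
        self.cross_attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)

    def forward(self, audio_feats):              # (B, T, audio_dim) from Distil-Whisper
        kv = self.audio_proj(audio_feats)        # (B, T, text_dim)
        q = self.queries.unsqueeze(0).expand(audio_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, kv, kv)      # (B, num_queries, text_dim)
        return out  # prepended to the prompt embeddings fed into DeepSeek 1.5B
```
Since the card marks only the projection layer as trainable, the Whisper encoder and DeepSeek weights are presumably kept frozen during alignment training.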
## 📂 Dataset
The model has been trained on multiple datasets to optimize its performance across different tasks:
- **[QASR Dataset](https://arxiv.org/pdf/2106.13000)**: QASR is the largest transcribed Arabic speech corpus, collected from the broadcast domain. It contains **2,000 hours of multi-dialect speech** sampled at **16 kHz** from the **Al Jazeera News Channel**, with lightly supervised transcriptions aligned with the audio segments. Unlike previous datasets, QASR includes **linguistically motivated segmentation, punctuation, and speaker information**. The dataset is suitable for **ASR, Arabic dialect identification, punctuation restoration, speaker identification, and NLP applications**. Additionally, a **130M-word language model dataset** is available to aid language modeling. Speech recognition models trained on QASR achieve competitive **WER** compared to the MGB-2 corpus, and the corpus has been used for downstream tasks such as **Named Entity Recognition (NER)** and **punctuation restoration**.
- **[ADI17 Dataset](https://swshon.github.io/pdf/shon_2020_adi17.pdf)**: ADI17 is a **large-scale Arabic Dialect Identification (DID) dataset**, collected from **YouTube videos** across **17 Arabic-speaking countries in the Middle East and North Africa**. It contains **3,000 hours of speech** for training DID systems and an additional **57 hours** for development and testing. The dataset is categorized into **short (<5 s), medium (5-20 s), and long (>20 s) speech segments** for fine-grained evaluation (a bucketing sketch follows this list). ADI17 enables state-of-the-art **dialect identification** and provides a robust evaluation platform; it has been benchmarked under **domain-mismatched conditions** using the Multi-Genre Broadcast 3 (MGB-3) test set.
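For example, ADI17-style duration buckets can be reproduced when preparing evaluation data; a small sketch using the `soundfile` package (the file path is a placeholder):
```python
import soundfile as sf  # pip install soundfile

def adi17_bucket(wav_path: str) -> str:
    """Assign an ADI17-style duration bucket: short (<5 s), medium (5-20 s), long (>20 s)."""
    duration = sf.info(wav_path).duration  # clip length in seconds, read from the header
    if duration < 5:
        return "short"
    if duration <= 20:
        return "medium"
    return "long"

print(adi17_bucket("path/to/audio.wav"))  # e.g. "medium"
```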
## ⚙️ Installation & Usage
### **💻 Install Dependencies**
```bash
pip install -r requirements.txt
```
### **Inference**
```python
from inference import transcribe
audio_path = "path/to/audio.wav" # Replace with your actual audio file
output = transcribe(audio_path, task="asr") # Options: "dialect", "asr", "translation"
print("Generated Text:", output)
```
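To run all three tasks over a folder of clips, the same `transcribe` call can be looped; the directory name below is a placeholder:
```python
from pathlib import Path

from inference import transcribe

audio_dir = Path("audio_clips")  # placeholder: point this at your own recordings
for wav in sorted(audio_dir.glob("*.wav")):
    for task in ("asr", "translation", "dialect"):
        print(f"{wav.name} [{task}]: {transcribe(str(wav), task=task)}")
```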
---
### How to Try It?
You can test the model by uploading or recording your own audio files using the **Gradio demo**:
➡️ [Try the Model](https://53f0821919c4d2aa02.gradio.live)
---
## Evaluation Results
### ASR Performance (WER & Error Breakdown)
| **Tasks** | **WER (%)** | **Substitution (%)** | **Deletion (%)** | **Insertion (%)** |
|:------------------------------------:|:----------:|:--------------------:|:----------------:|:----------------:|
| **ASR_QASR (Arabic)** | **16.00** | **9.5** | **2.7** | **3.8** |
| **ASR_LibriSpeech & TED-LIUM (English)** | **4.50** | **3.0** | **0.8** | **0.7** |
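WER decomposes as (substitutions + deletions + insertions) divided by the number of reference words, which is why 9.5 + 2.7 + 3.8 gives the 16.00 above. A dependency-free sketch of that breakdown (the example sentences are illustrative):
```python
def wer_breakdown(ref: str, hyp: str) -> dict:
    """Levenshtein alignment over words, tracking (cost, subs, dels, ins)."""
    r, h = ref.split(), hyp.split()
    dp = [[(j, 0, 0, j) for j in range(len(h) + 1)]]   # empty ref -> j insertions
    for i in range(1, len(r) + 1):
        row = [(i, 0, i, 0)]                           # empty hyp -> i deletions
        for j in range(1, len(h) + 1):
            if r[i - 1] == h[j - 1]:
                row.append(dp[i - 1][j - 1])           # match: carry counts forward
            else:
                sub, dele, ins = dp[i - 1][j - 1], dp[i - 1][j], row[j - 1]
                row.append(min(
                    (sub[0] + 1, sub[1] + 1, sub[2], sub[3]),     # substitution
                    (dele[0] + 1, dele[1], dele[2] + 1, dele[3]), # deletion
                    (ins[0] + 1, ins[1], ins[2], ins[3] + 1),     # insertion
                ))
        dp.append(row)
    cost, s, d, i = dp[-1][-1]
    return {"wer": cost / len(r), "sub": s, "del": d, "ins": i}

print(wer_breakdown("no it's not too soon", "no it's not soon"))  # one deletion, WER 0.2
```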
---
### Translation Performance (BLEU Scores)
| **Tasks** | **BLEU (GPT-4o)** | **BLEU (Google Translate)** |
|:--------------:|:----------------:|:----------------:|
| **Translation** | **55.05** | **43.23** |
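BLEU scores of this kind are commonly computed with `sacrebleu`; the snippet below is illustrative (the sentence pair is a placeholder, and whether the reported numbers used this exact tool is an assumption):
```python
import sacrebleu  # pip install sacrebleu

# Placeholder hypothesis and reference; the table's scores come from the full test set.
hypotheses = ["I took a loan a certain amount of money to pay off the debt"]
references = [["I took out a loan for a certain amount of money to pay off the debt"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```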
---
### Dialect Identification Accuracy
| **Tasks** | **Accuracy (%)** |
|:--------------------------:|:---------------:|
| **Dialect Identification** | **70.59** |
![Confusion matrix of the ADI17 test set](https://huggingface.co/SaraAlthubaiti/TinyOctopus/resolve/main/images/CM_for_DI.png)
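Given per-clip predictions, the accuracy and a confusion matrix like the one above can be reproduced with scikit-learn; the labels below are placeholder country codes, not real results:
```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Placeholder predictions over three of the 17 ADI17 country labels.
y_true = ["KSA", "EGY", "KSA", "MOR"]
y_pred = ["KSA", "EGY", "MOR", "MOR"]

print("accuracy:", accuracy_score(y_true, y_pred))            # 0.75 on this toy sample
print(confusion_matrix(y_true, y_pred, labels=["EGY", "KSA", "MOR"]))
```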
---
## Examples
### Example 1: Arabic Speech Recognition
🎵 **Audio Input (Arabic)**:
<audio controls>
<source src="https://huggingface.co/SaraAlthubaiti/TinyOctopus/resolve/main/examples/03BD00C0_2C0B_4C81_BA8C_018175D0B4E3_utt_1_align.wav" type="audio/wav">
</audio>
📝 **User Prompt**:
> Transcribe the audio
or
> قم بتفريغ المقطع الصوتي
💡 **System Response**:
> أهلا بكم مشاهدينا الكرام في حلقة جديدة من برنامج الاقتصاد والناس
🎵 **Audio Input (English)**:
<audio controls>
<source src="https://huggingface.co/SaraAlthubaiti/TinyOctopus/resolve/main/examples/4970-29093-0016.wav" type="audio/wav">
</audio>
📝 **User Prompt**:
> Transcribe the audio
or
> قم بتفريغ المقطع الصوتي
💡 **System Response**:
> NO IT'S NOT TOO SOON
---
### Example 2: Arabic to English Translation
🎵 **Audio Input**:
<audio controls>
<source src="https://huggingface.co/SaraAlthubaiti/TinyOctopus/resolve/main/examples/03BD00C0_2C0B_4C81_BA8C_018175D0B4E3_utt_21_align.wav" type="audio/wav">
</audio>
📝 **User Prompt**:
> Translate the following Arabic speech into English
or
> قم بترجمة المقطع للإنجليزية
💡 **System Response**:
> I took a loan a certain amount of money to pay off the debt
---
### Example 3: Dialect Identification
🎵 **Audio Input**:
<audio controls>
<source src="https://huggingface.co/SaraAlthubaiti/TinyOctopus/resolve/main/examples/tYBpZAOFpvk_071631-073831.wav" type="audio/wav">
</audio>
📝 **User Prompt**:
> Identify the dialect of the given speech
or
> ماهي لهجة المتحدث؟
💡 **System Response**:
> KSA
---