Update README.md
Browse files
README.md
CHANGED
@@ -36,4 +36,10 @@ The model has been trained on multiple datasets to optimize its performance acro
|
|
36 |
|
37 |
- **[QASR Dataset](https://arxiv.org/pdf/2106.13000)**: QASR is the largest transcribed Arabic speech corpus, collected from the broadcast domain. It contains **2,000 hours of multi-dialect speech** sampled at **16kHz** from **Al Jazeera News Channel**, with lightly supervised transcriptions aligned with the audio segments. Unlike previous datasets, QASR includes **linguistically motivated segmentation, punctuation, speaker information**, and more. The dataset is suitable for **ASR, Arabic dialect identification, punctuation restoration, speaker identification, and NLP applications**. Additionally, a **130M-word language model dataset** is available to aid language modeling. Speech recognition models trained on QASR achieve competitive **WER** compared to the MGB-2 corpus, and it has been used for downstream tasks like **Named Entity Recognition (NER)** and **punctuation restoration**.
|
38 |
|
39 |
-
- **[ADI17 Dataset](https://swshon.github.io/pdf/shon_2020_adi17.pdf)**: ADI17 is a **large-scale Arabic Dialect Identification (DID) dataset**, collected from **YouTube videos** across **17 Arabic-speaking countries in the Middle East and North Africa**. It contains **3,000 hours of speech** for training DID systems and an additional **57 hours** for development and testing. The dataset is categorized into **short (<5s), medium (5-20s), and long (>20s) speech segments** for detailed evaluation. ADI17 enables state-of-the-art **dialect identification** and provides a robust evaluation platform. It has been benchmarked on **domain-mismatched conditions** using the Multi-Genre Broadcast 3 (MGB-3) test set.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
36 |
|
37 |
- **[QASR Dataset](https://arxiv.org/pdf/2106.13000)**: QASR is the largest transcribed Arabic speech corpus, collected from the broadcast domain. It contains **2,000 hours of multi-dialect speech** sampled at **16kHz** from **Al Jazeera News Channel**, with lightly supervised transcriptions aligned with the audio segments. Unlike previous datasets, QASR includes **linguistically motivated segmentation, punctuation, speaker information**, and more. The dataset is suitable for **ASR, Arabic dialect identification, punctuation restoration, speaker identification, and NLP applications**. Additionally, a **130M-word language model dataset** is available to aid language modeling. Speech recognition models trained on QASR achieve competitive **WER** compared to the MGB-2 corpus, and it has been used for downstream tasks like **Named Entity Recognition (NER)** and **punctuation restoration**.
|
38 |
|
39 |
+
- **[ADI17 Dataset](https://swshon.github.io/pdf/shon_2020_adi17.pdf)**: ADI17 is a **large-scale Arabic Dialect Identification (DID) dataset**, collected from **YouTube videos** across **17 Arabic-speaking countries in the Middle East and North Africa**. It contains **3,000 hours of speech** for training DID systems and an additional **57 hours** for development and testing. The dataset is categorized into **short (<5s), medium (5-20s), and long (>20s) speech segments** for detailed evaluation. ADI17 enables state-of-the-art **dialect identification** and provides a robust evaluation platform. It has been benchmarked on **domain-mismatched conditions** using the Multi-Genre Broadcast 3 (MGB-3) test set.
|
40 |
+
|
41 |
+
## ⚙️ Installation & Usage
|
42 |
+
### **💻 Install Dependencies**
|
43 |
+
```bash
|
44 |
+
pip install -r requirements.txt
|
45 |
+
|