---
language:
- ar
tags:
- text-classification
metrics:
- accuracy (balanced)
- F1 (weighted)
widget:
- text: "اسعدغيرك انت مو بس اسعدت العماله ترا اسعدتنا"
  example_title: "خليجي"
- text: " سبحان الله في الغيوم شكل قلب"
  example_title: "فصحي"
- text: "بلاش تحطي صور متبرجة ع صفحتك..."
  example_title: "خليجي"
- text: "و حضرتك طيبة و شكرا علي الكلام الحلو ده يا مبهجة..."
  example_title: "مصري"
---

# Dialectical-MSA-detection

## Model description

This model was trained on 108,173 manually annotated sentences of user-generated content (e.g. tweets and online comments) to classify Arabic text into one of two categories: 'Dialectical' or 'MSA' (i.e. Modern Standard Arabic).

## Training data

Dialectical-MSA-detection was trained on a subset of the [Arabic Online Commentary (AOC) dataset (Zaidan et al., 2011)](https://github.com/sjeblee/AOC). The AOC dataset was created by crawling the websites of three Arabic newspapers and extracting online articles and readers' comments.

## Training procedure

`xlm-roberta-base` was fine-tuned using the Hugging Face `Trainer` with the following hyperparameters:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    num_train_epochs=4,              # total number of training epochs
    learning_rate=2e-5,              # learning rate
    per_device_train_batch_size=32,  # batch size per device during training
    per_device_eval_batch_size=4,    # batch size for evaluation
    warmup_steps=0,                  # number of warmup steps for the learning rate scheduler
    weight_decay=0.02,               # strength of weight decay
)
```

## Eval results

The model was evaluated on a held-out 10% of the sentences (90-10 train-dev split). It reached an accuracy of 0.88 on the dev set.

## Limitations and bias

The model was trained on sentences from the online commentary domain. Other forms of user-generated text (UGT), such as tweets, may differ in their degree of dialectness.

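
Given the domain shift described above, it may be worth scoring the model on a small labelled sample from the target domain before relying on it. The sketch below is a plain-Python illustration of the two metrics listed in the card's metadata (balanced accuracy and weighted F1) under their usual definitions. The function names and the toy labels are hypothetical, and this is not the authors' evaluation script.

```python
from collections import Counter

LABELS = ("Dialectical", "MSA")  # the two classes named in the model description


def balanced_accuracy(y_true, y_pred, labels=LABELS):
    """Mean of per-class recall, so the majority class cannot dominate."""
    recalls = []
    for label in labels:
        support = sum(1 for t in y_true if t == label)
        if support == 0:
            continue  # skip classes absent from the gold labels
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls.append(hits / support)
    return sum(recalls) / len(recalls)


def weighted_f1(y_true, y_pred, labels=LABELS):
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = 0.0
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * support[label]
    return total / len(y_true)


# Hypothetical gold labels and predictions, for illustration only.
gold = ["Dialectical", "Dialectical", "Dialectical", "MSA"]
pred = ["Dialectical", "Dialectical", "MSA", "MSA"]

print(balanced_accuracy(gold, pred))
print(weighted_f1(gold, pred))
```

In practice, `pred` would hold the labels the classifier returns for the sampled sentences and `gold` the manual annotations for the same sentences. On the toy lists above, balanced accuracy is 5/6 and weighted F1 is 23/30.
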
### BibTeX entry and citation info

```bibtex
@article{saadany2022semi,
  title={A Semi-supervised Approach for a Better Translation of Sentiment in Dialectical Arabic UGT},
  author={Saadany, Hadeel and Orasan, Constantin and Mohamed, Emad and Tantawy, Ashraf},
  journal={arXiv preprint arXiv:2210.11899},
  year={2022}
}
```