---
language: fa
tags:
- persian
- mobilebert
license: apache-2.0
pipeline_tag: fill-mask
mask_token: '[MASK]'
widget:
- text: 'در همین لحظه که شما مشغول [MASK] این متن هستید، میلیونها دیتا در فضای آنلاین در حال تولید است. ما در لایف وب به جمعآوری، پردازش و تحلیل این کلان داده (Big Data) میپردازیم.'
---
# Lifeweb
### Shiraz Language Model
Welcome to Shiraz, the repository for Lifeweb's language model.
The first versions of our models are all trained on our own dataset, **Divan**, which contains more than **164 million documents** and more than **10B tokens**. The dataset has been meticulously normalized and deduplicated to keep it rich and comprehensive. A better dataset leads to a better model!
# Use Model
You can easily access the models using the sample code provided below.
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, FillMaskPipeline
# v1.0
model_name = "lifeweb-ai/shiraz"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
text = "در همین لحظه که شما مشغول خواندن این متن هستید، میلیونها دیتا در فضای آنلاین در حال تولید است. ما در لایف وب به جمعآوری، پردازش و تحلیل این کلان داده (Big Data) میپردازیم."
print(tokenizer.tokenize(text))
# ['در', 'همین', 'لحظه', 'که', 'شما', 'مشغول', 'خواندن', 'این', 'متن', 'هستید،', 'میلیون', '[zwnj]', 'ها', 'دیتا', 'در', 'فضای', 'انلاین', 'در', 'حال', 'تولید', 'است', '.', 'ما', 'در', 'لایف', 'وب', 'به', 'جمع', '[zwnj]', 'اوری', '##،', 'پردازش', 'و', 'تحلیل', 'این', 'کلان', 'داده', '(', 'big', 'data', ')', 'می', '[zwnj]', 'پردازیم', '.', '.']
# Fill-mask task: predict the token hidden behind [MASK]
text = "در همین لحظه که شما مشغول [MASK] این متن هستید، میلیونها دیتا در فضای آنلاین در حال تولید است. ما در لایف وب به جمعآوری، پردازش و تحلیل این کلان داده (Big Data) میپردازیم."
fill_mask = FillMaskPipeline(model=model, tokenizer=tokenizer)
result = fill_mask(text)
# result is a list of candidates sorted by score; print the best one
print(result[0])
#{'score': 0.3584367036819458, 'token': 5764, 'token_str': 'خواندن', 'sequence': 'در همین لحظه که شما مشغول خواندن این متن هستید، میلیون ها دیتا در فضای انلاین در حال تولید است. ما در لایف وب به جمع اوری، پردازش و تحلیل این کلان داده ( big data ) می پردازیم.'}
```
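The sample above prints only the best candidate. As a minimal sketch, the helper below asks the pipeline for several candidates at once through its `top_k` argument. The names `top_predictions` and `load_fill_mask` are hypothetical helpers introduced for illustration, not part of this repository, and the sketch assumes the input contains exactly one `[MASK]` token (with several masks the pipeline returns a nested list instead).

```python
def top_predictions(fill_mask, text, k=5):
    """Return the k highest-scoring (token, score) pairs for a single [MASK].

    `fill_mask` is any callable with the fill-mask pipeline interface,
    i.e. it accepts `top_k` and returns a list of dicts that contain
    the keys "token_str" and "score".
    """
    results = fill_mask(text, top_k=k)
    return [(r["token_str"], r["score"]) for r in results]


def load_fill_mask(model_name="lifeweb-ai/shiraz"):
    """Build a fill-mask pipeline (hypothetical convenience wrapper).

    transformers is imported lazily so that top_predictions can be
    used and tested without downloading the model.
    """
    from transformers import pipeline

    return pipeline("fill-mask", model=model_name)
```

For example, `top_predictions(load_fill_mask(), text, k=3)` would return the three most likely replacements for the masked word together with their scores.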
# Results
**Shiraz** is evaluated on three downstream NLP tasks: **NER**, **Sentiment Analysis**, and **Emotion Detection**. Shiraz is considerably faster, while its accuracy remains competitive. According to the [**MobileBERT paper**](https://arxiv.org/pdf/2004.02984.pdf), this architecture is 4.3× smaller and 5.5× faster than BERT-base.
The table below reports the macro F1 score for each task. Colab notebooks for each task are available as tutorials.

| Model | NER (Arman) | NER (Peyma) | Sentiment (Sentipers, multi) | Sentiment (Snappfood) | Emotion (Arman) |
|---|---|---|---|---|---|
| lifeweb-ai/tehran | 71.87% | 90.79% | 63.75% | 88.74% | 77.73% |
| lifeweb-ai/shiraz | 67.62% | 86.24% | 59.17% | 88.01% | 66.97% |
| sbunlp/fabert | 71.23% | 88.53% | 58.51% | 88.60% | 72.65% |
| ViraIntelligentDataMining/AriaBERT | 69.12% | 87.15% | 59.26% | 87.96% | 69.11% |
| HooshvareLab/bert-fa-zwnj-base | 67.49% | 85.73% | 59.61% | 87.58% | 59.27% |
| HooshvareLab/roberta-fa-zwnj-base | 69.73% | 86.21% | 56.23% | 87.19% | 57.96% |

If you have tested our models on a public dataset and would like to add your results to the table above, open a pull request or contact us. Please also make your code available online so that we can add a reference.
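For readers who want to report comparable numbers, here is a minimal sketch of the macro F1 metric used in the Results section: the F1 score is computed per label and then averaged with equal weight per label. This is a label-level illustration only. The exact evaluation protocol behind the reported scores (for example, entity-level scoring for NER) is not stated here, and the function name `macro_f1` and the labels used to call it are assumptions made for the example.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over all labels in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    # Count true positives, false positives, and false negatives per label.
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1
            fn[gold] += 1

    scores = []
    for label in set(y_true) | set(y_pred):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```

Because every label contributes equally, a rare class that the model handles poorly lowers the macro score as much as a frequent one would, which is why it is a common choice for imbalanced sentiment and emotion datasets.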
# Cite
You are welcome to use our language models in your work or research. If you do, we kindly ask you to cite them using the following entry:
```
@misc{Shiraz,
author = {Mehrdad Azizi and Reza Salehi Chegeni and Parisa Mousavi and Iman Hashemi},
title = {Optimizing Pre-trained BERT-based Models for Persian Language Processing},
year = {2024},
publisher = {LifeWeb}
}
```
# Contributors
- Mehrdad Azizi: [**Linkedin**](https://www.linkedin.com/in/mehrdad-azizi-50839489/), [**Github**](https://github.com/mehrazi)
- Reza Salehi Chegeni: [**Linkedin**](https://www.linkedin.com/in/reza-salehi-chegeni-6988ba271/), [**Github**](https://github.com/rezasalehichegeni)
- Parisa Mousavi: [**Linkedin**](https://www.linkedin.com/in/seyede-parisa-mousavi/), [**Github**](https://github.com/Mousavi-Parisa)
- Iman Hashemi: [**Linkedin**](https://www.linkedin.com/in/iman-hashemi-403738a5), [**Github**](https://github.com/hashemiiman)
- Lifeweb: [**HuggingFace**](https://huggingface.co/lifeweb-ai), [**Official Website**](https://lifewebco.com/), [**Linkedin**](https://www.linkedin.com/company/lifewebir/mycompany/)
# Releases
**v1.0 (2024-03-09)**
First version of the **Shiraz** model, trained on **Divan**.