Overview

Ukhbert is a pair of BERT-base transformer models for use with Classical Arabic texts. Specifically, it is used with reports composed of an isnad, the chain of narrators who transmitted the report, and a matn, the actual content. Ukhbert identifies the narrators in an isnad from the text of the report.

An example of matn and isnad

Consider the following text, a hadith (a report attributed to the Prophet Muhammad):

حَدَّثَنَا قُتَيْبَةُ بْنُ سَعِيدٍ، حَدَّثَنَا عَبْدُ الْوَهَّابِ، قَالَ سَمِعْتُ يَحْيَى بْنَ سَعِيدٍ، يَقُولُ أَخْبَرَنِي مُحَمَّدُ بْنُ إِبْرَاهِيمَ، أَنَّهُ سَمِعَ عَلْقَمَةَ بْنَ وَقَّاصٍ اللَّيْثِيَّ، يَقُولُ سَمِعْتُ عُمَرَ بْنَ الْخَطَّابِ ـ رضى الله عنه ـ يَقُولُ سَمِعْتُ رَسُولَ اللَّهِ صلى الله عليه وسلم يَقُولُ "إِنَّمَا الأَعْمَالُ بِالنِّيَّةِ، وَإِنَّمَا لاِمْرِئٍ مَا نَوَى، فَمَنْ كَانَتْ هِجْرَتُهُ إِلَى اللَّهِ وَرَسُولِهِ فَهِجْرَتُهُ إِلَى اللَّهِ وَرَسُولِهِ، وَمَنْ كَانَتْ هِجْرَتُهُ إِلَى دُنْيَا يُصِيبُهَا أَوِ امْرَأَةٍ يَتَزَوَّجُهَا، فَهِجْرَتُهُ إِلَى مَا هَاجَرَ إِلَيْهِ".

Narrated `Umar bin Al-Khattab: I heard Allah's Messenger (ﷺ) saying, "The (reward of) deeds, depend upon the intentions and every person will get the reward according to what he has intended. So whoever emigrated for the sake of Allah and His Apostle, then his emigration will be considered to be for Allah and His Apostle, and whoever emigrated for the sake of worldly gain or for a woman to marry, then his emigration will be considered to be for what he emigrated for."

Sahih al-Bukhari 6689 https://sunnah.com/bukhari:6689

The isnad is this:

حَدَّثَنَا قُتَيْبَةُ بْنُ سَعِيدٍ، حَدَّثَنَا عَبْدُ الْوَهَّابِ، قَالَ سَمِعْتُ يَحْيَى بْنَ سَعِيدٍ، يَقُولُ أَخْبَرَنِي مُحَمَّدُ بْنُ إِبْرَاهِيمَ، أَنَّهُ سَمِعَ عَلْقَمَةَ بْنَ وَقَّاصٍ اللَّيْثِيَّ، يَقُولُ سَمِعْتُ عُمَرَ بْنَ الْخَطَّابِ

Qutaybah bin Saeed narrated from Abdul Wahhab, who said he heard Yahya bin Saeed, who said it was reported to him by Muhammad bin Ibrahim, who heard Alqamah bin Waqqas al-Laythi, who said he heard from Umar bin al-Khattab.

Ukhbert is used first to identify the names of the narrators in the text, and then to find out who they are.

Ukhbert Narrator Detection

This is the first half of the Ukhbert pair. It identifies the names of narrators in a text by classifying each token into one of three classes: Beginning Narrator "B-NAR", Intermediate Narrator "I-NAR", and other "O". To recover the names of narrators, we gather the "NAR" spans: each narrator name is the concatenation of one B-NAR token with zero or more I-NAR tokens that follow it.
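The grouping rule described above can be sketched in a few lines of Python. This is an illustration only, not the model's own decoding code: the helper name `group_narrators` is hypothetical, and it assumes one tag per whole word, whereas the model itself tags subword tokens. A stray I-NAR with no preceding B-NAR is ignored here, which is a design choice of this sketch.

```python
def group_narrators(tokens, tags):
    """Merge per-token B-NAR / I-NAR / O tags into narrator names.

    Hypothetical helper: assumes `tokens` are whole words and `tags`
    holds one label per token.
    """
    names, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-NAR":
            # A new name starts; close any name still open.
            if current:
                names.append(" ".join(current))
            current = [token]
        elif tag == "I-NAR" and current:
            # Continue the name that is currently open.
            current.append(token)
        else:
            # "O" (or a stray I-NAR) ends any open name.
            if current:
                names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


# Transliterated tokens, for illustration only.
tokens = ["haddathana", "Qutaybah", "bin", "Saeed", "haddathana", "Abd", "al-Wahhab"]
tags = ["O", "B-NAR", "I-NAR", "I-NAR", "O", "B-NAR", "I-NAR"]
print(group_narrators(tokens, tags))
```

In practice the transformers pipeline performs this grouping itself when an aggregation strategy is set, so a helper like this is only needed when working with raw per-token predictions.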

Usage

It is strongly recommended to use the pipeline function from the transformers library for inference. Make sure to preprocess the text first by removing diacritics and ensuring that only Arabic text is in the corpus.
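One possible way to do this preprocessing is sketched below. The function names and the exact character ranges are assumptions, not part of the model's published tooling: diacritics are taken to be U+064B through U+0652 plus the superscript alef U+0670, and "Arabic text" is taken to mean anything in the Unicode Arabic block (U+0600 to U+06FF) plus whitespace. How strict the filter should be for a given corpus is a judgment call.

```python
import re

# Assumed set of diacritics: fathatan..sukun and superscript alef.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
# Anything outside the Arabic Unicode block that is not whitespace.
NON_ARABIC = re.compile(r"[^\u0600-\u06FF\s]")


def strip_diacritics(text):
    """Remove short-vowel and related marks from Arabic text."""
    return DIACRITICS.sub("", text)


def keep_arabic_only(text):
    """Replace non-Arabic characters with spaces, then collapse whitespace."""
    text = NON_ARABIC.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()


def preprocess(text):
    """Strip diacritics first, then drop non-Arabic characters."""
    return keep_arabic_only(strip_diacritics(text))
```

Diacritics are removed before filtering because they fall inside the Arabic block and would otherwise pass through the second step untouched.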

Here is a working example:

from transformers import pipeline

model_id = 'HikmaLabs/ukhbert_narrator_detection'
# aggregation_strategy="simple" groups the tagged tokens into whole narrator spans
pipe = pipeline('ner', model=model_id, tokenizer=model_id, config=model_id, aggregation_strategy="simple")
# You can also pass a list of strings or a Hugging Face dataset; see the pipelines documentation
hadith = 'حدثنا قتيبة بن سعيد، حدثنا عبد الوهاب، قال سمعت يحيى بن سعيد، يقول أخبرني محمد بن إبراهيم، أنه سمع علقمة بن وقاص الليثي، يقول سمعت عمر بن الخطاب ـ رضى الله عنه ـ يقول سمعت رسول الله صلى الله عليه وسلم يقول " إنما الأعمال بالنية، وإنما لامرئ ما نوى، فمن كانت هجرته إلى الله ورسوله فهجرته إلى الله ورسوله، ومن كانت هجرته إلى دنيا يصيبها أو امرأة يتزوجها، فهجرته إلى ما هاجر إليه "'
ner_results = pipe(hadith)

The output of ner_results is this:

[{'entity_group': 'NAR',
  'score': np.float32(0.99935853),
  'word': 'قتيبة بن سعيد',
  'start': 6,
  'end': 19},
 {'entity_group': 'NAR',
  'score': np.float32(0.9982073),
  'word': 'عبد الوهاب',
  'start': 27,
  'end': 37},
 {'entity_group': 'NAR',
  'score': np.float32(0.998333),
  'word': 'يحيى بن سعيد',
  'start': 48,
  'end': 60},
 {'entity_group': 'NAR',
  'score': np.float32(0.9989322),
  'word': 'محمد بن إبراهيم',
  'start': 74,
  'end': 89},
 {'entity_group': 'NAR',
  'score': np.float32(0.99043906),
  'word': 'علقمة بن وقاص الليثي',
  'start': 99,
  'end': 119},
 {'entity_group': 'NAR',
  'score': np.float32(0.9782518),
  'word': 'عمر بن الخطاب',
  'start': 131,
  'end': 144}]
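To turn a result list of this shape into an ordered chain of narrator names, something like the following sketch may help. The helper `narrator_chain` and its `min_score` parameter are hypothetical; the dictionary keys are the ones shown in the output above. When the original text is supplied, names are sliced out by their `start` and `end` offsets rather than taken from `word`, on the assumption that the offsets index into the exact string passed to the pipeline.

```python
def narrator_chain(ner_results, text=None, min_score=0.0):
    """Return narrator names in order of appearance.

    Hypothetical helper: `ner_results` is a list of dicts with the keys
    'entity_group', 'score', 'word', 'start' and 'end'.
    """
    kept = [
        r for r in ner_results
        if r["entity_group"] == "NAR" and float(r["score"]) >= min_score
    ]
    # Order by position in the text, i.e. from the collector backwards in time.
    kept.sort(key=lambda r: r["start"])
    if text is not None:
        # Slice the original string so spacing matches the source exactly.
        return [text[r["start"]:r["end"]] for r in kept]
    return [r["word"] for r in kept]


# Placeholder data, for illustration only (not real model output).
sample = [
    {"entity_group": "NAR", "score": 0.99, "word": "B", "start": 10, "end": 11},
    {"entity_group": "NAR", "score": 0.50, "word": "A", "start": 0, "end": 1},
]
print(narrator_chain(sample))
print(narrator_chain(sample, min_score=0.9))
```

With the example above, a call such as `narrator_chain(ner_results, text=hadith)` would list the six detected narrators in the order they appear in the isnad.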

Training

Ukhbert is fine-tuned from the AraBERT v0.1 model. For more details, please consult the referenced paper, "Learning to Identify Narrators in Classical Arabic Texts".

Citing

If you use the model, please cite this paper:

@article{ALKAOUD2021335,
title = {Learning to Identify Narrators in Classical Arabic Texts},
journal = {Procedia Computer Science},
volume = {189},
pages = {335-342},
year = {2021},
note = {AI in Computational Linguistics},
issn = {1877-0509},
doi = {https://doi.org/10.1016/j.procs.2021.05.109},
url = {https://www.sciencedirect.com/science/article/pii/S1877050921012369},
author = {Mohamed Alkaoud and Mairaj Syed},
keywords = {NLP, Classical Arabic, Entity linking, Named-entity recognition, Digital humanities, Hadith science},
abstract = {One widespread historical method of transmitting and recording information about important events and people in the Middle East is the narration-based method. In this method, each saying about a person or event is transmitted from person to person until a systematic collector records and compiles such sayings in a stable collection. At each stage of transmission, the narrator not only transmits the saying but also the person he got it from going back to the earliest narrator. Identifying each narrator in these collections is important to better measure the accuracy of the narrations and identify the date and geographies of their circulation. In this work, we propose a natural language processing technique to automate the identification of narrators in classical Arabic texts. Our proposed technique consists of two models: 1) a model for detecting the narrators in the text, and 2) a model for linking narrators to their biographies. We train our two models on a large collection of annotated classical Arabic texts and achieve F1-scores of 96.15% and 95.74% for narration detection and linking respectively.}
}