About Ukhbert

Ukhbert is a pair of BERT-base transformer models for use with Classical Arabic texts. Specifically, it is used with reports composed of an isnad, the chain of narrators who transmit the report, and a matn, the actual content. Ukhbert identifies the narrators within an isnad from the text of the report.

An example of matn and isnad

For example, consider the following text, which is a hadith or attribution to the Prophet Muhammad:

حَدَّثَنَا قُتَيْبَةُ بْنُ سَعِيدٍ، حَدَّثَنَا عَبْدُ الْوَهَّابِ، قَالَ سَمِعْتُ يَحْيَى بْنَ سَعِيدٍ، يَقُولُ أَخْبَرَنِي مُحَمَّدُ بْنُ إِبْرَاهِيمَ، أَنَّهُ سَمِعَ عَلْقَمَةَ بْنَ وَقَّاصٍ اللَّيْثِيَّ، يَقُولُ سَمِعْتُ عُمَرَ بْنَ الْخَطَّابِ ـ رضى الله عنه ـ يَقُولُ سَمِعْتُ رَسُولَ اللَّهِ صلى الله عليه وسلم يَقُولُ "إِنَّمَا الأَعْمَالُ بِالنِّيَّةِ، وَإِنَّمَا لاِمْرِئٍ مَا نَوَى، فَمَنْ كَانَتْ هِجْرَتُهُ إِلَى اللَّهِ وَرَسُولِهِ فَهِجْرَتُهُ إِلَى اللَّهِ وَرَسُولِهِ، وَمَنْ كَانَتْ هِجْرَتُهُ إِلَى دُنْيَا يُصِيبُهَا أَوِ امْرَأَةٍ يَتَزَوَّجُهَا، فَهِجْرَتُهُ إِلَى مَا هَاجَرَ إِلَيْهِ".

Narrated `Umar bin Al-Khattab: I heard Allah's Messenger (ﷺ) saying, "The (reward of) deeds, depend upon the intentions and every person will get the reward according to what he has intended. So whoever emigrated for the sake of Allah and His Apostle, then his emigration will be considered to be for Allah and His Apostle, and whoever emigrated for the sake of worldly gain or for a woman to marry, then his emigration will be considered to be for what he emigrated for."

Sahih al-Bukhari 6689 https://sunnah.com/bukhari:6689

The isnad is this:

حَدَّثَنَا قُتَيْبَةُ بْنُ سَعِيدٍ، حَدَّثَنَا عَبْدُ الْوَهَّابِ، قَالَ سَمِعْتُ يَحْيَى بْنَ سَعِيدٍ، يَقُولُ أَخْبَرَنِي مُحَمَّدُ بْنُ إِبْرَاهِيمَ، أَنَّهُ سَمِعَ عَلْقَمَةَ بْنَ وَقَّاصٍ اللَّيْثِيَّ، يَقُولُ سَمِعْتُ عُمَرَ بْنَ الْخَطَّابِ

Qutaybah bin Saeed narrated from Abdul Wahhab, who said he heard Yahya bin Saeed, who said it was reported to him by Muhammad bin Ibrahim, who heard Alqamah bin Waqqas al-Laythi, who said he heard it from Umar bin al-Khattab.

Ukhbert is used first to identify the narrators' names in the text, and then to determine who those narrators are.

Ukhbert Narrator Linking

This is the second half of the Ukhbert pair. It links each narrator name to an ID from which the user can look up information about the narrator. Given a comma-separated list of names, where the spaces within each name are replaced with '_', the model outputs a label of the form L{Narrator_ID} for each token it identifies.
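As a small illustration of the label format, the sketch below turns a label such as L6460 into a numeric ID. It assumes the ID part is always a plain integer, which matches the example outputs in this card but is not otherwise confirmed; the helper name `parse_narrator_label` is hypothetical and is not part of the utility code.

```python
import re

# Assumption: labels look like "L" followed by decimal digits, e.g. "L6460".
LABEL_PATTERN = re.compile(r"L(\d+)")


def parse_narrator_label(label):
    """Return the numeric narrator ID for a label, or None if it does not match."""
    match = LABEL_PATTERN.fullmatch(label)
    return int(match.group(1)) if match else None


print(parse_narrator_label("L6460"))  # 6460
```

Returning None for anything that does not match keeps unexpected labels visible to the caller instead of raising.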

Uses

The model is intended to find narrator IDs from the text of narrator names. Once the narrator ID is retrieved, we recommend using this space to retrieve information about the identified narrator. We strongly recommend following the instructions on how to use the model in code, as the model's predictions are very sensitive to how it is run.

Out-of-Scope Use

This model is closely tied to the first half of Ukhbert, and we recommend using the two together. The utility functions we recommend have not been generalized to work with models other than ours. Use with caution.

How to Get Started with the Model

The most successful way (and likely the only successful way) to use this model is to tokenize the inputs, map the tokens to words, run the model, and then aggregate the predictions so that they map to the proper names. So far we have not been able to get the outputs we want from the HF pipeline. Therefore, do not use the pipeline for inference.
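The aggregation step can be sketched in plain Python. This is only an illustration under stated assumptions, not the utility's actual logic: it assumes per-token score rows, a `word_ids` list shaped like the one a fast tokenizer's `word_ids()` returns (None for special tokens), and mean-pooling of sub-token scores per word. The real utility may aggregate differently, for example by using only the first sub-token. The function name and the toy data are hypothetical.

```python
def aggregate_by_word(word_ids, token_scores, id2label):
    """Average sub-token scores per word, then pick the best label for each word."""
    grouped = {}
    for word_id, scores in zip(word_ids, token_scores):
        if word_id is None:  # special tokens such as [CLS] and [SEP]
            continue
        grouped.setdefault(word_id, []).append(scores)

    labels = []
    for word_id in sorted(grouped):
        rows = grouped[word_id]
        # Column-wise mean over all sub-tokens that belong to this word.
        mean = [sum(column) / len(rows) for column in zip(*rows)]
        best = max(range(len(mean)), key=mean.__getitem__)
        labels.append(id2label[best])
    return labels


# Toy data: two words, the first split into two sub-tokens, two possible labels.
word_ids = [None, 0, 0, 1, None]
token_scores = [[0.0, 0.0], [2.0, 0.5], [1.0, 0.1], [0.2, 3.0], [0.0, 0.0]]
id2label = {0: "L6460", 1: "L5280"}

print(aggregate_by_word(word_ids, token_scores, id2label))  # ['L6460', 'L5280']
```

Because each underscore-joined name is a single whitespace-delimited word, grouping by word index yields one label per narrator name.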

Furthermore, unlike Ukhbert narrator detection, the data must be preprocessed with an extra step. Your list of narrators must be a single string in which the names are separated by a comma and a space, and the spaces between the parts of each name are replaced with underscores. Here is one example (note that it uses the Arabic comma '،'):

["قتيبة_بن_سعيد، عبد_الوهاب، يحيى_بن_سعيد، محمد_بن_إبراهيم، علقمة_بن_وقاص_الليثي، عمر_بن_الخطاب"]

You may have multiple lists of names, one for each isnad. Each list can contain any number of names.
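The preprocessing step described above can be sketched as a small helper. It assumes the separator is the Arabic comma followed by a space, as in the example string; the helper name `format_isnad` is hypothetical and is not part of the utility code.

```python
ARABIC_COMMA = "\u060C"  # the character '،' used in the example above


def format_isnad(names):
    """Join narrator names into one string: underscores inside names, '، ' between them."""
    cleaned = ["_".join(name.split()) for name in names if name.strip()]
    return (ARABIC_COMMA + " ").join(cleaned)


# One inner list per isnad; each inner list holds that isnad's narrator names.
isnads = [
    ["قتيبة بن سعيد", "عبد الوهاب", "يحيى بن سعيد"],
]
to_predict = [format_isnad(names) for names in isnads]
print(to_predict)
```

Splitting on whitespace before joining also collapses repeated spaces inside a name, so the result never contains a double underscore.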

We have compiled utility code that, so far, is the only way we have found to get the intended results. This is because one of the model's original dependencies was the simpletransformers package, which is no longer maintained.

Requirements for working example

The code we use requires only the transformers library and the httpimport library. httpimport has no additional dependencies. We use it to import the utility functions that we have found to work.

Working example

This is a working example with the model and the utility code. We strongly recommend using it, as we have not found alternative code that works. However, feel free to edit it as you see fit, or to make a pull request to the Gist code if there are better approaches.

import httpimport

url = "https://gist.githubusercontent.com/FDSRashid/55c14a8e9ba360b640cfca0b612ccd9a/raw/6ae62db2d8aef3691b8e1260899aebf428ed1371"
with httpimport.remote_repo(url):
    # import the module that contains the utility function for predictions
    import narrator_link_utils

narrator_link_utils.predict_narrators("HikmaLabs/ukhbert_narrator_linking", to_predict=["قتيبة_بن_سعيد، عبد_الوهاب، يحيى_بن_سعيد، محمد_بن_إبراهيم، علقمة_بن_وقاص_الليثي، عمر_بن_الخطاب"])

([[{'قتيبة_بن_سعيد،': 'L6460'},
   {'عبد_الوهاب،': 'L5280'},
   {'يحيى_بن_سعيد،': 'L8272'},
   {'محمد_بن_إبراهيم،': 'L6796'},
   {'علقمة_بن_وقاص_الليثي،': 'L5719'},
   {'عمر_بن_الخطاب': 'L5913'}]],
 [[{'قتيبة_بن_سعيد،': [[np.float32(-2.8520968),
      np.float32(-2.7746296),
      np.float32(-2.0540233),
      np.float32(-1.2407739),
      np.float32(1.2954512),
      np.float32(-2.2372236),
      np.float32(-1.238617),
      np.float32(-2.3866634),
      np.float32(0.18253404),
      np.float32(-2.643002),
      np.float32(-2.656282),
      np.float32(-1.6206467),
      np.float32(-2.0487826),
      np.float32(-1.9577142),
      np.float32(-1.773613),
      np.float32(-0.9975122),
      np.float32(-0.38102615),
      np.float32(0.8699675),
      np.float32(-2.4253492),
...
      np.float32(-3.5996199),
      np.float32(-1.7653825),
      np.float32(-3.09321),
      np.float32(-3.2217002),
      ...]]}]])
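The returned value shown above appears to be a pair: per-isnad label predictions followed by raw scores. The sketch below flattens the first element into (name, label) pairs, stripping the trailing Arabic comma that stays attached to each key and restoring the spaces. It assumes the nesting is exactly as printed above (a list of isnads, each a list of single-entry dicts); the helper name `flatten_predictions` is hypothetical.

```python
ARABIC_COMMA = "\u060C"  # the trailing '،' that stays attached to each name key


def flatten_predictions(predictions):
    """Turn [[{name: label}, ...], ...] into [[(name, label), ...], ...]."""
    result = []
    for isnad in predictions:
        pairs = []
        for entry in isnad:
            for name, label in entry.items():
                clean_name = name.rstrip(ARABIC_COMMA).replace("_", " ")
                pairs.append((clean_name, label))
        result.append(pairs)
    return result


# Shape copied from the output above, shortened to two narrators.
sample = [[{"قتيبة_بن_سعيد،": "L6460"}, {"عمر_بن_الخطاب": "L5913"}]]
for name, label in flatten_predictions(sample)[0]:
    print(label, name)
```

With the real call, the input would be the first element of the returned pair, again assuming the return shape shown in the output above.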

Training

Ukhbert is fine-tuned from the AraBERT v0.1 model. For more details, please consult the referenced paper, "Learning to Identify Narrators in Classical Arabic Texts".

Citing

If you use the model, please cite this paper:

@article{ALKAOUD2021335,
title = {Learning to Identify Narrators in Classical Arabic Texts},
journal = {Procedia Computer Science},
volume = {189},
pages = {335-342},
year = {2021},
note = {AI in Computational Linguistics},
issn = {1877-0509},
doi = {https://doi.org/10.1016/j.procs.2021.05.109},
url = {https://www.sciencedirect.com/science/article/pii/S1877050921012369},
author = {Mohamed Alkaoud and Mairaj Syed},
keywords = {NLP, Classical Arabic, Entity linking, Named-entity recognition, Digital humanities, Hadith science},
abstract = {One widespread historical method of transmitting and recording information about important events and people in the Middle East is the narration-based method. In this method, each saying about a person or event is transmitted from person to person until a systematic collector records and compiles such sayings in a stable collection. At each stage of transmission, the narrator not only transmits the saying but also the person he got it from going back to the earliest narrator. Identifying each narrator in these collections is important to better measure the accuracy of the narrations and identify the date and geographies of their circulation. In this work, we propose a natural language processing technique to automate the identification of narrators in classical Arabic texts. Our proposed technique consists of two models: 1) a model for detecting the narrators in the text, and 2) a model for linking narrators to their biographies. We train our two models on a large collection of annotated classical Arabic texts and achieve F1-scores of 96.15% and 95.74% for narration detection and linking respectively.}
}

Model Card Authors

Ferdaws Rashid

Model Card Contact

[email protected]
