Model Card: Persian Keyword Extraction Model

Model Details

  • Model Name: keyword_Roberta_large_per
  • Base Model: xlm-roberta-large
  • Task: Keyword Extraction
  • Language: Persian (Farsi)
  • Developer: PakdamanAli
  • Model Version: 1.0.0

Intended Use

This model is designed to extract keywords from Persian text. It can be used for:

  • Automatic tagging of content
  • Search engine optimization
  • Content categorization
  • Topic modeling
  • Information retrieval enhancement

Primary Intended Uses

  • Content analysis for Persian websites
  • Academic research on Persian text
  • Information extraction systems

Out-of-Scope Use Cases

  • Translation services
  • Text summarization
  • Persian named entity recognition (the model is not trained for this task)
  • Other NLP tasks beyond keyword extraction

Training Data

  • Dataset Size: 40,000 Persian text samples
  • Training Procedure: xlm-roberta-large fine-tuned on this dataset

Performance Evaluation

Metrics and evaluation results will be published in a future update.
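
Until official results are available, users who have their own annotated Persian texts could measure extraction quality themselves. The sketch below computes set-based precision, recall, and F1 for a single document. It is not the developer's evaluation protocol: the function name keyword_prf and the exact-match comparison are assumptions made for illustration.

```python
def keyword_prf(predicted, gold):
    """Set-based precision, recall, and F1 for one document's keywords.

    Illustrative helper (not part of the model or of Transformers).
    Keywords are compared by exact string match after stripping whitespace.
    """
    pred_set = {k.strip() for k in predicted if k.strip()}
    gold_set = {k.strip() for k in gold if k.strip()}
    true_pos = len(pred_set & gold_set)

    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hand-written example lists, not model output
predicted = ["ایران", "تاریخ", "فرهنگ"]
gold = ["ایران", "فرهنگ", "گردشگری", "جاذبه"]
precision, recall, f1 = keyword_prf(predicted, gold)
```

Exact matching is strict: a predicted keyword that differs from the reference only by a suffix or a zero-width non-joiner counts as a miss, so scores obtained this way are a lower bound on perceived quality.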

Limitations

  • The model may not perform well on domain-specific content that was not represented in the training data
  • Performance may vary for very short or extremely long texts
  • The model may occasionally extract words that are not truly "key" to the content
  • Dialect variations in Persian might affect extraction quality
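
One way to work around the long-text limitation is to split the input into sentence-based chunks and run the extractor on each chunk separately. The following is a minimal sketch under stated assumptions: the helper split_into_chunks and the max_chars value are hypothetical, and character count is only a rough stand-in for the 512-token input limit of xlm-roberta-large.

```python
import re


def split_into_chunks(text, max_chars=1000):
    """Group sentences into chunks of at most max_chars characters.

    Illustrative helper. A single sentence longer than max_chars is kept
    whole rather than being cut in the middle of a word.
    """
    # Split after sentence-ending punctuation, including the Persian
    # question mark (U+061F).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?؟])\s+", text) if s.strip()]

    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be passed to the pipeline on its own and the resulting keyword lists merged. Keywords that depend on context spanning a chunk boundary may be missed with this approach.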

Ethical Considerations

  • The model is trained on Persian text and may reflect biases present in that content
  • Users should review extracted keywords for sensitive content before using them in automated systems
  • The model should not be used to extract or analyze personally identifiable information without proper consent

Technical Specifications

  • Input: Persian text (UTF-8 encoded)
  • Output: List of extracted keywords
  • Framework: Transformers (Hugging Face)
  • Requirements: PyTorch, Transformers

Pipeline Usage

To use this model with the Hugging Face pipeline:

from transformers import pipeline

# Initialize the pipeline
keyword_extractor = pipeline(
    task="token-classification",
    model="PakdamanAli/keyword_Roberta_large_per",
    tokenizer="PakdamanAli/keyword_Roberta_large_per"
)

# Example usage
text = "ایران کشوری با تاریخ و فرهنگ غنی است که دارای جاذبه‌های گردشگری فراوان می‌باشد."
keywords = keyword_extractor(text)

# Each result is a dict with keys such as "entity", "score", and "word".
# Adapt the post-processing to the labels this model emits, for example:
# extracted_keywords = [item["word"] for item in keywords]

Example

from transformers import pipeline

extractor = pipeline(
    task="token-classification",
    model="PakdamanAli/keyword_Roberta_large_per",
    tokenizer="PakdamanAli/keyword_Roberta_large_per"
)

text = "ایران کشوری با تاریخ و فرهنگ غنی است که دارای جاذبه‌های گردشگری فراوان می‌باشد."
results = extractor(text)

# Extract just the words from the results.
# Without an aggregation strategy these can be subword pieces, not whole words.
keywords = [item["word"] for item in results]
print(keywords)
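
The raw pipeline output usually needs some cleanup before it is usable as a keyword list. The sketch below filters by confidence, strips the SentencePiece word-start marker used by XLM-R tokenizers, and removes duplicates while keeping order. The helper collect_keywords and the 0.5 threshold are assumptions for illustration, and the sample results are hand-written, not actual output of this model.

```python
def collect_keywords(results, min_score=0.5):
    """Turn token-classification output into a deduplicated keyword list.

    Illustrative helper. Expects a list of dicts with a "word" key and an
    optional "score" key, as returned by the token-classification pipeline.
    """
    keywords, seen = [], set()
    for item in results:
        # Skip low-confidence predictions; a missing score is treated as 1.0.
        if float(item.get("score", 1.0)) < min_score:
            continue
        # The XLM-R SentencePiece tokenizer marks word starts with U+2581.
        word = item["word"].replace("\u2581", " ").strip()
        if word and word not in seen:
            seen.add(word)
            keywords.append(word)
    return keywords


# Hand-written sample in the pipeline's output shape (not real model output)
sample_results = [
    {"word": "\u2581ایران", "score": 0.99},
    {"word": "\u2581ایران", "score": 0.90},
    {"word": "\u2581فرهنگ", "score": 0.30},
    {"word": "گردشگری", "score": 0.80},
]
print(collect_keywords(sample_results))
```

This helper does not join subword pieces back into whole words. Passing aggregation_strategy="simple" when creating the pipeline asks Transformers to group adjacent tokens; whether the grouping is useful here depends on the label scheme this model was trained with, which the card does not document.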