Model Datacard: Persian Keyword Extraction Model

Model Details

Model Name: keyword_Roberta_base_per
Base Model: xlm-roberta-large
Task: Keyword Extraction
Language: Persian (Farsi)
Developer: PakdamanAli
Model Version: 1.0.0

Intended Use

This model is designed to extract keywords from Persian text. It can be used for:

Automatic tagging of content
Search engine optimization
Content categorization
Topic modeling
Information retrieval enhancement

Primary Intended Uses

Content analysis for Persian websites
Academic research on Persian text
Information extraction systems

Out-of-Scope Use Cases

Translation services
Text summarization
Persian named entity recognition (unless specifically trained for this)
Other NLP tasks beyond keyword extraction

Training Data

Dataset Size: 40,000 Persian text samples
Training Procedure: Fine-tuned from the xlm-roberta-large checkpoint

Performance Evaluation

Metrics and evaluation results will be published in a future update.

Limitations

The model may not perform well on domain-specific content that was not represented in the training data
Performance may vary for very short or extremely long texts
The model may occasionally extract words that are not truly "key" to the content
Dialect variations in Persian might affect extraction quality
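One way to work around the long-text limitation is to split the input into smaller pieces and run the extractor on each piece separately. The sketch below is only an illustration: the helper name split_into_chunks and the max_chars budget are assumptions introduced here, not part of the model. It uses character count as a rough stand-in for the 512-token input limit of XLM-RoBERTa models.

```python
import re
from typing import List

# Split after sentence-ending punctuation, including the Persian question mark.
_SENTENCE_END = re.compile(r"(?<=[.!?؟])\s+")


def split_into_chunks(text: str, max_chars: int = 1000) -> List[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    Hypothetical helper: max_chars is a rough proxy for the token limit.
    A single sentence longer than max_chars is kept whole, not cut.
    """
    sentences = [s for s in _SENTENCE_END.split(text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


chunks = split_into_chunks("a. b. c.", max_chars=5)
print(chunks)
```

Each chunk can then be passed to the pipeline in turn and the per-chunk keywords merged. A token-based budget computed with the model's tokenizer would be more precise than a character count.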

Ethical Considerations

The model is trained on Persian text and may reflect biases present in that content
Users should review extracted keywords for sensitive content before using them in automated systems
The model should not be used to extract or analyze personally identifiable information without proper consent

Technical Specifications

Input: Persian text (UTF-8 encoded)
Output: List of extracted keywords
Framework: Transformers (Hugging Face)
Requirements: PyTorch, Transformers

Pipeline Usage

To use this model with the Hugging Face pipeline:

from transformers import pipeline

# Initialize the pipeline
keyword_extractor = pipeline(
    task="token-classification",
    model="PakdamanAli/keyword_Roberta_base_per",
    tokenizer="PakdamanAli/keyword_Roberta_base_per"
)

# Example usage
text = "ایران کشوری با تاریخ و فرهنگ غنی است که دارای جاذبه‌های گردشگری فراوان می‌باشد."
keywords = keyword_extractor(text)

# Each returned item is a dict with keys such as "word" and "score".
# Example: extracted_keywords = [item["word"] for item in keywords]

Example

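from transformers import pipeline

# aggregation_strategy="simple" groups adjacent tokens that share a label
# into one span; without it, item["word"] holds raw subword tokens.
# How spans are grouped depends on the model's label scheme.
extractor = pipeline(
    task="token-classification",
    model="PakdamanAli/keyword_Roberta_base_per",
    tokenizer="PakdamanAli/keyword_Roberta_base_per",
    aggregation_strategy="simple",
)

text = "ایران کشوری با تاریخ و فرهنگ غنی است که دارای جاذبه‌های گردشگری فراوان می‌باشد."
results = extractor(text)

# Each result is a dict with "entity_group", "score", "word", "start", "end"
keywords = [item["word"] for item in results]
print(keywords)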
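The list comprehension above keeps every returned item, including repeats and low-confidence predictions. A minimal post-processing sketch follows. It assumes only that each item carries the "word" and "score" keys that the token-classification pipeline returns. The helper name select_keywords and the 0.5 threshold are illustrative choices, not values recommended by the model author, and the sample list is hand-written rather than real model output.

```python
from typing import Dict, List


def select_keywords(results: List[Dict], min_score: float = 0.5) -> List[str]:
    """Return unique keywords scoring at least min_score, in first-seen order."""
    seen = set()
    keywords: List[str] = []
    for item in results:
        # Drop the SentencePiece word-boundary marker if it is present.
        word = item["word"].replace("\u2581", " ").strip()
        if not word or item["score"] < min_score:
            continue
        if word not in seen:
            seen.add(word)
            keywords.append(word)
    return keywords


# Hand-written sample in the shape of pipeline output (not real model output).
sample = [
    {"word": "ایران", "score": 0.98},
    {"word": "گردشگری", "score": 0.91},
    {"word": "ایران", "score": 0.87},
    {"word": "است", "score": 0.20},
]
print(select_keywords(sample))
```

With real output, calling select_keywords(results) in place of the list comprehension applies the same filtering. The threshold is best tuned on a few representative documents.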