# mBART Fine-tuned on OpenSubtitles
This repository contains a fine-tuned version of the mBART model, optimized for translation and text generation on the OpenSubtitles dataset. Fine-tuning was performed on Farsi (fa) language data, making this model particularly effective for Farsi subtitling and conversational tasks. It leverages the strengths of the facebook/mbart-large-50 base model.
## Table of Contents
- Overview
- Model Details
- Repository Structure
- Installation
- Usage
- Fine-Tuning and Training
- Performance
- Contributing
- License
## Overview
The mBART model, originally pre-trained on a large corpus of multilingual data, has been fine-tuned on the OpenSubtitles dataset with a focus on Farsi. This fine-tuning enhances its ability to generate fluent, contextually relevant text in subtitle and conversational formats.
## Model Details
- Model Name: mbart-finetuned-opensubtitle
- Base Model: facebook/mbart-large-50
- Fine-Tuning Dataset: OpenSubtitles
- Target Language: Farsi (fa)
- Intended Tasks: Translation, text generation, and conversational language processing
## Repository Structure
- .gitattributes: Git attributes configuration.
- README.md: This documentation file.
- config.json: Configuration file with model architecture details and hyperparameters.
- generation_config.json: Specifies decoding and generation parameters.
- model.bin: Binary file containing the model weights (handled via Git LFS).
- sentencepiece.bpe.model: SentencePiece model used for tokenization (handled via Git LFS).
- special_tokens_map.json: Mapping of special tokens for the tokenizer.
- tokenizer.json: Tokenizer data file (handled via Git LFS).
- tokenizer_config.json: Configuration file for the tokenizer.
## Installation
Before cloning this repository, ensure that you have Git LFS installed, as several large files (e.g., model weights and tokenizer files) are managed through Git LFS.
## Usage
Below is an example of how to load and use the model with the Hugging Face Transformers library:
```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
tokenizer = MBart50TokenizerFast.from_pretrained("ghaskari/mbart-finetuned-opensubtitle")
model = MBartForConditionalGeneration.from_pretrained("ghaskari/mbart-finetuned-opensubtitle")

# mBART-50 tokenizers are multilingual; set the source language to Farsi
tokenizer.src_lang = "fa_IR"

# Example input text for translation or text generation
input_text = "سلام، حال شما چطور است؟"  # "Hello, how are you?" in Farsi
encoded_input = tokenizer(input_text, return_tensors="pt")

# Generate output (e.g., a translation or subtitle-style completion).
# To force generation in a specific target language, pass forced_bos_token_id,
# e.g. forced_bos_token_id=tokenizer.lang_code_to_id["en_XX"] for English.
output_ids = model.generate(**encoded_input)
output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print("Generated Output:", output_text)
```
## Fine-Tuning and Training
The fine-tuning process utilized the OpenSubtitles dataset to adapt the mBART model for Farsi conversational and subtitle-style text. Although the training scripts are not provided in this repository, a typical fine-tuning workflow includes:
- Data Preparation: Preprocessing and cleaning the OpenSubtitles dataset.
- Training Configuration: Setting up learning rates, batch sizes, and decoding strategies (e.g., beam search or sampling).
- Hardware Requirements: Using multi-GPU setups to speed up training.
Feel free to build your own fine-tuning scripts based on these guidelines.
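As an illustration of the data-preparation step, the sketch below cleans raw subtitle lines before training. The helper names are hypothetical (not part of this repository); the idea is to strip formatting tags and bracketed sound cues, normalize whitespace, and drop lines too short to be useful.

```python
import re

def clean_subtitle_line(line: str) -> str:
    """Normalize one raw subtitle line (illustrative sketch, not the actual pipeline)."""
    text = re.sub(r"<[^>]+>", "", line)       # drop formatting tags like <i>...</i>
    text = re.sub(r"\[[^\]]*\]", "", text)    # drop bracketed sound cues like [music]
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text

def prepare_lines(lines, min_chars=2):
    """Clean subtitle lines and keep only those long enough to train on."""
    cleaned = (clean_subtitle_line(line) for line in lines)
    return [c for c in cleaned if len(c) >= min_chars]

# Example: tags and sound cues are removed, the empty cue-only line is dropped
print(prepare_lines(["<i>سلام!</i>", "[موسیقی]", "  حال شما   چطور است؟  "]))
# → ['سلام!', 'حال شما چطور است؟']
```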
## Performance
While specific evaluation metrics (e.g., BLEU or ROUGE scores) are not provided here, the model has been qualitatively assessed for its ability to generate coherent and contextually accurate outputs in Farsi. Users are encouraged to benchmark the model on their own datasets and share feedback for further improvements.
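For a quick quantitative check, corpus BLEU can be computed with an established tool such as sacrebleu. To show the idea without extra dependencies, here is a minimal pure-Python sentence-BLEU sketch (uniform n-gram weights plus a brevity penalty); use a standard implementation for any reported numbers.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis: str, reference: str, max_n: int = 4) -> float:
    """Sentence-level BLEU with uniform weights and brevity penalty (illustrative)."""
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_ngrams & ref_ngrams).values())  # clipped n-gram matches
        total = max(sum(hyp_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # plain BLEU is zero if any n-gram level has no matches
        log_precisions.append(math.log(overlap / total))
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

print(round(bleu("حال شما چطور است", "حال شما چطور است"), 2))  # identical strings → 1.0
```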
## Contributing
Contributions are welcome! If you have suggestions, improvements, or bug fixes, please submit a pull request or open an issue. For major changes, please discuss them via an issue first.
## License
This repository is distributed under the MIT License. See the LICENSE file for more details.