# Data Card for EuroBERT-210m-finetuned-imdb
## Model Overview
- Model Name: EuroBERT-210m-finetuned-imdb
- Base Model: EuroBERT-210m
- Fine-tuned On: IMDb dataset
- Task: Masked Language Modeling (MLM)
- Training Objective: Minimize Perplexity
## Dataset Details
- Dataset Used: IMDb
- Dataset Version: default version from the `datasets` library
- Dataset Source: Hugging Face `datasets` (see the loading sketch below)
- Training Split: `train`
- Evaluation Split: `test`
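The dataset can be pulled straight from the Hugging Face Hub. The snippet below is a minimal sketch; the `imdb` dataset identifier and split handling are assumptions rather than a record of the exact preprocessing used for this checkpoint.

```python
from datasets import load_dataset

# Minimal sketch (assumed identifier): load IMDb from the Hugging Face Hub.
imdb = load_dataset("imdb")

train_dataset = imdb["train"]  # used for fine-tuning
test_dataset = imdb["test"]    # used for perplexity evaluation

print(train_dataset)
```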
## Training & Evaluation
### Training Process
- The model was fine-tuned for three epochs using PyTorch and Hugging Face's `transformers` library.
- The optimizer and learning rate scheduler were set up within the `accelerate` framework (a sketch of this setup appears below).
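The exact hyperparameters are not recorded in this card. The sketch below shows how such a run is typically wired together with `accelerate`; the optimizer, learning rate, masking probability, batch size, and sequence length are illustrative assumptions, not the values used to produce this checkpoint.

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from datasets import load_dataset
from accelerate import Accelerator
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    get_scheduler,
)

# Assumed hyperparameters for illustration only.
checkpoint = "EuroBERT/EuroBERT-210m"
num_epochs = 3
learning_rate = 5e-5

tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(checkpoint, trust_remote_code=True)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

dataset = load_dataset("imdb")
tokenized = dataset.map(tokenize, batched=True, remove_columns=["text", "label"])

# Random masking is applied on the fly by the MLM data collator.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
train_dataloader = DataLoader(tokenized["train"], shuffle=True, batch_size=8, collate_fn=collator)

optimizer = AdamW(model.parameters(), lr=learning_rate)
lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_epochs * len(train_dataloader),
)

# accelerate handles device placement and (optionally) distributed training.
accelerator = Accelerator()
model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler
)

for epoch in range(num_epochs):
    model.train()
    for batch in train_dataloader:
        loss = model(**batch).loss  # MLM cross-entropy over masked positions
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
```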
### Evaluation Metrics
- The model was evaluated using Perplexity (PPL) on the test set.
- Results:
  - Epoch 0: PPL = 12.63
  - Epoch 1: PPL = 9.35
  - Epoch 2: PPL = 8.12
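Perplexity is the exponential of the mean cross-entropy loss over the evaluation batches, i.e. PPL = exp(mean loss). A minimal sketch of that computation, assuming a `model`, an `eval_dataloader` built from the tokenized `test` split, and an `Accelerator` as in the training sketch above:

```python
import math
import torch

def evaluate_perplexity(model, eval_dataloader, accelerator):
    """Compute PPL = exp(mean cross-entropy loss) over the evaluation set."""
    model.eval()
    losses = []
    for batch in eval_dataloader:
        with torch.no_grad():
            loss = model(**batch).loss
        # Gather per-batch losses across processes before averaging.
        losses.append(accelerator.gather(loss.repeat(batch["input_ids"].shape[0])))
    mean_loss = torch.cat(losses).mean()
    return math.exp(mean_loss.item())
```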
## Model Usage
### Inference
The model can be used for masked token prediction using the following script:
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer


def predict_masked_sentence(sentence, mask_token="<|mask|>"):
    """
    Predicts the top-1 token for every mask token in a sentence and returns the reconstructed text.

    Args:
        sentence (str): Input sentence containing mask tokens (e.g., "The movie was <|mask|>!").
        mask_token (str, optional): Token used as the mask in the input sentence. Defaults to "<|mask|>".

    Returns:
        str: Sentence with all mask tokens replaced by the top-1 predictions.
    """
    model_checkpoint = "milanvelinovski/EuroBERT-210m-finetuned-imdb"
    model = AutoModelForMaskedLM.from_pretrained(model_checkpoint, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, trust_remote_code=True)

    # Map the user-facing mask token to the model's own mask token.
    sentence_with_model_mask = sentence.replace(mask_token, tokenizer.mask_token)
    inputs = tokenizer(sentence_with_model_mask, return_tensors="pt")

    with torch.no_grad():
        token_logits = model(**inputs).logits

    # Locate every mask position and take the highest-scoring token at each one.
    mask_token_indices = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
    top_tokens = [torch.topk(token_logits[0, idx, :], 1).indices.item() for idx in mask_token_indices]

    # Stitch the predictions back into the original sentence.
    text_parts = sentence.split(mask_token)
    final_text = text_parts[0] + ''.join(
        tokenizer.decode([token]) + text_parts[i + 1] for i, token in enumerate(top_tokens)
    )
    return final_text


text = "The protagonist's journey was <|mask|>, filled with <|mask|> obstacles that made the ending feel <|mask|>."
final_text = predict_masked_sentence(text)
print(final_text)
```
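For quick experiments, the checkpoint may also work with the `fill-mask` pipeline; this is an untested sketch and assumes the pipeline supports this custom architecture when `trust_remote_code=True` is passed.

```python
from transformers import pipeline

# Sketch only: assumes fill-mask works with this custom architecture.
fill_mask = pipeline(
    "fill-mask",
    model="milanvelinovski/EuroBERT-210m-finetuned-imdb",
    trust_remote_code=True,
)

print(fill_mask("The movie was absolutely <|mask|>."))
```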
## Libraries Used
| Library | Version |
|---|---|
| datasets | 3.3.1 |
| transformers | 4.49.0 |
| evaluate | 0.4.3 |
| accelerate | 1.2.1 |
| torch | 2.5.1+cu121 |
## Model Limitations
- The model is fine-tuned solely for masked language modeling and may not transfer well to other NLP tasks without further adaptation.
- The perplexity scores suggest that additional fine-tuning or hyperparameter optimization could improve performance.
- Predictions are shaped by the IMDb movie-review domain and may not generalize well to other domains.
## Citation
If you use this model, please cite:
```bibtex
@misc{EuroBERT-210m-finetuned-imdb,
  author    = {Milan Velinovski},
  title     = {EuroBERT-210m-finetuned-imdb},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/milanvelinovski/EuroBERT-210m-finetuned-imdb}
}
```