Model Card for EuroBERT-210m-finetuned-imdb

Model Overview

  • Model Name: EuroBERT-210m-finetuned-imdb
  • Base Model: EuroBERT-210m
  • Fine-tuned On: IMDb dataset
  • Task: Masked Language Modeling (MLM)
  • Training Objective: Minimize masked-token cross-entropy (tracked as perplexity)

Dataset Details

  • Dataset Used: IMDb
  • Dataset Version: Default configuration from the datasets library
  • Dataset Source: Hugging Face datasets
  • Training Split: train
  • Evaluation Split: test
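
For reference, the splits above can be loaded with the datasets library. This is a minimal sketch; IMDb also ships an "unsupervised" split that is not listed above.

from datasets import load_dataset

# Load IMDb from the Hugging Face Hub (default configuration).
imdb = load_dataset("imdb")
print(imdb)  # DatasetDict with "train", "test", and "unsupervised" splits
print(imdb["train"][0]["text"][:100])  # first 100 characters of the first review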

Training & Evaluation

Training Process

  • The model was fine-tuned for three epochs using PyTorch and Hugging Face's transformers library.
  • The optimizer and learning-rate scheduler were set up and run under the accelerate framework; a sketch of this setup follows below.
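
The original training script is not included in this card. The following is a minimal sketch of an MLM fine-tuning loop in this style; the base checkpoint id (EuroBERT/EuroBERT-210m), batch size, learning rate, sequence length, and masking probability are illustrative assumptions, not the values used for this model, and the tokenizer is assumed to provide mask and padding tokens.

from datasets import load_dataset
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, get_scheduler)
from accelerate import Accelerator

base_checkpoint = "EuroBERT/EuroBERT-210m"  # assumed Hub id for the base model
tokenizer = AutoTokenizer.from_pretrained(base_checkpoint, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(base_checkpoint, trust_remote_code=True)

# Tokenize the IMDb training split; max_length here is illustrative.
dataset = load_dataset("imdb")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)
tokenized_train = dataset["train"].map(tokenize, batched=True,
                                       remove_columns=dataset["train"].column_names)

# The collator masks tokens on the fly (15% is its default probability).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
train_loader = DataLoader(tokenized_train, batch_size=32, shuffle=True,
                          collate_fn=collator)

optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 3
lr_scheduler = get_scheduler("linear", optimizer=optimizer, num_warmup_steps=0,
                             num_training_steps=num_epochs * len(train_loader))

# accelerate handles device placement and, if configured, distributed training.
accelerator = Accelerator()
model, optimizer, train_loader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_loader, lr_scheduler)

for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        loss = model(**batch).loss
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()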

Evaluation Metrics

  • The model was evaluated after each epoch using perplexity (PPL) on the test set; a sketch of the computation follows this list.
  • Results:
    • Epoch 0: PPL = 12.63
    • Epoch 1: PPL = 9.35
    • Epoch 2: PPL = 8.12
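
The card does not include the evaluation code. Continuing from the training sketch above, perplexity would typically be computed as the exponential of the mean masked-LM loss over the test split (averaging per-batch losses is a common approximation of the exact token-weighted mean):

import math
import torch

# Tokenize the test split and reuse the masking collator so the evaluation
# loss matches the training objective.
tokenized_test = dataset["test"].map(tokenize, batched=True,
                                     remove_columns=dataset["test"].column_names)
eval_loader = accelerator.prepare(DataLoader(tokenized_test, batch_size=32,
                                             collate_fn=collator))

model.eval()
losses = []
for batch in eval_loader:
    with torch.no_grad():
        losses.append(model(**batch).loss.item())

# Perplexity is the exponential of the mean cross-entropy loss.
print(f"PPL: {math.exp(sum(losses) / len(losses)):.2f}")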

Model Usage

Inference

The model can be used for masked token prediction using the following script:

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

def predict_masked_sentence(sentence, mask_token="<|mask|>"):
    """
    Predicts the top-1 token for each mask token in a sentence and returns the reconstructed text.

    Args:
        sentence (str): Input sentence containing mask tokens (e.g., "The movie was <|mask|>!").
        mask_token (str, optional): Token used as the mask in the input sentence. Defaults to "<|mask|>".

    Returns:
        str: Sentence with all mask tokens replaced by their top-1 predictions.
    """
    model_checkpoint = "milanvelinovski/EuroBERT-210m-finetuned-imdb"
    model = AutoModelForMaskedLM.from_pretrained(model_checkpoint, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, trust_remote_code=True)
    model.eval()

    # Map a custom mask marker onto the model's own mask token before tokenizing.
    sentence_with_model_mask = sentence.replace(mask_token, tokenizer.mask_token)
    inputs = tokenizer(sentence_with_model_mask, return_tensors="pt")
    with torch.no_grad():
        token_logits = model(**inputs).logits

    # Locate every mask position and take the highest-scoring token at each one.
    mask_token_indices = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
    top_tokens = [torch.topk(token_logits[0, idx, :], 1).indices.item() for idx in mask_token_indices]

    # Reassemble the sentence by splicing the decoded predictions between the
    # segments of the original text.
    text_parts = sentence.split(mask_token)
    final_text = text_parts[0] + ''.join(tokenizer.decode([token]) + text_parts[i + 1]
                                         for i, token in enumerate(top_tokens))

    return final_text

text = "The protagonist's journey was <|mask|>, filled with <|mask|> obstacles that made the ending feel <|mask|>."
final_text = predict_masked_sentence(text)
print(final_text)
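
When a top-1 prediction looks off, inspecting the runner-up candidates can help. A hypothetical variation, placed inside predict_masked_sentence after token_logits and mask_token_indices are computed, would print the top-5 candidates for each mask:

# Hypothetical debugging addition: show the top-5 candidates per mask position.
for idx in mask_token_indices:
    top5 = torch.topk(token_logits[0, idx, :], 5).indices
    print([tokenizer.decode([t]) for t in top5])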

Libraries Used

Library        Version
-------------  -----------
datasets       3.3.1
transformers   4.49.0
evaluate       0.4.3
accelerate     1.2.1
torch          2.5.1+cu121
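
To approximate this environment, the pinned versions can be installed with pip. Note this is a sketch: the +cu121 torch build is served from the PyTorch CUDA wheel index rather than plain PyPI.

pip install datasets==3.3.1 transformers==4.49.0 evaluate==0.4.3 accelerate==1.2.1
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu121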

Model Limitations

  • The model is trained only for masked language modeling and is not directly suited to other NLP tasks (e.g., classification) without additional fine-tuning.
  • Perplexity was still falling at the final epoch (8.12 after epoch 2), so further training or hyperparameter tuning could likely improve performance.
  • Predictions reflect the style and vocabulary of IMDb movie reviews and may transfer poorly to other domains.

Citation

If you use this model, please cite:

@misc{EuroBERT-210m-finetuned-imdb,
  author = {Milan Velinovski},
  title = {EuroBERT-210m-finetuned-imdb},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/milanvelinovski/EuroBERT-210m-finetuned-imdb}
}