Model Card for EuroBERT-210m-finetuned-imdb

Model Overview

  • Model Name: EuroBERT-210m-finetuned-imdb
  • Base Model: EuroBERT-210m
  • Fine-tuned On: IMDb dataset
  • Task: Masked Language Modeling (MLM)
  • Training Objective: Minimize masked-token cross-entropy (tracked as perplexity)

Dataset Details

  • Dataset Used: IMDb
  • Dataset Version: Default configuration from the datasets library
  • Dataset Source: Hugging Face datasets
  • Training Split: train
  • Evaluation Split: test
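
For reference, the splits above can be loaded with the datasets library. This is a minimal sketch; IMDb also ships an "unsupervised" split that is not listed above.

from datasets import load_dataset

# Load IMDb from the Hugging Face Hub (default configuration).
imdb = load_dataset("imdb")
print(imdb)  # DatasetDict with "train", "test", and "unsupervised" splits
print(imdb["train"][0]["text"][:100])  # first 100 characters of the first review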

Training & Evaluation

Training Process

  • The model was fine-tuned for three epochs using PyTorch and Hugging Face's transformers library.
  • The optimizer and learning-rate scheduler were set up and run under the accelerate framework; a sketch of this setup follows below.
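
The original training script is not included in this card. The following is a minimal sketch of an MLM fine-tuning loop in this style; the base checkpoint id (EuroBERT/EuroBERT-210m), batch size, learning rate, sequence length, and masking probability are illustrative assumptions, not the values used for this model, and the tokenizer is assumed to provide mask and padding tokens.

from datasets import load_dataset
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, get_scheduler)
from accelerate import Accelerator

base_checkpoint = "EuroBERT/EuroBERT-210m"  # assumed Hub id for the base model
tokenizer = AutoTokenizer.from_pretrained(base_checkpoint, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(base_checkpoint, trust_remote_code=True)

# Tokenize the IMDb training split; max_length here is illustrative.
dataset = load_dataset("imdb")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)
tokenized_train = dataset["train"].map(tokenize, batched=True,
                                       remove_columns=dataset["train"].column_names)

# The collator masks tokens on the fly (15% is its default probability).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
train_loader = DataLoader(tokenized_train, batch_size=32, shuffle=True,
                          collate_fn=collator)

optimizer = AdamW(model.parameters(), lr=5e-5)
num_epochs = 3
lr_scheduler = get_scheduler("linear", optimizer=optimizer, num_warmup_steps=0,
                             num_training_steps=num_epochs * len(train_loader))

# accelerate handles device placement and, if configured, distributed training.
accelerator = Accelerator()
model, optimizer, train_loader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_loader, lr_scheduler)

for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        loss = model(**batch).loss
        accelerator.backward(loss)
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()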

Evaluation Metrics

  • The model was evaluated after each epoch using perplexity (PPL) on the test set; a sketch of the computation follows this list.
  • Results:
    • Epoch 0: PPL = 12.63
    • Epoch 1: PPL = 9.35
    • Epoch 2: PPL = 8.12
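
The card does not include the evaluation code. Continuing from the training sketch above, perplexity would typically be computed as the exponential of the mean masked-LM loss over the test split (averaging per-batch losses is a common approximation of the exact token-weighted mean):

import math
import torch

# Tokenize the test split and reuse the masking collator so the evaluation
# loss matches the training objective.
tokenized_test = dataset["test"].map(tokenize, batched=True,
                                     remove_columns=dataset["test"].column_names)
eval_loader = accelerator.prepare(DataLoader(tokenized_test, batch_size=32,
                                             collate_fn=collator))

model.eval()
losses = []
for batch in eval_loader:
    with torch.no_grad():
        losses.append(model(**batch).loss.item())

# Perplexity is the exponential of the mean cross-entropy loss.
print(f"PPL: {math.exp(sum(losses) / len(losses)):.2f}")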

Model Usage

Inference

The model can be used for masked token prediction using the following script:

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

def predict_masked_sentence(sentence, mask_token="<|mask|>"):
    """
    Predicts the top-1 token for each mask token in a sentence and returns the reconstructed text.

    Args:
        sentence (str): Input sentence containing mask tokens (e.g., "The movie was <|mask|>!").
        mask_token (str, optional): Token used as the mask in the input sentence. Defaults to "<|mask|>".

    Returns:
        str: Sentence with all mask tokens replaced by their top-1 predictions.
    """
    model_checkpoint = "milanvelinovski/EuroBERT-210m-finetuned-imdb"
    model = AutoModelForMaskedLM.from_pretrained(model_checkpoint, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, trust_remote_code=True)
    model.eval()

    # Map a custom mask marker onto the model's own mask token before tokenizing.
    sentence_with_model_mask = sentence.replace(mask_token, tokenizer.mask_token)
    inputs = tokenizer(sentence_with_model_mask, return_tensors="pt")
    with torch.no_grad():
        token_logits = model(**inputs).logits

    # Locate every mask position and take the highest-scoring token at each one.
    mask_token_indices = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]
    top_tokens = [torch.topk(token_logits[0, idx, :], 1).indices.item() for idx in mask_token_indices]

    # Reassemble the sentence by splicing the decoded predictions between the
    # segments of the original text.
    text_parts = sentence.split(mask_token)
    final_text = text_parts[0] + ''.join(tokenizer.decode([token]) + text_parts[i + 1]
                                         for i, token in enumerate(top_tokens))

    return final_text

text = "The protagonist's journey was <|mask|>, filled with <|mask|> obstacles that made the ending feel <|mask|>."
final_text = predict_masked_sentence(text)
print(final_text)
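
When a top-1 prediction looks off, inspecting the runner-up candidates can help. A hypothetical variation, placed inside predict_masked_sentence after token_logits and mask_token_indices are computed, would print the top-5 candidates for each mask:

# Hypothetical debugging addition: show the top-5 candidates per mask position.
for idx in mask_token_indices:
    top5 = torch.topk(token_logits[0, idx, :], 5).indices
    print([tokenizer.decode([t]) for t in top5])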

Libraries Used

Library        Version
-------------  -----------
datasets       3.3.1
transformers   4.49.0
evaluate       0.4.3
accelerate     1.2.1
torch          2.5.1+cu121
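
To approximate this environment, the pinned versions can be installed with pip. Note this is a sketch: the +cu121 torch build is served from the PyTorch CUDA wheel index rather than plain PyPI.

pip install datasets==3.3.1 transformers==4.49.0 evaluate==0.4.3 accelerate==1.2.1
pip install torch==2.5.1 --index-url https://download.pytorch.org/whl/cu121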

Model Limitations

  • The model is trained only for masked language modeling and is not directly suited to other NLP tasks (e.g., classification) without additional fine-tuning.
  • Perplexity was still falling at the final epoch (8.12 after epoch 2), so further training or hyperparameter tuning could likely improve performance.
  • Predictions reflect the style and vocabulary of IMDb movie reviews and may transfer poorly to other domains.

Citation

If you use this model, please cite:

@misc{EuroBERT-210m-finetuned-imdb,
  author = {Milan Velinovski},
  title = {EuroBERT-210m-finetuned-imdb},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/milanvelinovski/EuroBERT-210m-finetuned-imdb}
}