BLIP-base fine-tuned for Image Captioning on High-Level descriptions of Rationales

BLIP base trained on the HL dataset for rationale generation of images

Model fine-tuning πŸ‹οΈβ€

  • Trained for of 6 epochs
  • lr: 5eβˆ’5
  • Adam optimizer
  • half-precision (fp16)

Test set metrics 🧾

| Cider  | SacreBLEU  | Rouge-L |
|--------|------------|---------|
| 46.11  |    6.21    |  19.74  |

Model in Action πŸš€

import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("michelecafagna26/blip-base-captioning-ft-hl-rationales")
model = BlipForConditionalGeneration.from_pretrained("michelecafagna26/blip-base-captioning-ft-hl-rationales").to("cuda")

img_url = 'https://datasets-server.huggingface.co/assets/michelecafagna26/hl/--/default/train/0/image/image.jpg' 
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')


inputs = processor(raw_image, return_tensors="pt").to("cuda")
pixel_values = inputs.pixel_values

generated_ids = model.generate(pixel_values=pixel_values, max_length=50,
            do_sample=True,
            top_k=120,
            top_p=0.9,
            early_stopping=True,
            num_return_sequences=1)

processor.batch_decode(generated_ids, skip_special_tokens=True)

>>> "she is on vacation."

BibTex and citation info

@inproceedings{cafagna2023hl,
  title={{HL} {D}ataset: {V}isually-grounded {D}escription of {S}cenes, {A}ctions and
{R}ationales},
  author={Cafagna, Michele and van Deemter, Kees and Gatt, Albert},
  booktitle={Proceedings of the 16th International Natural Language Generation Conference (INLG'23)},
address = {Prague, Czech Republic},
  year={2023}
}
Downloads last month
143
Safetensors
Model size
247M params
Tensor type
I64
Β·
F32
Β·
Inference Providers NEW
This model is not currently available via any of the supported Inference Providers.

Dataset used to train michelecafagna26/blip-base-captioning-ft-hl-rationales