Model Card for Image-Captioning-BLIP (Fine‑Tuned BLIP for Image Captioning)

This repository provides a lightweight, pragmatic fine‑tuning and evaluation pipeline around Salesforce BLIP for image captioning, with sane defaults and a tiny, production‑friendly inference helper. Use it to fine‑tune Salesforce/blip-image-captioning-base on Flickr8k or COCO‑Karpathy and export artifacts you can push to the Hugging Face Hub.

TL;DR: End‑to‑end train → evaluate → export → caption images with a few commands. Defaults: BLIP‑base (ViT‑B/16), Flickr8k, BLEU during training, COCO‑style metrics (CIDEr/METEOR/SPICE) after training.

Model Details

Model Description

This project fine‑tunes BLIP (Bootstrapping Language‑Image Pre‑training) for the image‑to‑text task. BLIP couples a ViT visual encoder with a text decoder for conditional generation and, in the original work, uses a bootstrapped captioning strategy during pretraining. Here, we reuse the open BlipForConditionalGeneration weights and processor and adapt them to caption everyday photographs from Flickr8k or the COCO Karpathy split.

  • Developed by: Amirhossein Yousefi
  • Shared by: Amirhossein Yousefi
  • Model type: Vision–language encoder–decoder (BLIP base; ViT‑B/16 vision encoder + text decoder)
  • Language(s) (NLP): English
  • License: BSD‑3‑Clause (inherits from the base model’s license; ensure your own dataset/weight licensing is compatible)
  • Finetuned from model: Salesforce/blip-image-captioning-base

Model Sources

  • Base model: https://huggingface.co/Salesforce/blip-image-captioning-base
  • Paper: BLIP: Bootstrapping Language‑Image Pre‑training for Unified Vision‑Language Understanding and Generation (arXiv:2201.12086)

Uses

Direct Use

  • Generate concise alt‑text‑style captions for photos.
  • Zero‑shot captioning with the base checkpoint, or improved fidelity after fine‑tuning on your target dataset.
  • Batch/offline captioning for indexing, search, and accessibility workflows.
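
For batch/offline use, a minimal sketch along these lines works; the input folder, batch size, and model id are placeholders (swap in your fine‑tuned checkpoint once pushed):

from pathlib import Path

import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

MODEL_ID = "Salesforce/blip-image-captioning-base"  # placeholder; use your fine-tuned repo id
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID).to(device).eval()

image_paths = sorted(Path("photos").glob("*.jpg"))  # hypothetical input folder
batch_size = 8

for start in range(0, len(image_paths), batch_size):
    paths = image_paths[start : start + batch_size]
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=30, num_beams=5)
    for p, caption in zip(paths, processor.batch_decode(out, skip_special_tokens=True)):
        print(f"{p.name}: {caption}")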

Downstream Use

  • Warm‑start other captioners or retrieval models by using generated captions as weak labels.
  • Build dataset bootstrapping pipelines (e.g., pseudo‑labels for new domains).
  • Use as a component in multi‑modal applications (e.g., visual content tagging, basic scene summaries).

Out-of-Scope Use

  • High‑stakes or safety‑critical settings (medical, legal, surveillance).
  • Factual description of specialized imagery (e.g., diagrams, medical scans) without domain‑specific fine‑tuning.
  • Content moderation, protected‑attribute inference, or demographic classification.

Bias, Risks, and Limitations

  • Data bias: Flickr8k/COCO contain Western‑centric scenes and captions; captions may reflect annotator bias or stereotypes.
  • Language coverage: Training here targets English only; captions for non‑English content or localized entities may be poor.
  • Hallucination: Like most captioners, BLIP can produce plausible but incorrect or over‑confident statements.
  • Privacy: Avoid using on sensitive images or personally identifiable content without consent.
  • IP & license: Ensure you have rights to your training/evaluation images and that your dataset use complies with its license.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.

  • Evaluate on a domain‑specific validation set before deployment.
  • Use a safety filter/keyword blacklist or human review if captions are user‑facing.
  • For specialized domains, continue fine‑tuning with in‑domain images and style prompts.
  • When summarizing scenes, prefer beam search with moderate length penalties and enforce max lengths to curb rambling.

How to Get Started with the Model

Use the code below to get started with the model.

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Replace with your fine-tuned repo once pushed, e.g. "amirhossein-yousefi/blip-captioning-flickr8k"
MODEL_ID = "Salesforce/blip-image-captioning-base"

processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)

image = Image.open("example.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30, num_beams=5, length_penalty=1.0, early_stopping=True)
print(processor.decode(out[0], skip_special_tokens=True))

Training Details

Training Data

Two common options are wired in:

  • Flickr8k (ariG23498/flickr8k) — 8k images with 5 captions each. Default split in this repo: 90% train / 5% val / 5% test (deterministic by seed).
  • COCO‑Karpathy (yerevann/coco-karpathy) — community‑prepared Karpathy splits for COCO captions.

⚠️ Always verify dataset licenses and usage terms before training or publishing models derived from them.
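
For illustration, a minimal sketch of producing the 90% / 5% / 5% Flickr8k split described above with the datasets library; the exact split logic and the seed value (42) are assumptions, not the repo's code:

from datasets import load_dataset

# Assumes the hub dataset exposes a single "train" split.
raw = load_dataset("ariG23498/flickr8k", split="train")

# Carve off 10% of the data, then halve it into validation and test; the fixed
# seed keeps the split deterministic across runs.
tmp = raw.train_test_split(test_size=0.10, seed=42)
heldout = tmp["test"].train_test_split(test_size=0.50, seed=42)
train_ds, val_ds, test_ds = tmp["train"], heldout["train"], heldout["test"]
print(len(train_ds), len(val_ds), len(test_ds))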

Training Procedure

This project uses the Hugging Face Trainer with a custom collator; BlipProcessor handles both image and text preprocessing, and labels are padded to -100 for loss masking.
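
A minimal collator sketch consistent with that description is shown below; the column names ("image", "caption") and the max caption length are assumptions rather than the repo's exact code:

class BlipCaptionCollator:
    """Pack a batch of (image, caption) examples for BlipForConditionalGeneration."""

    def __init__(self, processor, max_txt_len=40):
        self.processor = processor
        self.max_txt_len = max_txt_len

    def __call__(self, batch):
        images = [ex["image"].convert("RGB") for ex in batch]  # assumed column name
        captions = [ex["caption"] for ex in batch]             # assumed column name
        enc = self.processor(
            images=images,
            text=captions,
            padding="longest",
            truncation=True,
            max_length=self.max_txt_len,
            return_tensors="pt",
        )
        # Copy input_ids to labels and mask padding with -100 so those positions
        # are ignored by the cross-entropy loss.
        labels = enc["input_ids"].clone()
        labels[labels == self.processor.tokenizer.pad_token_id] = -100
        enc["labels"] = labels
        return enc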

Preprocessing

  • Images and text are preprocessed by BlipProcessor consistent with BLIP defaults (resize/normalize/tokenize).
  • Optional vision encoder freezing is supported for parameter‑efficient fine‑tuning.

Training Hyperparameters (defaults)

  • Epochs: 4
  • Learning rate: 5e-5
  • Per‑device batch size: 8 (train & eval)
  • Gradient accumulation: 2
  • Gradient checkpointing: True
  • Freeze vision encoder: False (set True for low‑VRAM setups)
  • Logging: every 50 steps; keep 2 checkpoints
  • Model selection: best sacrebleu
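
For reference, these defaults roughly map onto Hugging Face TrainingArguments as sketched below; the output directory matches the metrics path mentioned later, while the evaluation/save cadence is an assumption:

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="blip-open-out",
    num_train_epochs=4,
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    logging_steps=50,
    save_total_limit=2,
    eval_strategy="epoch",              # "evaluation_strategy" on older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="sacrebleu",  # assumes compute_metrics returns a "sacrebleu" key
)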

Generation (eval/inference defaults)

  • max_txt_len = 40, gen_max_new_tokens = 30, num_beams = 5, length_penalty = 1.0, early_stopping = True
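
One way to pin these decoding defaults onto a checkpoint is a GenerationConfig (illustrative; assumes the model and inputs from the quickstart snippet above):

from transformers import GenerationConfig

model.generation_config = GenerationConfig(
    max_new_tokens=30,
    num_beams=5,
    length_penalty=1.0,
    early_stopping=True,
)
out = model.generate(**inputs)  # now uses the defaults above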

Speeds, Sizes, Times

  • Single 16 GB GPU is typically sufficient for BLIP‑base with the defaults (gradient checkpointing enabled).
  • If VRAM is tight: freeze the vision encoder, lower the batch size, and/or increase gradient accumulation.
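
A sketch of the vision‑encoder freeze mentioned above (assumes model is a BlipForConditionalGeneration instance):

# Freeze the ViT vision encoder so only the text decoder receives gradient updates.
for param in model.vision_model.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} of {total:,}")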

Evaluation

Testing Data, Factors & Metrics

  • Data: Validation split of the chosen dataset (Flickr8k or COCO‑Karpathy).
  • Metrics: BLEU‑4 (during training), and post‑training COCO‑style metrics: CIDEr, METEOR, SPICE.
  • Notes: SPICE requires Java and can be slow; you can disable or subsample via config.
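
For reference, a minimal sketch of the training‑time BLEU computation with the evaluate library (the post‑training CIDEr/METEOR/SPICE numbers come from pycocoevalcap instead; the example strings are made up):

import evaluate

sacrebleu = evaluate.load("sacrebleu")

predictions = ["a dog runs across a grassy field"]  # model outputs
references = [[                                     # multiple references per image
    "a dog is running through the grass",
    "a brown dog runs in a field",
]]

result = sacrebleu.compute(predictions=predictions, references=references)
print(round(result["score"], 2))  # corpus-level BLEU on a 0-100 scale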

Results

After training, a compact JSON with COCO metrics is written to:

blip-open-out/coco_metrics.json

🏆 Results (Test Split)

Metric    Score
BLEU‑4    0.9708
METEOR    0.7888
CIDEr     9.3330
SPICE     — (not computed in this run)

Raw JSON:

{
  "Bleu_4": 0.9707865195383757,
  "METEOR": 0.7887653835397767,
  "CIDEr": 9.332990983959254,
  "SPICE": null
}

Summary

  • Expect strongest results when fine‑tuning on in‑domain imagery and using beam search at inference time.

Model Examination

  • Inspect failure cases: cluttered scenes, occlusions, specialized objects, or images with embedded text.
  • Run qualitative sweeps by toggling beam size and length penalties to see style/verbosity changes.
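
A small sweep along these lines can surface those differences (assumes model, processor, and inputs from the quickstart snippet above):

import itertools

# Compare captions for one image across beam sizes and length penalties.
for num_beams, length_penalty in itertools.product([3, 5], [0.6, 1.0, 1.4]):
    out = model.generate(
        **inputs,
        max_new_tokens=30,
        num_beams=num_beams,
        length_penalty=length_penalty,
        early_stopping=True,
    )
    caption = processor.decode(out[0], skip_special_tokens=True)
    print(f"beams={num_beams}, length_penalty={length_penalty}: {caption}")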

Environmental Impact

Estimate using the ML CO2 Impact calculator. Fill the values you observe for your runs:

  • Hardware Type: (e.g., 1× NVIDIA T4 / A10 / A100)
  • Hours used: (e.g., 3.2 h for 4 epochs on Flickr8k)
  • Cloud Provider: (e.g., AWS on SageMaker, optional)
  • Compute Region: (e.g., us‑west‑2)
  • Carbon Emitted: (estimated grams of CO₂eq)

Technical Specifications

Model Architecture and Objective

  • Architecture: BLIP encoder–decoder (~247M parameters); ViT‑B/16 vision backbone with a text decoder for conditional caption generation.
  • Objective: Cross‑entropy on tokenized captions with masked padding (-100), using the BLIP processor’s packing.

Compute Infrastructure

Hardware

  • Trains comfortably on one 16 GB GPU (defaults).

Software

  • Python 3.9+, PyTorch, Transformers, Datasets, evaluate, sacrebleu, optional pycocotools/pycocoevalcap (for CIDEr/METEOR/SPICE).
  • Optional AWS SageMaker entry points are included for managed training and inference.