Model Card for Image-Captioning-BLIP (Fine‑Tuned BLIP for Image Captioning)
This repository provides a lightweight, pragmatic fine‑tuning and evaluation pipeline around Salesforce BLIP for image captioning, with sane defaults and a tiny, production‑friendly inference helper. Use it to fine‑tune `Salesforce/blip-image-captioning-base` on Flickr8k or COCO‑Karpathy and export artifacts you can push to the Hugging Face Hub.
TL;DR: End‑to‑end train → evaluate → export → caption images with a few commands. Defaults: BLIP‑base (ViT‑B/16), Flickr8k, BLEU during training, COCO‑style metrics (CIDEr/METEOR/SPICE) after training.
Model Details
Model Description
This project fine‑tunes BLIP (Bootstrapping Language‑Image Pre-training) for the image‑to‑text task. BLIP couples a ViT visual encoder with a text decoder for conditional generation and uses a bootstrapped captioning strategy during pretraining in the original work. Here, we re‑use the open BlipForConditionalGeneration
weights and processor and adapt them to caption everyday photographs from Flickr8k or the COCO Karpathy split.
- Developed by: Amirhossein Yousefi
- Shared by: Amirhossein Yousefi
- Model type: Vision–language encoder–decoder (BLIP base; ViT‑B/16 vision encoder + text decoder)
- Language(s) (NLP): English
- License: BSD‑3‑Clause (inherits from the base model’s license; ensure your own dataset/weight licensing is compatible)
- Fine‑tuned from model: `Salesforce/blip-image-captioning-base`
Model Sources
- Repository: https://github.com/amirhossein-yousefi/Image-Captioning-BLIP
- Paper: BLIP: Bootstrapping Language‑Image Pre‑training (arXiv:2201.12086), https://arxiv.org/abs/2201.12086
- Demo: see the usage examples in the base model card on the Hub (PyTorch snippets)
Uses
Direct Use
- Generate concise alt‑text‑style captions for photos.
- Zero‑shot captioning with the base checkpoint, or improved fidelity after fine‑tuning on your target dataset.
- Batch/offline captioning for indexing, search, and accessibility workflows.
Downstream Use
- Warm‑start other captioners or retrieval models by using generated captions as weak labels.
- Build dataset bootstrapping pipelines (e.g., pseudo‑labels for new domains).
- Use as a component in multi‑modal applications (e.g., visual content tagging, basic scene summaries).
Out-of-Scope Use
- High‑stakes or safety‑critical settings (medical, legal, surveillance).
- Factual description of specialized imagery (e.g., diagrams, medical scans) without domain‑specific fine‑tuning.
- Content moderation, protected‑attribute inference, or demographic classification.
Bias, Risks, and Limitations
- Data bias: Flickr8k/COCO contain Western‑centric scenes and captions; captions may reflect annotator bias or stereotypes.
- Language coverage: Training here targets English only; captions for non‑English content or localized entities may be poor.
- Hallucination: Like most captioners, BLIP can produce plausible but incorrect or over‑confident statements.
- Privacy: Avoid using on sensitive images or personally identifiable content without consent.
- IP & license: Ensure you have rights to your training/evaluation images and that your dataset use complies with its license.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
- Evaluate on a domain‑specific validation set before deployment.
- Use a safety filter/keyword blacklist or human review if captions are user‑facing.
- For specialized domains, continue fine‑tuning with in‑domain images and style prompts.
- When summarizing scenes, prefer beam search with moderate length penalties and enforce max lengths to curb rambling.
How to Get Started with the Model
Use the code below to get started with the model.
```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Replace with your fine-tuned repo once pushed, e.g. "amirhossein-yousefi/blip-captioning-flickr8k"
MODEL_ID = "Salesforce/blip-image-captioning-base"

processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)

image = Image.open("example.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=30, num_beams=5, length_penalty=1.0, early_stopping=True)
print(processor.decode(out[0], skip_special_tokens=True))
```
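For the batch/offline workflows mentioned under Direct Use, a minimal sketch that captions a folder of images in batches; the folder path and batch size are illustrative:

```python
import glob
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

MODEL_ID = "Salesforce/blip-image-captioning-base"  # or your fine-tuned checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID).to(device).eval()

paths = sorted(glob.glob("images/*.jpg"))  # illustrative folder
batch_size = 8                             # illustrative batch size
for i in range(0, len(paths), batch_size):
    chunk = paths[i : i + batch_size]
    images = [Image.open(p).convert("RGB") for p in chunk]
    inputs = processor(images=images, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=30, num_beams=5)
    captions = processor.batch_decode(out, skip_special_tokens=True)
    for path, caption in zip(chunk, captions):
        print(path, "->", caption)
```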
Training Details
Training Data
Two common options are wired in:
- Flickr8k (`ariG23498/flickr8k`) — 8k images with 5 captions each. Default split in this repo: 90% train / 5% val / 5% test (deterministic by seed); see the loading sketch below.
- COCO‑Karpathy (`yerevann/coco-karpathy`) — community‑prepared Karpathy splits for COCO captions.
⚠️ Always verify dataset licenses and usage terms before training or publishing models derived from them.
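A hedged sketch of loading the Flickr8k option and producing a deterministic 90/5/5 split; the split name, seed, and exact split mechanics in the repository's scripts may differ:

```python
from datasets import load_dataset

# Flickr8k as hosted on the Hub (8k images, 5 captions each); "train" split name is an assumption
ds = load_dataset("ariG23498/flickr8k", split="train")

# Deterministic 90 / 5 / 5 split by seed (illustrative; the repo's split logic may differ)
split = ds.train_test_split(test_size=0.10, seed=42)
holdout = split["test"].train_test_split(test_size=0.50, seed=42)
train_ds, val_ds, test_ds = split["train"], holdout["train"], holdout["test"]
print(len(train_ds), len(val_ds), len(test_ds))
```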
Training Procedure
This project uses the Hugging Face Trainer with a custom collator; `BlipProcessor` handles both image and text preprocessing, and labels are padded to `-100` for loss masking.
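The repository's collator is not reproduced here; the following is a minimal sketch of the idea just described, assuming examples expose `image` and `caption` fields (the field names are illustrative):

```python
import torch


class BlipCaptionCollator:
    """Illustrative collator: the processor encodes images + text,
    and padded label positions are masked with -100 so the loss ignores them."""

    def __init__(self, processor, max_txt_len=40):
        self.processor = processor
        self.max_txt_len = max_txt_len

    def __call__(self, batch):
        images = [example["image"] for example in batch]      # field names are illustrative
        captions = [example["caption"] for example in batch]
        enc = self.processor(
            images=images,
            text=captions,
            padding="longest",
            truncation=True,
            max_length=self.max_txt_len,
            return_tensors="pt",
        )
        labels = enc["input_ids"].clone()
        labels[labels == self.processor.tokenizer.pad_token_id] = -100
        enc["labels"] = labels
        return enc
```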
Preprocessing
- Images and text are preprocessed by `BlipProcessor`, consistent with BLIP defaults (resize/normalize/tokenize).
- Optional vision encoder freezing is supported for parameter‑efficient fine‑tuning (see the sketch below).
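A minimal sketch of the optional freeze; in `transformers`, `BlipForConditionalGeneration` exposes the ViT backbone as `vision_model`:

```python
from transformers import BlipForConditionalGeneration

model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Freeze the ViT backbone so only the text decoder (and cross-attention) gets updated
for param in model.vision_model.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable / 1e6:.1f}M")
```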
Training Hyperparameters (defaults)
- Epochs: `4`
- Learning rate: `5e-5`
- Per‑device batch size: `8` (train & eval)
- Gradient accumulation: `2`
- Gradient checkpointing: `True`
- Freeze vision encoder: `False` (set `True` for low‑VRAM setups)
- Logging: every `50` steps; keep `2` checkpoints
- Model selection: best `sacrebleu`
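A hedged sketch of how these defaults could be expressed as Hugging Face `TrainingArguments`; the repository's actual argument names and evaluation/save cadence may differ:

```python
from transformers import TrainingArguments

# Illustrative mapping of the listed defaults; per-epoch eval/save cadence is an assumption.
training_args = TrainingArguments(
    output_dir="blip-open-out",
    num_train_epochs=4,
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    logging_steps=50,
    save_total_limit=2,
    eval_strategy="epoch",           # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="sacrebleu",
    greater_is_better=True,
)
```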
Generation (eval/inference defaults)
- `max_txt_len = 40`
- `gen_max_new_tokens = 30`
- `num_beams = 5`
- `length_penalty = 1.0`
- `early_stopping = True`
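At inference time, `gen_max_new_tokens`, `num_beams`, `length_penalty`, and `early_stopping` map directly onto Hugging Face generation settings (`max_txt_len` governs caption tokenization, not generation). A minimal sketch of packaging them as a `GenerationConfig` so they travel with an exported checkpoint:

```python
from transformers import GenerationConfig

# Mirror the evaluation/inference defaults listed above
gen_config = GenerationConfig(
    max_new_tokens=30,
    num_beams=5,
    length_penalty=1.0,
    early_stopping=True,
)

# Attach to a loaded BlipForConditionalGeneration before saving/pushing,
# so later calls to model.generate(**inputs) pick these up by default:
# model.generation_config = gen_config
# model.save_pretrained("blip-open-out")
```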
Speeds, Sizes, Times
- Single 16 GB GPU is typically sufficient for BLIP‑base with the defaults (gradient checkpointing enabled).
- If VRAM is tight: freeze the vision encoder, lower the batch size, and/or increase gradient accumulation.
Evaluation
Testing Data, Factors & Metrics
- Data: Validation split of the chosen dataset (Flickr8k or COCO‑Karpathy).
- Metrics: BLEU‑4 (during training), and post‑training COCO‑style metrics: CIDEr, METEOR, SPICE.
- Notes: SPICE requires Java and can be slow; you can disable or subsample via config.
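For the training-time BLEU, a minimal sketch using the `evaluate` library's `sacrebleu` metric (the post-training CIDEr/METEOR/SPICE numbers come from pycocoevalcap and are not reproduced here); the example captions are illustrative:

```python
import evaluate

sacrebleu = evaluate.load("sacrebleu")

predictions = ["a dog runs across a grassy field"]      # decoded model outputs (illustrative)
references = [[                                         # each image has multiple reference captions
    "a dog is running through the grass",
    "a brown dog runs in a field",
]]
result = sacrebleu.compute(predictions=predictions, references=references)
print(round(result["score"], 2))  # sacrebleu reports BLEU on a 0-100 scale
```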
Results
After training, a compact JSON with COCO metrics is written to `blip-open-out/coco_metrics.json`.
🏆 Results (Test Split)
| Metric | Score |
|---|---|
| BLEU‑4 | 0.9708 |
| METEOR | 0.7888 |
| CIDEr | 9.3330 |
| SPICE | — |
Raw JSON:

```json
{
  "Bleu_4": 0.9707865195383757,
  "METEOR": 0.7887653835397767,
  "CIDEr": 9.332990983959254,
  "SPICE": null
}
```
Summary
- Expect strongest results when fine‑tuning on in‑domain imagery and using beam search at inference time.
Model Examination
- Inspect failure cases: cluttered scenes, occlusions, specialized objects, or images with embedded text.
- Run qualitative sweeps by toggling beam size and length penalties to see style/verbosity changes.
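A small sketch of such a sweep; the beam sizes, length penalties, and image path are illustrative:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

MODEL_ID = "Salesforce/blip-image-captioning-base"  # or your fine-tuned checkpoint
processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)
inputs = processor(Image.open("example.jpg").convert("RGB"), return_tensors="pt")

# Sweep beam size and length penalty to compare caption style and verbosity
for num_beams in (3, 5, 7):
    for length_penalty in (0.6, 1.0, 1.4):
        out = model.generate(
            **inputs,
            max_new_tokens=30,
            num_beams=num_beams,
            length_penalty=length_penalty,
            early_stopping=True,
        )
        caption = processor.decode(out[0], skip_special_tokens=True)
        print(f"beams={num_beams} len_pen={length_penalty}: {caption}")
```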
Environmental Impact
Estimate using the ML CO2 Impact calculator; fill in the values observed for your runs:
- Hardware Type: (e.g., 1× NVIDIA T4 / A10 / A100)
- Hours used: (e.g., 3.2 h for 4 epochs on Flickr8k)
- Cloud Provider: (e.g., AWS on SageMaker, optional)
- Compute Region: (e.g., us‑west‑2)
- Carbon Emitted: (estimated grams of CO₂eq)
Technical Specifications
Model Architecture and Objective
- Architecture: BLIP encoder–decoder; ViT‑B/16 vision backbone with a text decoder for conditional caption generation.
- Objective: Cross‑entropy on tokenized captions, with padded label positions masked to `-100`, relying on the BLIP processor's preprocessing.
Compute Infrastructure
Hardware
- Trains comfortably on one 16 GB GPU (defaults).
Software
- Python 3.9+, PyTorch, Transformers, Datasets, evaluate, sacrebleu, optional pycocotools/pycocoevalcap (for CIDEr/METEOR/SPICE).
- Optional AWS SageMaker entry points are included for managed training and inference.
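If the optional SageMaker path is used, launching a managed training job typically goes through the SageMaker Python SDK's Hugging Face estimator. A hedged sketch only: the entry-point script name, IAM role, instance type, and framework version pins below are placeholders and should be taken from the repository's SageMaker code:

```python
from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point="train.py",            # placeholder: use the repo's actual SageMaker entry point
    source_dir=".",
    role="arn:aws:iam::<account-id>:role/<sagemaker-role>",  # placeholder IAM role
    instance_type="ml.g5.xlarge",      # any single GPU instance with >= 16 GB VRAM
    instance_count=1,
    transformers_version="4.36",       # placeholder version pins; match an available DLC
    pytorch_version="2.1",
    py_version="py310",
    hyperparameters={"epochs": 4, "learning_rate": 5e-5},  # illustrative
)
estimator.fit()
```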