Model Card for Image-Captioning-BLIP (Fine‑Tuned BLIP for Image Captioning)
This repository provides a lightweight, pragmatic fine‑tuning and evaluation pipeline around Salesforce BLIP for image captioning, with sane defaults and a tiny, production‑friendly inference helper. Use it to fine‑tune `Salesforce/blip-image-captioning-base` on Flickr8k or COCO‑Karpathy and export artifacts you can push to the Hugging Face Hub.
TL;DR: End‑to‑end train → evaluate → export → caption images with a few commands. Defaults: BLIP‑base (ViT‑B/16), Flickr8k, BLEU during training, COCO‑style metrics (CIDEr/METEOR/SPICE) after training.
Model Details
Model Description
This project fine‑tunes BLIP (Bootstrapping Language‑Image Pre-training) for the image‑to‑text task. BLIP couples a ViT visual encoder with a text decoder for conditional generation and uses a bootstrapped captioning strategy during pretraining in the original work. Here, we re‑use the open BlipForConditionalGeneration
weights and processor and adapt them to caption everyday photographs from Flickr8k or the COCO Karpathy split.
- Developed by: Amirhossein Yousefi
- Shared by: Amirhossein Yousefi
- Model type: Vision–language encoder–decoder (BLIP base; ViT‑B/16 vision encoder + text decoder)
- Language(s) (NLP): English
- License: BSD‑3‑Clause (inherits from the base model’s license; ensure your own dataset/weight licensing is compatible)
- Fine‑tuned from model: `Salesforce/blip-image-captioning-base`
Model Sources
- Repository: https://github.com/amirhossein-yousefi/Image-Captioning-BLIP
- Paper: BLIP: Bootstrapping Language‑Image Pre‑training (arXiv:2201.12086), https://arxiv.org/abs/2201.12086
- Demo: see the usage examples in the base model card on the Hub (PyTorch snippets)
Uses
Direct Use
- Generate concise alt‑text‑style captions for photos.
- Zero‑shot captioning with the base checkpoint, or improved fidelity after fine‑tuning on your target dataset.
- Batch/offline captioning for indexing, search, and accessibility workflows.
Downstream Use
- Warm‑start other captioners or retrieval models by using generated captions as weak labels.
- Build dataset bootstrapping pipelines (e.g., pseudo‑labels for new domains).
- Use as a component in multi‑modal applications (e.g., visual content tagging, basic scene summaries).
Out-of-Scope Use
- High‑stakes or safety‑critical settings (medical, legal, surveillance).
- Factual description of specialized imagery (e.g., diagrams, medical scans) without domain‑specific fine‑tuning.
- Content moderation, protected‑attribute inference, or demographic classification.
Bias, Risks, and Limitations
- Data bias: Flickr8k/COCO contain Western‑centric scenes and captions; captions may reflect annotator bias or stereotypes.
- Language coverage: Training here targets English only; captions for non‑English content or localized entities may be poor.
- Hallucination: Like most captioners, BLIP can produce plausible but incorrect or over‑confident statements.
- Privacy: Avoid using on sensitive images or personally identifiable content without consent.
- IP & license: Ensure you have rights to your training/evaluation images and that your dataset use complies with its license.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model.
- Evaluate on a domain‑specific validation set before deployment.
- Use a safety filter/keyword blacklist or human review if captions are user‑facing.
- For specialized domains, continue fine‑tuning with in‑domain images and style prompts.
- When summarizing scenes, prefer beam search with moderate length penalties and enforce max lengths to curb rambling.
How to Get Started with the Model
Use the code below to get started with the model.
```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Replace with your fine-tuned repo once pushed, e.g. "amirhossein-yousefi/blip-captioning-flickr8k"
MODEL_ID = "Salesforce/blip-image-captioning-base"

processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)

image = Image.open("example.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt")

out = model.generate(**inputs, max_new_tokens=30, num_beams=5, length_penalty=1.0, early_stopping=True)
print(processor.decode(out[0], skip_special_tokens=True))
```
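For the batch/offline workflows mentioned under Direct Use, a minimal sketch that captions a folder of images in batches; the folder path and batch size are illustrative:

```python
import glob
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

MODEL_ID = "Salesforce/blip-image-captioning-base"  # or your fine-tuned checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID).to(device).eval()

paths = sorted(glob.glob("images/*.jpg"))  # illustrative folder
batch_size = 8                             # illustrative batch size
for i in range(0, len(paths), batch_size):
    chunk = paths[i : i + batch_size]
    images = [Image.open(p).convert("RGB") for p in chunk]
    inputs = processor(images=images, return_tensors="pt").to(device)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=30, num_beams=5)
    captions = processor.batch_decode(out, skip_special_tokens=True)
    for path, caption in zip(chunk, captions):
        print(path, "->", caption)
```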
Training Details
Training Data
Two common options are wired in:
- Flickr8k (`ariG23498/flickr8k`) — 8k images with 5 captions each. Default split in this repo: 90% train / 5% val / 5% test (deterministic by seed); see the loading sketch below.
- COCO‑Karpathy (`yerevann/coco-karpathy`) — community‑prepared Karpathy splits for COCO captions.
⚠️ Always verify dataset licenses and usage terms before training or publishing models derived from them.
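A hedged sketch of loading the Flickr8k option and producing a deterministic 90/5/5 split; the split name, seed, and exact split mechanics in the repository's scripts may differ:

```python
from datasets import load_dataset

# Flickr8k as hosted on the Hub (8k images, 5 captions each); "train" split name is an assumption
ds = load_dataset("ariG23498/flickr8k", split="train")

# Deterministic 90 / 5 / 5 split by seed (illustrative; the repo's split logic may differ)
split = ds.train_test_split(test_size=0.10, seed=42)
holdout = split["test"].train_test_split(test_size=0.50, seed=42)
train_ds, val_ds, test_ds = split["train"], holdout["train"], holdout["test"]
print(len(train_ds), len(val_ds), len(test_ds))
```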
Training Procedure
This project uses the Hugging Face Trainer with a custom collator; `BlipProcessor` handles both image and text preprocessing, and labels are padded to `-100` for loss masking.
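The repository's collator is not reproduced here; the following is a minimal sketch of the idea just described, assuming examples expose `image` and `caption` fields (the field names are illustrative):

```python
import torch


class BlipCaptionCollator:
    """Illustrative collator: the processor encodes images + text,
    and padded label positions are masked with -100 so the loss ignores them."""

    def __init__(self, processor, max_txt_len=40):
        self.processor = processor
        self.max_txt_len = max_txt_len

    def __call__(self, batch):
        images = [example["image"] for example in batch]      # field names are illustrative
        captions = [example["caption"] for example in batch]
        enc = self.processor(
            images=images,
            text=captions,
            padding="longest",
            truncation=True,
            max_length=self.max_txt_len,
            return_tensors="pt",
        )
        labels = enc["input_ids"].clone()
        labels[labels == self.processor.tokenizer.pad_token_id] = -100
        enc["labels"] = labels
        return enc
```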
Preprocessing
- Images and text are preprocessed by `BlipProcessor`, consistent with BLIP defaults (resize/normalize/tokenize).
- Optional vision encoder freezing is supported for parameter‑efficient fine‑tuning (see the sketch below).
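A minimal sketch of the optional freeze; in `transformers`, `BlipForConditionalGeneration` exposes the ViT backbone as `vision_model`:

```python
from transformers import BlipForConditionalGeneration

model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Freeze the ViT backbone so only the text decoder (and cross-attention) gets updated
for param in model.vision_model.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable / 1e6:.1f}M")
```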
Training Hyperparameters (defaults)
- Epochs: `4`
- Learning rate: `5e-5`
- Per‑device batch size: `8` (train & eval)
- Gradient accumulation: `2`
- Gradient checkpointing: `True`
- Freeze vision encoder: `False` (set `True` for low‑VRAM setups)
- Logging: every `50` steps; keep `2` checkpoints
- Model selection: best `sacrebleu`
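A hedged sketch of how these defaults could be expressed as Hugging Face `TrainingArguments`; the repository's actual argument names and evaluation/save cadence may differ:

```python
from transformers import TrainingArguments

# Illustrative mapping of the listed defaults; per-epoch eval/save cadence is an assumption.
training_args = TrainingArguments(
    output_dir="blip-open-out",
    num_train_epochs=4,
    learning_rate=5e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    logging_steps=50,
    save_total_limit=2,
    eval_strategy="epoch",           # named evaluation_strategy in older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="sacrebleu",
    greater_is_better=True,
)
```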
Generation (eval/inference defaults)
- `max_txt_len = 40`
- `gen_max_new_tokens = 30`
- `num_beams = 5`
- `length_penalty = 1.0`
- `early_stopping = True`
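At inference time, `gen_max_new_tokens`, `num_beams`, `length_penalty`, and `early_stopping` map directly onto Hugging Face generation settings (`max_txt_len` governs caption tokenization, not generation). A minimal sketch of packaging them as a `GenerationConfig` so they travel with an exported checkpoint:

```python
from transformers import GenerationConfig

# Mirror the evaluation/inference defaults listed above
gen_config = GenerationConfig(
    max_new_tokens=30,
    num_beams=5,
    length_penalty=1.0,
    early_stopping=True,
)

# Attach to a loaded BlipForConditionalGeneration before saving/pushing,
# so later calls to model.generate(**inputs) pick these up by default:
# model.generation_config = gen_config
# model.save_pretrained("blip-open-out")
```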
Speeds, Sizes, Times
- Single 16 GB GPU is typically sufficient for BLIP‑base with the defaults (gradient checkpointing enabled).
- If VRAM is tight: freeze the vision encoder, lower the batch size, and/or increase gradient accumulation.
Evaluation
Testing Data, Factors & Metrics
- Data: Validation split of the chosen dataset (Flickr8k or COCO‑Karpathy).
- Metrics: BLEU‑4 (during training), and post‑training COCO‑style metrics: CIDEr, METEOR, SPICE.
- Notes: SPICE requires Java and can be slow; you can disable or subsample via config.
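For the training-time BLEU, a minimal sketch using the `evaluate` library's `sacrebleu` metric (the post-training CIDEr/METEOR/SPICE numbers come from pycocoevalcap and are not reproduced here); the example captions are illustrative:

```python
import evaluate

sacrebleu = evaluate.load("sacrebleu")

predictions = ["a dog runs across a grassy field"]      # decoded model outputs (illustrative)
references = [[                                         # each image has multiple reference captions
    "a dog is running through the grass",
    "a brown dog runs in a field",
]]
result = sacrebleu.compute(predictions=predictions, references=references)
print(round(result["score"], 2))  # sacrebleu reports BLEU on a 0-100 scale
```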
Results
After training, a compact JSON with COCO metrics is written to `blip-open-out/coco_metrics.json`.
🏆 Results (Test Split)
| Metric | Score |
|---|---|
| BLEU‑4 | 0.9708 |
| METEOR | 0.7888 |
| CIDEr | 9.3330 |
| SPICE | — |
Raw JSON:

```json
{
  "Bleu_4": 0.9707865195383757,
  "METEOR": 0.7887653835397767,
  "CIDEr": 9.332990983959254,
  "SPICE": null
}
```
Summary
- Expect strongest results when fine‑tuning on in‑domain imagery and using beam search at inference time.
Model Examination
- Inspect failure cases: cluttered scenes, occlusions, specialized objects, or images with embedded text.
- Run qualitative sweeps by toggling beam size and length penalties to see style/verbosity changes.
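A small sketch of such a sweep; the beam sizes, length penalties, and image path are illustrative:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

MODEL_ID = "Salesforce/blip-image-captioning-base"  # or your fine-tuned checkpoint
processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)
inputs = processor(Image.open("example.jpg").convert("RGB"), return_tensors="pt")

# Sweep beam size and length penalty to compare caption style and verbosity
for num_beams in (3, 5, 7):
    for length_penalty in (0.6, 1.0, 1.4):
        out = model.generate(
            **inputs,
            max_new_tokens=30,
            num_beams=num_beams,
            length_penalty=length_penalty,
            early_stopping=True,
        )
        caption = processor.decode(out[0], skip_special_tokens=True)
        print(f"beams={num_beams} len_pen={length_penalty}: {caption}")
```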
Environmental Impact
Estimate using the ML CO2 Impact calculator; fill in the values observed for your runs:
- Hardware Type: (e.g., 1× NVIDIA T4 / A10 / A100)
- Hours used: (e.g., 3.2 h for 4 epochs on Flickr8k)
- Cloud Provider: (e.g., AWS on SageMaker, optional)
- Compute Region: (e.g., us‑west‑2)
- Carbon Emitted: (estimated grams of CO₂eq)
Technical Specifications
Model Architecture and Objective
- Architecture: BLIP encoder–decoder; ViT‑B/16 vision backbone with a text decoder for conditional caption generation.
- Objective: Cross‑entropy on tokenized captions, with padded label positions masked to `-100`, relying on the BLIP processor's preprocessing.
Compute Infrastructure
Hardware
- Trains comfortably on one 16 GB GPU (defaults).
Software
- Python 3.9+, PyTorch, Transformers, Datasets, evaluate, sacrebleu, optional pycocotools/pycocoevalcap (for CIDEr/METEOR/SPICE).
- Optional AWS SageMaker entry points are included for managed training and inference.
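If the optional SageMaker path is used, launching a managed training job typically goes through the SageMaker Python SDK's Hugging Face estimator. A hedged sketch only: the entry-point script name, IAM role, instance type, and framework version pins below are placeholders and should be taken from the repository's SageMaker code:

```python
from sagemaker.huggingface import HuggingFace

estimator = HuggingFace(
    entry_point="train.py",            # placeholder: use the repo's actual SageMaker entry point
    source_dir=".",
    role="arn:aws:iam::<account-id>:role/<sagemaker-role>",  # placeholder IAM role
    instance_type="ml.g5.xlarge",      # any single GPU instance with >= 16 GB VRAM
    instance_count=1,
    transformers_version="4.36",       # placeholder version pins; match an available DLC
    pytorch_version="2.1",
    py_version="py310",
    hyperparameters={"epochs": 4, "learning_rate": 5e-5},  # illustrative
)
estimator.fit()
```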