---
base_model:
  - BeichenZhang/LongCLIP-B
datasets:
  - Lin-Chen/ShareGPT4V
language:
  - en
license: apache-2.0
pipeline_tag: zero-shot-image-classification
---

# SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation

This model is presented in the paper [SPECS: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation](https://arxiv.org/abs/2509.03897). The official code repository is available at https://github.com/mbzuai-nlp/SPECS.

## Abstract

As interest grows in generating long, detailed image captions, standard evaluation metrics become increasingly unreliable. N-gram-based metrics, though efficient, fail to capture semantic correctness. Representational Similarity (RS) metrics, designed to address this, initially saw limited use due to high computational costs; today, despite advances in hardware, they remain unpopular because of their weak correlation with human judgments. Meanwhile, metrics based on large language models (LLMs) show strong correlation with human judgments, but remain too expensive for iterative use during model development. We introduce SPECS (Specificity-Enhanced CLIPScore), a reference-free RS metric tailored to long image captioning. SPECS modifies CLIP with a new objective that emphasizes specificity: rewarding correct details and penalizing incorrect ones. We show that SPECS matches the performance of open-source LLM-based metrics in correlation with human judgments, while being far more efficient. This makes it a practical alternative for iterative checkpoint evaluation during image captioning model development.
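
The abstract mentions an objective that rewards correct details and penalizes incorrect ones; the exact loss is defined in the paper and the GitHub repository and is not reproduced here. As a purely illustrative sketch of the intuition behind such a specificity-oriented ranking (the function names, margin value, and hinge form below are assumptions, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def specs_score(image_features, text_features):
    # Rescale cosine similarity from [-1, 1] to [0, 1], as in the usage example below.
    sim = F.cosine_similarity(image_features, text_features, dim=-1)
    return torch.clamp((sim + 1.0) / 2.0, min=0.0)

def illustrative_specificity_ranking_loss(img_feat, cap_more_specific,
                                          cap_less_specific, cap_wrong_detail,
                                          margin=0.05):
    # Illustrative only: NOT the training objective from the SPECS paper.
    # Encourages: score(correct + extra correct detail) > score(correct, less specific)
    #             score(correct, less specific)         > score(caption with a wrong detail)
    s_hi = specs_score(img_feat, cap_more_specific)
    s_mid = specs_score(img_feat, cap_less_specific)
    s_lo = specs_score(img_feat, cap_wrong_detail)
    return (F.relu(margin - (s_hi - s_mid)) + F.relu(margin - (s_mid - s_lo))).mean()
```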

## Usage

You can compute SPECS scores for an image–caption pair using the following code:

```python
from PIL import Image
import torch
import torch.nn.functional as F
from model import longclip

# Device configuration
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load SPECS model
model, preprocess = longclip.load("spec.pt", device=device)
model.eval()

# Load image
image_path = "SPECS/images/cat.png"
image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)

# Define text descriptions
texts = [
    "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. "
    "The cat is partially tucked under a multi-colored woven jumper.",
    "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. "
    "The cat is partially tucked under a multi-colored woven blanket.",
    "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. "
    "The cat is partially tucked under a multi-colored woven blanket with fringed edges."
]

# Process inputs
text_tokens = longclip.tokenize(texts).to(device)

# Get features and calculate SPECS
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)
    
    # Calculate cosine similarity
    similarity = F.cosine_similarity(image_features.unsqueeze(1), text_features.unsqueeze(0), dim=-1)
    
    # SPECS
    specs_scores = torch.clamp((similarity + 1.0) / 2.0, min=0.0)

# Output results
print("SPECS")
for i, score in enumerate(specs_scores.squeeze()):
    print(f" Text {i+1}: {score:.4f}")
```
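
The last step maps the raw cosine similarity from [-1, 1] onto [0, 1], so the score printed above is

$$
\text{SPECS}(i, c) = \max\!\Big(0,\ \tfrac{1}{2}\big(\cos\!\big(f_I(i),\, f_T(c)\big) + 1\big)\Big)
$$

where `f_I` and `f_T` denote the image and text encoders of the loaded checkpoint (these symbols mirror the code above and are used here only for readability, not as notation from the paper).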

Running this example, SPECS assigns progressively higher scores to captions with more fine-grained, correct details:

- **Text 1**: "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. The cat is partially tucked under a multi-colored woven jumper." (SPECS: 0.4293)
- **Text 2**: "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. The cat is partially tucked under a multi-colored woven blanket." (SPECS: 0.4457)
- **Text 3**: "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. The cat is partially tucked under a multi-colored woven blanket with fringed edges." (SPECS: 0.4583)
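
For checkpoint evaluation you will typically score many image–caption pairs rather than one image against several candidate texts. A minimal sketch of batched pairwise scoring, assuming the same `longclip` module and `spec.pt` checkpoint as above; the helper name and batching scheme are illustrative rather than part of the official repository:

```python
from PIL import Image
import torch
import torch.nn.functional as F
from model import longclip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = longclip.load("spec.pt", device=device)
model.eval()

def specs_for_pairs(pairs, batch_size=16):
    # Hypothetical helper: score a list of (image_path, caption) pairs and
    # return a 1-D tensor of SPECS scores, one per pair.
    scores = []
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        images = torch.stack([preprocess(Image.open(path)) for path, _ in batch]).to(device)
        tokens = longclip.tokenize([caption for _, caption in batch]).to(device)
        with torch.no_grad():
            image_features = model.encode_image(images)
            text_features = model.encode_text(tokens)
        # Pairwise similarity: the i-th image against the i-th caption.
        sim = F.cosine_similarity(image_features, text_features, dim=-1)
        scores.append(torch.clamp((sim + 1.0) / 2.0, min=0.0).cpu())
    return torch.cat(scores)
```

Averaging the returned scores over a held-out set gives a single number that can be tracked across training checkpoints, which is the iterative-evaluation use case described in the abstract.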

## Citation

If you find our work helpful for your research, please consider citing it:

```bibtex
@misc{chen2025specs,
    title={{SPECS}: Specificity-Enhanced CLIP-Score for Long Image Caption Evaluation},
    author={Xiaofu Chen and Israfel Salazar and Yova Kementchedjhieva},
    year={2025},
    eprint={2509.03897},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2509.03897},
}
```