You can compute SPECS scores for an image–caption pair using the following code:
from PIL import Image
import torch
import torch.nn.functional as F
from model import longclip

# Device configuration
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load SPECS model
model, preprocess = longclip.load("spec.pt", device=device)
model.eval()

# Load image and add a batch dimension
image_path = "SPECS/images/cat.png"
image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)

# Define text descriptions (adjacent string literals are concatenated)
texts = [
    "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. "
    "The cat is partially tucked under a multi-colored woven jumper.",
    "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. "
    "The cat is partially tucked under a multi-colored woven blanket.",
    "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. "
    "The cat is partially tucked under a multi-colored woven blanket with fringed edges.",
]

# Process inputs
text_tokens = longclip.tokenize(texts).to(device)

# Get features and calculate SPECS
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)

    # Calculate cosine similarity; result has shape (num_images, num_texts)
    similarity = F.cosine_similarity(image_features.unsqueeze(1), text_features.unsqueeze(0), dim=-1)

    # SPECS: rescale cosine similarity from [-1, 1] to [0, 1]
    specs_scores = torch.clamp((similarity + 1.0) / 2.0, min=0.0)

# Output results
print("SPECS")
# squeeze(0) drops only the image dimension, so this also works for a single caption
for i, score in enumerate(specs_scores.squeeze(0)):
    print(f" Text {i+1}: {score:.4f}")
In this example, SPECS assigns progressively higher scores as the captions become more correct and more fine-grained:

Text 1: "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. The cat is partially tucked under a multi-colored woven jumper."
→ Score: 0.4293

Text 2: "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. The cat is partially tucked under a multi-colored woven blanket."
→ Score: 0.4457

Text 3: "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. The cat is partially tucked under a multi-colored woven blanket with fringed edges."
→ Score: 0.4583
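
The score mapping used in the code above can be looked at separately from the model. The following is a minimal sketch in plain Python, assuming the image and text embeddings are already available as lists of floats. The helper names `cosine_similarity` and `specs_score` are illustrative and are not part of the SPECS or LongCLIP code. It rescales cosine similarity from [-1, 1] to [0, 1], mirroring the `torch.clamp((similarity + 1.0) / 2.0, min=0.0)` step.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (illustrative helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption of this sketch: reject zero vectors instead of clamping the
    # norm with a small epsilon as torch's F.cosine_similarity does.
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def specs_score(image_vec: Sequence[float], text_vec: Sequence[float]) -> float:
    """Map cosine similarity from [-1, 1] to [0, 1], clamped below at 0."""
    sim = cosine_similarity(image_vec, text_vec)
    return max((sim + 1.0) / 2.0, 0.0)


if __name__ == "__main__":
    # Toy 2-D vectors, not real model embeddings.
    print(specs_score([1.0, 0.0], [1.0, 0.0]))   # identical direction
    print(specs_score([1.0, 0.0], [0.0, 1.0]))   # orthogonal
    print(specs_score([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Under this mapping, identical directions score 1, orthogonal vectors score 0.5, and opposite directions score 0, so a higher score always corresponds to a higher cosine similarity between image and caption embeddings.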
Base model: BeichenZhang/LongCLIP-B