You can compute SPECS scores for an image and one or more candidate captions using the following code:

from PIL import Image
import torch
import torch.nn.functional as F
from model import longclip

# Device configuration
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load SPECS model
model, preprocess = longclip.load("spec.pt", device=device)
model.eval()

# Load image
image_path = "SPECS/images/cat.png"
image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)

# Define text descriptions
texts = [
    "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. "
    "The cat is partially tucked under a multi-colored woven jumper.",
    "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. "
    "The cat is partially tucked under a multi-colored woven blanket.",
    "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. "
    "The cat is partially tucked under a multi-colored woven blanket with fringed edges."
]

# Process inputs
text_tokens = longclip.tokenize(texts).to(device)

# Get features and calculate SPECS
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_tokens)
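
    # Cosine similarity between the image and each caption; result shape is (1, num_texts)
    similarity = F.cosine_similarity(image_features.unsqueeze(1), text_features.unsqueeze(0), dim=-1)

    # SPECS: rescale cosine similarity from [-1, 1] to [0, 1]; the clamp guards against small negative values
    specs_scores = torch.clamp((similarity + 1.0) / 2.0, min=0.0)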

# Output results (squeeze only the image dimension so a single caption still iterates)
print("SPECS")
for i, score in enumerate(specs_scores.squeeze(0)):
    print(f" Text {i+1}: {score.item():.4f}")

For the example image, SPECS assigns progressively higher scores as the captions become more accurate and more detailed. Text 1 differs from Text 2 only in "jumper" versus "blanket", and Text 3 adds "with fringed edges":

  • Text 1: "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. The cat is partially tucked under a multi-colored woven jumper."
    Score: 0.4293

  • Text 2: "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. The cat is partially tucked under a multi-colored woven blanket."
    Score: 0.4457

  • Text 3: "A British Shorthair cat with plush, bluish-gray fur is lounging on a deep green velvet sofa. The cat is partially tucked under a multi-colored woven blanket with fringed edges."
    Score: 0.4583
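
The arithmetic behind these numbers can be checked without loading the model. The following is a minimal sketch in plain Python, assuming the score is the rescaled cosine similarity computed in the snippet above. The helper names `cosine_similarity`, `specs_from_cosine`, and `cosine_from_specs` are hypothetical and not part of the `longclip` module, and the toy vectors are made up for illustration, not model outputs.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def specs_from_cosine(cos_sim):
    """Rescale cosine similarity from [-1, 1] to [0, 1], clamped at 0 (hypothetical helper)."""
    return max((cos_sim + 1.0) / 2.0, 0.0)


def cosine_from_specs(score):
    """Invert the rescaling; valid whenever the clamp did not trigger."""
    return 2.0 * score - 1.0


# Toy 2-D vectors (made up): identical, orthogonal, and opposite directions.
toy_same = specs_from_cosine(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
toy_orthogonal = specs_from_cosine(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
toy_opposite = specs_from_cosine(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))

# Scores reported above for Text 1, Text 2, and Text 3.
reported = [0.4293, 0.4457, 0.4583]

# Cosine similarities implied by the reported scores.
implied_cosines = [round(cosine_from_specs(s), 4) for s in reported]

# Score gain from each caption to the next, and whether the order is strictly increasing.
gaps = [round(b - a, 4) for a, b in zip(reported, reported[1:])]
is_increasing = all(a < b for a, b in zip(reported, reported[1:]))

print("toy scores:", toy_same, toy_orthogonal, toy_opposite)
print("implied cosines:", implied_cosines)
print("gaps:", gaps, "increasing:", is_increasing)
```

Under this assumption, a score below 0.5 corresponds to a negative cosine similarity, so all three reported scores imply slightly negative raw similarities. The comparison the example draws is the ordering of the captions and the size of the gaps between them, not the absolute value of any single score.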
