DermLIP + GPT-2 Dermatology Captioner

A dermatology image-captioning model that combines the DermLIP vision encoder with the gpt2-medium language model. It is trained on dermatological images to generate clinical descriptions of skin lesions.

Architecture: DermLIP (ViT-B/16) → learnable prefix → GPT-2 (gpt2-medium). Trained in two stages: Stage A (META) for generalization and Stage B (SkinCAP) for style/terminology.
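
The actual projector lives in inference_min.py; the sketch below only illustrates the idea of a learnable prefix, assuming a single linear projection from the CLIP image embedding (512-d for ViT-B/16) to 32 prefix embeddings of gpt2-medium's hidden size (1024). Class and argument names are illustrative, not the repo's API.

import torch
import torch.nn as nn

class PrefixProjector(nn.Module):
    """Illustrative: map one CLIP image embedding to a sequence of prefix embeddings for GPT-2."""
    def __init__(self, clip_dim=512, gpt2_dim=1024, prefix_len=32):
        super().__init__()
        self.prefix_len = prefix_len
        self.gpt2_dim = gpt2_dim
        # a single linear layer here; the real projector may be a deeper MLP
        self.proj = nn.Linear(clip_dim, prefix_len * gpt2_dim)

    def forward(self, image_emb):                                # (batch, clip_dim)
        prefix = self.proj(image_emb)                            # (batch, prefix_len * gpt2_dim)
        return prefix.view(-1, self.prefix_len, self.gpt2_dim)   # (batch, 32, 1024)

# The prefix embeddings are concatenated with the embedded prompt tokens and
# passed to GPT-2 (e.g. via inputs_embeds) for autoregressive caption generation.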

Metrics

Stage A (META)
val_loss=1.1070 • PPL=3.03
BLEU=38.6 • ROUGE-L=0.550 • CIDEr-D=0.17 • CLIP=24.4 • BERT_F1=0.565

Stage B (SkinCAP)
val_loss=1.1903 • PPL=3.29
BLEU=10.0 • ROUGE-L=0.278 • CIDEr-D=0.13 • CLIP=25.9 • BERT_F1=0.363
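
The reported PPL values are consistent with exp(val_loss), i.e. perplexity derived from the mean token-level cross-entropy:

import math

# perplexity = exp(mean cross-entropy loss per token)
print(round(math.exp(1.1070), 2))  # Stage A: 3.03
print(round(math.exp(1.1903), 2))  # Stage B: 3.29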

Inference

The minimal example below uses inference_min.py, which is included in this repo.
Requires: pip install torch transformers open_clip_torch pillow huggingface_hub

import sys

from huggingface_hub import snapshot_download

# 1) download the repo snapshot (weights, config, and inference_min.py)
repo_dir = snapshot_download(
    "moxeeeem/dermlip-gpt2-captioner",
    allow_patterns=["*.pt", "*.json", "inference_min.py"],
)

# 2) import the helpers from the downloaded inference_min.py, then load the model
sys.path.insert(0, repo_dir)
from inference_min import load_model, generate

model = load_model(repo_dir)  # builds CLIP backbone + GPT-2 + prefix projector

# 3) run generation on local test images
img_paths = ["/path/to/derma_image.jpg"]
caps = generate(
    model,
    img_paths,
    prompt="Describe the skin lesion concisely (morphology, color, scale, border, location) in one sentence.Conclude with the most likely diagnosis (1\u20133 words).",
)
for c in caps:
    print(c)

Files

File                                                Size    SHA-256 (first 12)
best_stageA.pt                                      2 GB    3219636f48b0
best_stageB.pt                                      2 GB    69bded2dcad1
final_captioner_gpt2-medium_VisionTransformer.json  849 B   e157402c9fe2
final_captioner_gpt2-medium_VisionTransformer.pt    2 GB    536ae07811c9
loss_dermlip_vitb16.png                             110 KB  a04b1e5832d9
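
To confirm the downloaded weights match the checksums above, you can hash the files locally and compare the first 12 hex characters. This small helper is not part of the repo; repo_dir is the path returned by the snapshot_download call in the Inference section.

import hashlib
import os

def sha256_prefix(path, n=12, chunk=1 << 20):
    # stream the file in 1 MiB chunks so large .pt files don't need to fit in memory
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()[:n]

print(sha256_prefix(os.path.join(repo_dir, "best_stageB.pt")))  # expect 69bded2dcad1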

Details

  • Vision Encoder: DermLIP (ViT-B/16)
  • Language Model: GPT-2 (gpt2-medium)
  • CLIP weights: hf-hub:redlessone/DermLIP_ViT-B-16 (loading sketch after this list)
  • Prefix tokens: 32
  • Training prompt: Describe the skin lesion concisely (morphology, color, scale, border, location) in one sentence.Conclude with the most likely diagnosis (1–3 words).
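
inference_min.py handles vision-encoder loading internally; a minimal sketch of loading the DermLIP backbone on its own, assuming the standard open_clip hf-hub loader works for this checkpoint:

import torch
import open_clip
from PIL import Image

# DermLIP ViT-B/16 vision tower via open_clip's Hugging Face hub loader
model, _, preprocess = open_clip.create_model_and_transforms("hf-hub:redlessone/DermLIP_ViT-B-16")
model.eval()

image = preprocess(Image.open("/path/to/derma_image.jpg")).unsqueeze(0)  # (1, 3, 224, 224)
with torch.no_grad():
    image_emb = model.encode_image(image)  # embedding fed to the prefix projector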

Model Type Detection

  • Detected as: dermlip
  • Repository: moxeeeem/dermlip-gpt2-captioner

Auto-generated on 2025-08-30 09:25 UTC.
