MobileCLIP2: Improving Multi-Modal Reinforced Training
MobileCLIP2 was introduced in *MobileCLIP2: Improving Multi-Modal Reinforced Training* (TMLR, August 2025, Featured) by Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander T. Toshev, Oncel Tuzel, and Hadi Pouransari.
This repository contains a CoCa checkpoint pretrained on the DFN-2B dataset and fine-tuned on various datasets.
Highlights
- MobileCLIP2-S4 matches the accuracy of SigLIP-SO400M/14 with 2x fewer parameters and surpasses DFN ViT-L/14 at 2.5x lower latency measured on iPhone 12 Pro Max.
- MobileCLIP-S3/S4 are our new architectures trained on MobileCLIP's training dataset, DataCompDR-1B (dashed lines).
- Our smallest variant MobileCLIP-S0 obtains similar zero-shot performance to OpenAI's ViT-B/16 model while being 4.8x faster and 2.8x smaller.
- MobileCLIP-S2 obtains better average zero-shot performance than SigLIP's ViT-B/16 model while being 2.3x faster and 2.1x smaller, and is trained with 3x fewer seen samples.
- MobileCLIP-B (LT) attains a zero-shot ImageNet accuracy of 77.2%, which is significantly better than recent works such as DFN and SigLIP with similar architectures, or even OpenAI's ViT-L/14@336.
Checkpoints
| Model | # Seen Samples (B) | # Params (M) (img + txt) | Latency (ms) (img + txt) | IN-1k Zero-Shot Top-1 Acc. (%) | Avg. Perf. (%) on 38 datasets |
|---|---|---|---|---|---|
| MobileCLIP2-S0 | 13 | 11.4 + 42.4 | 1.5 + 1.6 | 71.5 | 59.7 |
| MobileCLIP2-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 77.2 | 64.1 |
| MobileCLIP2-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 79.4 | 65.8 |
| MobileCLIP2-S3 | 13 | 125.1 + 123.6 | 8.0 + 6.6 | 80.7 | 66.8 |
| MobileCLIP2-L/14 | 13 | 304.3 + 123.6 | 57.9 + 6.6 | 81.9 | 67.8 |
| MobileCLIP2-S4 | 13 | 321.6 + 123.6 | 19.6 + 6.6 | 81.9 | 67.5 |
| MobileCLIP-S0 | 13 | 11.4 + 42.4 | 1.5 + 1.6 | 67.8 | 58.1 |
| MobileCLIP-S1 | 13 | 21.5 + 63.4 | 2.5 + 3.3 | 72.6 | 61.3 |
| MobileCLIP-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 74.4 | 63.7 |
| MobileCLIP-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 76.8 | 65.2 |
| MobileCLIP-B (LT) | 36 | 86.3 + 63.4 | 10.4 + 3.3 | 77.2 | 65.8 |
| MobileCLIP-S3 | 13 | 125.1 + 123.6 | 8.0 + 6.6 | 78.3 | 66.3 |
| MobileCLIP-L/14 | 13 | 304.3 + 123.6 | 57.9 + 6.6 | 79.5 | 66.9 |
| MobileCLIP-S4 | 13 | 321.6 + 123.6 | 19.6 + 6.6 | 79.4 | 68.1 |
How to Use
First, download the desired checkpoint by visiting one of the model links in the table above, clicking the `Files and versions` tab, and downloading the PyTorch checkpoint. For programmatic downloading, if you have `huggingface_hub` installed, you can also run:
```bash
hf download apple/mobileclip2_coca_dfn2b_s13b_<finetune-dataset>_context<length>
```
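If you prefer to stay in Python, the following is a minimal sketch using `snapshot_download` from `huggingface_hub`; the repository id keeps the same `<finetune-dataset>` and `<length>` placeholders as the CLI command above and must be filled in before running:

```python
from huggingface_hub import snapshot_download

# Fill in <finetune-dataset> and <length> as in the CLI command above.
repo_id = "apple/mobileclip2_coca_dfn2b_s13b_<finetune-dataset>_context<length>"

# Downloads all files in the repository (including the PyTorch checkpoint)
# and returns the local directory they were saved to.
local_dir = snapshot_download(repo_id=repo_id)
print("Checkpoint files downloaded to:", local_dir)
```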
For models with context lengths of 128 or 256, copy `config.json` to `src/open_clip/model_configs/coca_ViT-L-14-context$len.json` and change the model name in the example below to `coca_ViT-L-14-context$len`.
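As an illustration of that step, here is a minimal sketch of the copy, assuming you have cloned the open_clip source tree and run it from the repository root; both paths below are placeholders to adjust:

```python
import shutil

# Placeholder paths: point `src` at the downloaded config.json and run this
# from the root of your open_clip checkout (or adjust `dst` accordingly).
context_len = 256  # or 128, matching the checkpoint you downloaded
src = "/path/to/downloaded/config.json"
dst = f"src/open_clip/model_configs/coca_ViT-L-14-context{context_len}.json"

# Copying the config registers the custom context-length model with open_clip.
shutil.copyfile(src, dst)
```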
```python
import torch
import open_clip
from PIL import Image

# Load the CoCa model and its preprocessing transform from the downloaded checkpoint.
model, _, preprocess = open_clip.create_model_and_transforms(
    'coca_ViT-L-14', pretrained='/path/to/mobileclip2_coca.pt'
)
model.eval()

# Preprocess an input image and add a batch dimension.
image = preprocess(Image.open("docs/fig_accuracy_latency.png").convert('RGB')).unsqueeze(0)

# Generate a synthetic caption with top-p (nucleus) sampling.
with torch.no_grad(), torch.cuda.amp.autocast():
    syn_text = model.generate(
        image,
        generation_type="top_p",
        top_p=0.9,
        fixed_output_length=True,
    )[0]

# Decode the token ids, strip the special tokens, and keep the first sentence.
syn_text = open_clip.decode(syn_text).split("<end_of_text>")[0].split("<start_of_text>")[-1].split(".")[0].rstrip()
print("Caption:", syn_text)
```
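Beyond caption generation, the same checkpoint can be used through the standard open_clip encoder interface. The snippet below is a minimal sketch of zero-shot image-text matching with this CoCa model; the label prompts are illustrative placeholders.

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    'coca_ViT-L-14', pretrained='/path/to/mobileclip2_coca.pt'
)
tokenizer = open_clip.get_tokenizer('coca_ViT-L-14')
model.eval()

image = preprocess(Image.open("docs/fig_accuracy_latency.png").convert('RGB')).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])  # illustrative labels

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarities turned into probabilities over the candidate labels.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", probs)
```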