MobileCLIP2: Improving Multi-Modal Reinforced Training
MobileCLIP2 was introduced in *MobileCLIP2: Improving Multi-Modal Reinforced Training* (TMLR, August 2025, Featured) by Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander T. Toshev, Oncel Tuzel, and Hadi Pouransari.
This repository contains a CoCa checkpoint pretrained on the DFN-2B dataset and fine-tuned on various datasets.
Highlights
- MobileCLIP2-S4 matches the accuracy of SigLIP-SO400M/14 with 2x fewer parameters and surpasses DFN ViT-L/14 at 2.5x lower latency measured on iPhone 12 Pro Max.
- MobileCLIP-S3/S4 are our new architectures trained on MobileCLIP's training dataset, DataCompDR-1B (dashed lines).
- Our smallest variant MobileCLIP-S0 obtains similar zero-shot performance to OpenAI's ViT-B/16 model while being 4.8x faster and 2.8x smaller.
- MobileCLIP-S2 obtains better average zero-shot performance than SigLIP's ViT-B/16 model while being 2.3x faster and 2.1x smaller, and is trained with 3x fewer seen samples.
- MobileCLIP-B (LT) attains a zero-shot ImageNet accuracy of 77.2%, which is significantly better than recent works such as DFN and SigLIP with similar architectures, or even OpenAI's ViT-L/14@336.
Checkpoints
| Model | # Seen Samples (B) | # Params (M) (img + txt) | Latency (ms) (img + txt) | IN-1k Zero-Shot Top-1 Acc. (%) | Avg. Perf. (%) on 38 datasets |
|---|---|---|---|---|---|
| MobileCLIP2-S0 | 13 | 11.4 + 42.4 | 1.5 + 1.6 | 71.5 | 59.7 |
| MobileCLIP2-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 77.2 | 64.1 |
| MobileCLIP2-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 79.4 | 65.8 |
| MobileCLIP2-S3 | 13 | 125.1 + 123.6 | 8.0 + 6.6 | 80.7 | 66.8 |
| MobileCLIP2-L/14 | 13 | 304.3 + 123.6 | 57.9 + 6.6 | 81.9 | 67.8 |
| MobileCLIP2-S4 | 13 | 321.6 + 123.6 | 19.6 + 6.6 | 81.9 | 67.5 |
| MobileCLIP-S0 | 13 | 11.4 + 42.4 | 1.5 + 1.6 | 67.8 | 58.1 |
| MobileCLIP-S1 | 13 | 21.5 + 63.4 | 2.5 + 3.3 | 72.6 | 61.3 |
| MobileCLIP-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 74.4 | 63.7 |
| MobileCLIP-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 76.8 | 65.2 |
| MobileCLIP-B (LT) | 36 | 86.3 + 63.4 | 10.4 + 3.3 | 77.2 | 65.8 |
| MobileCLIP-S3 | 13 | 125.1 + 123.6 | 8.0 + 6.6 | 78.3 | 66.3 |
| MobileCLIP-L/14 | 13 | 304.3 + 123.6 | 57.9 + 6.6 | 79.5 | 66.9 |
| MobileCLIP-S4 | 13 | 321.6 + 123.6 | 19.6 + 6.6 | 79.4 | 68.1 |
How to Use
First, download the desired checkpoint by visiting one of the model links in the table above, clicking the `Files and versions` tab, and downloading the PyTorch checkpoint. For programmatic downloading, if you have `huggingface_hub` installed, you can also run:
```bash
hf download apple/mobileclip2_coca_dfn2b_s13b_<finetune-dataset>_context<length>
```
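If you prefer to stay in Python, the following is a minimal sketch using `snapshot_download` from `huggingface_hub`; the repository id keeps the same `<finetune-dataset>` and `<length>` placeholders as the CLI command above and must be filled in before running:

```python
from huggingface_hub import snapshot_download

# Fill in <finetune-dataset> and <length> as in the CLI command above.
repo_id = "apple/mobileclip2_coca_dfn2b_s13b_<finetune-dataset>_context<length>"

# Downloads all files in the repository (including the PyTorch checkpoint)
# and returns the local directory they were saved to.
local_dir = snapshot_download(repo_id=repo_id)
print("Checkpoint files downloaded to:", local_dir)
```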
For models with context lengths of 128 or 256, copy `config.json` to `src/open_clip/model_configs/coca_ViT-L-14-context$len.json` and change the model name in the example below to `coca_ViT-L-14-context$len`.
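As an illustration of that step, here is a minimal sketch of the copy, assuming you have cloned the open_clip source tree and run it from the repository root; both paths below are placeholders to adjust:

```python
import shutil

# Placeholder paths: point `src` at the downloaded config.json and run this
# from the root of your open_clip checkout (or adjust `dst` accordingly).
context_len = 256  # or 128, matching the checkpoint you downloaded
src = "/path/to/downloaded/config.json"
dst = f"src/open_clip/model_configs/coca_ViT-L-14-context{context_len}.json"

# Copying the config registers the custom context-length model with open_clip.
shutil.copyfile(src, dst)
```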
```python
import torch
import open_clip
from PIL import Image

# Load the CoCa model and its preprocessing transform from the downloaded checkpoint.
model, _, preprocess = open_clip.create_model_and_transforms(
    'coca_ViT-L-14', pretrained='/path/to/mobileclip2_coca.pt'
)
model.eval()

# Preprocess an input image and add a batch dimension.
image = preprocess(Image.open("docs/fig_accuracy_latency.png").convert('RGB')).unsqueeze(0)

# Generate a synthetic caption with top-p (nucleus) sampling.
with torch.no_grad(), torch.cuda.amp.autocast():
    syn_text = model.generate(
        image,
        generation_type="top_p",
        top_p=0.9,
        fixed_output_length=True,
    )[0]

# Decode the token ids, strip the special tokens, and keep the first sentence.
syn_text = open_clip.decode(syn_text).split("<end_of_text>")[0].split("<start_of_text>")[-1].split(".")[0].rstrip()
print("Caption:", syn_text)
```
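Beyond caption generation, the same checkpoint can be used through the standard open_clip encoder interface. The snippet below is a minimal sketch of zero-shot image-text matching with this CoCa model; the label prompts are illustrative placeholders.

```python
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    'coca_ViT-L-14', pretrained='/path/to/mobileclip2_coca.pt'
)
tokenizer = open_clip.get_tokenizer('coca_ViT-L-14')
model.eval()

image = preprocess(Image.open("docs/fig_accuracy_latency.png").convert('RGB')).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])  # illustrative labels

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarities turned into probabilities over the candidate labels.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", probs)
```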