---
license: apache-2.0
tags:
- vision
widget:
- src: >-
    https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg
  candidate_labels: bee in the sky, bee on the flower
  example_title: Bee
library_name: transformers
pipeline_tag: zero-shot-image-classification
---
# SigLIP 2 So400m
[SigLIP 2](https://huggingface.co/papers/2502.14786) extends the pretraining objective of
[SigLIP](https://huggingface.co/papers/2303.15343) with prior, independently developed techniques
into a unified recipe, for improved semantic understanding, localization, and dense features.
## Intended uses
You can use the raw model for tasks like zero-shot image classification and
image-text retrieval, or as a vision encoder for VLMs (and other vision tasks).
Here is how to use this model to perform zero-shot image classification:
```python
from transformers import pipeline
# load pipeline
ckpt = "google/siglip2-so400m-patch16-256"
image_classifier = pipeline(model=ckpt, task="zero-shot-image-classification")
# image URL and candidate labels
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
candidate_labels = ["2 cats", "a plane", "a remote"]
# run inference (the pipeline accepts an image URL directly)
outputs = image_classifier(url, candidate_labels=candidate_labels)
print(outputs)
```
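The pipeline returns a list of dictionaries with `label` and `score` keys. SigLIP scores each label with an independent sigmoid, so the scores need not sum to 1 and several labels (or none) can score highly at once. The following is a minimal sketch of thresholding such an output; `select_labels`, the `0.5` threshold, and the sample scores are illustrative assumptions, not part of the library or real model output.

```python
def select_labels(outputs, threshold=0.5):
    """Return the labels scoring at least `threshold`, best first.

    `outputs` is assumed to be a list of {"label": str, "score": float}
    dicts, the shape returned by the zero-shot image classification pipeline.
    """
    kept = [o for o in outputs if o["score"] >= threshold]
    kept.sort(key=lambda o: o["score"], reverse=True)
    return [o["label"] for o in kept]


# Placeholder scores for illustration only, not real model output.
example_outputs = [
    {"label": "a remote", "score": 0.62},
    {"label": "2 cats", "score": 0.91},
    {"label": "a plane", "score": 0.03},
]
print(select_labels(example_outputs))
```

A suitable threshold depends on the task and should be chosen on held-out data rather than taken from this sketch.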
You can encode an image using the Vision Tower like so:
```python
import torch
from transformers import AutoModel, AutoProcessor
from transformers.image_utils import load_image
# load the model and processor
ckpt = "google/siglip2-so400m-patch16-256"
model = AutoModel.from_pretrained(ckpt, device_map="auto").eval()
processor = AutoProcessor.from_pretrained(ckpt)
# load the image
image = load_image("https://huggingface.co/datasets/merve/coco/resolve/main/val2017/000000000285.jpg")
inputs = processor(images=[image], return_tensors="pt").to(model.device)
# run inference
with torch.no_grad():
    image_embeddings = model.get_image_features(**inputs)
print(image_embeddings.shape)
```
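For image-text retrieval, image embeddings like the one above are typically compared against text embeddings from the same model (for example, obtained with `model.get_text_features`). The sketch below ranks candidate texts by cosine similarity to one image embedding. The helper name `rank_texts` and the toy 3-dimensional vectors are assumptions made for illustration; real embeddings have many more dimensions.

```python
import numpy as np


def rank_texts(image_embedding, text_embeddings):
    """Rank candidate texts by cosine similarity to one image embedding.

    image_embedding: array of shape (dim,)
    text_embeddings: array of shape (num_texts, dim)
    Returns (indices sorted best-first, similarities in the original order).
    """
    image = np.asarray(image_embedding, dtype=np.float64)
    texts = np.asarray(text_embeddings, dtype=np.float64)
    # L2-normalize so that the dot product equals cosine similarity.
    image = image / np.linalg.norm(image)
    texts = texts / np.linalg.norm(texts, axis=1, keepdims=True)
    sims = texts @ image
    order = np.argsort(-sims)
    return order, sims


# Toy vectors for illustration only, not real model embeddings.
image_vec = np.array([1.0, 0.0, 0.0])
text_vecs = np.array([[0.0, 1.0, 0.0], [2.0, 0.1, 0.0], [1.0, 1.0, 0.0]])
order, sims = rank_texts(image_vec, text_vecs)
print(order.tolist())
```

To use it with the model outputs, a PyTorch tensor can be converted first, for example with `image_embeddings[0].float().cpu().numpy()`.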
For more code examples, see the [SigLIP documentation](https://huggingface.co/transformers/main/model_doc/siglip.html#).
## Training procedure
SigLIP 2 adds the following training objectives and capabilities on top of SigLIP:
1. Decoder loss
2. Global-local and masked prediction loss
3. Aspect ratio and resolution adaptability
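For context, the base objective these additions build on is the pairwise sigmoid loss from the SigLIP paper, which treats every image-text pair in a batch as an independent binary classification problem. The sketch below covers only that base loss and none of the three SigLIP 2 additions listed above. It is a NumPy illustration under stated assumptions, not the training code: the function name is hypothetical, and the temperature and bias are fixed scalars here, whereas they are learnable during training.

```python
import numpy as np


def sigmoid_pairwise_loss(image_emb, text_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of L2-normalized embeddings.

    image_emb, text_emb: arrays of shape (batch, dim); row i of each
    is assumed to be a matching image-text pair.
    temperature, bias: fixed scalars in this sketch (learnable in training).
    """
    image_emb = np.asarray(image_emb, dtype=np.float64)
    text_emb = np.asarray(text_emb, dtype=np.float64)
    logits = temperature * image_emb @ text_emb.T + bias
    # +1 on the diagonal (matching pairs), -1 elsewhere (non-matching pairs).
    signs = 2.0 * np.eye(logits.shape[0]) - 1.0
    # -log(sigmoid(x)) computed stably as log(1 + exp(-x)).
    per_pair = np.logaddexp(0.0, -signs * logits)
    # Sum over all pairs, averaged over the batch size.
    return per_pair.sum() / logits.shape[0]


# Toy batch of two orthonormal embeddings, for illustration only.
aligned = np.eye(2)
print(sigmoid_pairwise_loss(aligned, aligned))
```

Swapping the rows of one input, so that matching pairs no longer sit on the diagonal, should give a noticeably larger loss.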
### Training data
SigLIP 2 is pre-trained on the WebLI dataset [(Chen et al., 2023)](https://arxiv.org/abs/2209.06794).
### Compute
The model was trained on up to 2048 TPU-v5e chips.
## Evaluation results
Evaluation of SigLIP 2 is shown below (taken from the paper).

### BibTeX entry and citation info
```bibtex
@misc{tschannen2025siglip2multilingualvisionlanguage,
title={SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features},
author={Michael Tschannen and Alexey Gritsenko and Xiao Wang and Muhammad Ferjad Naeem and Ibrahim Alabdulmohsin and Nikhil Parthasarathy and Talfan Evans and Lucas Beyer and Ye Xia and Basil Mustafa and Olivier Hénaff and Jeremiah Harmsen and Andreas Steiner and Xiaohua Zhai},
year={2025},
eprint={2502.14786},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2502.14786},
}
```