---
license: apache-2.0
tags:
- vision
---
# SigLIP 2 Base

[SigLIP 2](https://huggingface.co/collections/google/siglip2-67b5dcef38c175486e240107)
extends the pretraining objective of
[SigLIP](https://huggingface.co/collections/google/siglip-659d5e62f0ae1a57ae0e83ba)
with prior, independently developed techniques into a unified recipe, for improved semantic
understanding, localization, and dense features.

## Intended uses

You can use the raw model for tasks like zero-shot image classification and
image-text retrieval, or as a vision encoder for VLMs (and other vision tasks).
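
For example, here is a minimal zero-shot classification sketch using the Transformers `pipeline` API. The checkpoint id, image URL, and candidate labels are illustrative; substitute the exact repo id of this model.

```python
from transformers import pipeline

# Zero-shot image classification with a SigLIP 2 checkpoint.
# The checkpoint id below is an assumption; use the exact repo id of this model.
classifier = pipeline(
    task="zero-shot-image-classification",
    model="google/siglip2-base-patch16-224",
)

image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
candidate_labels = ["2 cats", "a plane", "a remote"]

for result in classifier(image_url, candidate_labels=candidate_labels):
    print(f"{result['label']}: {result['score']:.4f}")
```

To use the model as a vision encoder, you can extract pooled image embeddings directly. Again a sketch under the same checkpoint assumption; `get_image_features` is the method exposed by the SigLIP model classes in Transformers.

```python
import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Assumed checkpoint id; replace with this model's repo id.
ckpt = "google/siglip2-base-patch16-224"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True
).raw)

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    # Pooled image embedding of shape (batch_size, hidden_dim).
    image_embeds = model.get_image_features(**inputs)
print(image_embeds.shape)
```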
## Training procedure

SigLIP 2 adds some clever training objectives on top of SigLIP (a sketch of the base sigmoid loss they extend follows the list):

1. Decoder loss
2. Global-local and masked prediction loss
3. Aspect ratio and resolution adaptability
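
These objectives are added on top of SigLIP's pairwise sigmoid loss. For reference, here is a minimal PyTorch sketch of that base loss; it is an illustration written for this card, not the actual training code, and the function name and arguments are placeholders.

```python
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(image_embeds, text_embeds, log_temperature, bias):
    """SigLIP-style pairwise sigmoid loss over a batch of L2-normalized embeddings."""
    # Similarity logits for every image-text pair in the batch.
    logits = image_embeds @ text_embeds.t() * log_temperature.exp() + bias
    # Matched pairs sit on the diagonal (+1); all other pairs are negatives (-1).
    labels = 2.0 * torch.eye(logits.size(0), device=logits.device) - 1.0
    # Binary sigmoid loss on every pair, averaged over the batch dimension.
    return -F.logsigmoid(labels * logits).sum() / logits.size(0)
```

The three objectives listed above sit on top of this base loss; see the paper for their exact formulation.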
### Training data

SigLIP 2 is pre-trained on the WebLI dataset [(Chen et al., 2023)](https://arxiv.org/abs/2209.06794).

### Compute

The model was trained on up to 2048 TPU-v5e chips.

## Evaluation results

Evaluation of SigLIP 2 is shown below (taken from the paper).

[Evaluation Table](TODO)
### BibTeX entry and citation info

```bibtex
TODO
```