---
license: apache-2.0
tags:
- vision
---
SigLIP 2 Base
SigLIP 2 extends the pretraining objective of SigLIP with prior, independently developed techniques into a unified recipe for improved semantic understanding, localization, and dense features.
Intended uses
You can use the raw model for tasks like zero-shot image classification and image-text retrieval, or as a vision encoder for VLMs (and other vision tasks).
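For instance, zero-shot image classification can be run with the `transformers` zero-shot-image-classification pipeline. The sketch below is illustrative only; the checkpoint identifier and image URL are placeholder assumptions, not values taken from this card, so substitute the SigLIP 2 Base checkpoint you actually intend to use.

```python
from transformers import pipeline

# Placeholder checkpoint name (assumption for illustration).
ckpt = "google/siglip2-base-patch16-224"
classifier = pipeline("zero-shot-image-classification", model=ckpt)

# Placeholder image URL and candidate labels.
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
candidate_labels = ["2 cats", "a plane", "a remote"]

outputs = classifier(image_url, candidate_labels=candidate_labels)
print(outputs)  # list of {"label": ..., "score": ...} entries
```

The same checkpoint can also serve as a vision encoder: load it with `AutoModel`/`AutoProcessor` and use the pooled image features as inputs to a downstream VLM or vision task.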
Training procedure
SigLIP 2 adds some clever training objectives on top of SigLIP:
- Decoder loss
- Global-local and masked prediction loss
- Aspect ratio and resolution adaptability
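These objectives are added on top of SigLIP's pairwise sigmoid loss. As a reference point, here is a minimal sketch of that base objective (not the SigLIP 2-specific additions above), assuming L2-normalized image and text embeddings and learnable temperature/bias scalars; names and values are illustrative.

```python
import torch
import torch.nn.functional as F

def siglip_sigmoid_loss(img_emb, txt_emb, log_t, b):
    """Pairwise sigmoid loss from SigLIP (Zhai et al., 2023).

    img_emb, txt_emb: (N, D) L2-normalized embeddings of N paired examples.
    log_t, b: learnable log-temperature and bias scalars.
    """
    # (N, N) similarity logits for every image-text combination.
    logits = img_emb @ txt_emb.T * log_t.exp() + b
    # +1 on the diagonal (true pairs), -1 everywhere else.
    labels = 2 * torch.eye(len(logits)) - 1
    # Each image-text pair is treated as an independent binary classification.
    return -F.logsigmoid(labels * logits).sum() / len(logits)

# Toy usage with random features (illustrative only).
img = F.normalize(torch.randn(8, 768), dim=-1)
txt = F.normalize(torch.randn(8, 768), dim=-1)
log_t, b = torch.tensor(2.3), torch.tensor(-10.0)
print(siglip_sigmoid_loss(img, txt, log_t, b))
```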
Training data
SigLIP 2 is pre-trained on the WebLI dataset (Chen et al., 2023).
Compute
The model was trained on up to 2048 TPU-v5e chips.
Evaluation results
Evaluation of SigLIP 2 is shown below (taken from the paper).
BibTeX entry and citation info
TODO