ViT patch size?
Thanks for this model! I assume it is using a Vision Transformer? If so, what is the patch size in the ViT (16x16, 32x32, etc.)? In my own experiments I've found the ViT patch size to play a large role in how well small objects are retained in CLIP/SigLIP embeddings.
Patch size = 16. Our Git-RSCLIP is based on [google/siglip-large-patch16-256].
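If it helps, here is a minimal sketch of how you could confirm this from the base checkpoint's config, assuming the standard transformers SigLIP config layout (the attribute names below are from that layout, not from this repo specifically):

```python
from transformers import AutoConfig

# Read the config of the base checkpoint; Git-RSCLIP should share the same vision settings.
config = AutoConfig.from_pretrained("google/siglip-large-patch16-256")

print(config.vision_config.patch_size)  # expected: 16
print(config.vision_config.image_size)  # expected: 256
```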
Thanks!
I'm very interested in using your model to get embeddings for a given remote sensing image. I've previously been using RemoteCLIP which was trained on 100k image/text pairs, but your 10 million image/text pairs is really compelling.
What is the embedding dimension of your SigLIP encoder? And do you have any boilerplate code for getting an embedding from a given input image?
You can see our updated model card: use-git-rsclip-to-get-image-features
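For anyone else landing here, below is a minimal sketch of what image-feature extraction typically looks like with the standard transformers SigLIP API. The repo id and image path are placeholders (substitute the actual Git-RSCLIP checkpoint from the model card), and printing the output shape also shows the embedding dimensionality:

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Placeholder repo id -- replace with the actual Git-RSCLIP checkpoint from the model card.
model_id = "path/to/Git-RSCLIP"

model = AutoModel.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Placeholder image path -- any remote sensing image converted to RGB.
image = Image.open("example_remote_sensing_image.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    image_features = model.get_image_features(**inputs)

# The trailing dimension is the embedding size of the vision tower.
print(image_features.shape)  # e.g. (1, embedding_dim)
```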
Thanks! Do you know the dimensionality of the resulting embeddings?