ViT patch size?

#1
by bradneuberg - opened

Thanks for this model! I assume it is using a Vision Transformer? If so, what is the visual patch size in the ViT (16x16, 32x32, etc.)? In my own experiments I've found the ViT patch size to play a large role in how well small objects are retained in CLIP/SigLIP embeddings.

Patch size = 16. Our Git-RSCLIP is based on [google/siglip-large-patch16-256].
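For anyone who wants to confirm this from the released checkpoint, here is a minimal sketch; it assumes the model is published under the lcybuaa/Git-RSCLIP repo id and that the config follows the standard transformers SigLIP layout.

```python
from transformers import AutoConfig

# Assumed repo id; adjust if the checkpoint lives under a different name.
config = AutoConfig.from_pretrained("lcybuaa/Git-RSCLIP")

print(config.vision_config.patch_size)  # expected: 16
print(config.vision_config.image_size)  # expected: 256
```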

Thanks!

I'm very interested in using your model to get embeddings for a given remote sensing image. I've previously been using RemoteCLIP, which was trained on 100k image/text pairs, so your 10 million image/text pairs are really compelling.

What is the embedding dimension of your SigLIP encoder? Do you have any boilerplate code for getting an embedding for a given input image?

You can see the updated model card section: use-git-rsclip-to-get-image-features
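For reference, here is a minimal sketch of what that section boils down to with the standard transformers SigLIP API; the lcybuaa/Git-RSCLIP repo id and the example image URL are assumptions, so check the model card for the exact snippet and preprocessing.

```python
import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Assumed repo id; see the model card for the exact identifier.
model = AutoModel.from_pretrained("lcybuaa/Git-RSCLIP")
processor = AutoProcessor.from_pretrained("lcybuaa/Git-RSCLIP")

# Placeholder image URL; substitute any remote sensing tile.
url = "https://example.com/remote_sensing_tile.png"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    image_features = model.get_image_features(**inputs)

# The second dimension of this tensor is the embedding size,
# which also answers the dimensionality question below.
print(image_features.shape)
```

As with other CLIP/SigLIP embeddings, it is common to L2-normalize the features (e.g. `image_features / image_features.norm(dim=-1, keepdim=True)`) before using them for cosine-similarity retrieval.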

Thanks! Do you know the dimensionality of the resulting embeddings?
