CLIP ViT-Large finetuned with Remote Sensing data

This is a Large-sized CLIP model finetuned with Remote Sensing data, namely image captioning datasets.

The original weights are from openai/clip-vit-large-patch14.

Model Details

Model Description

This model was finetuned with the contrastive learning objective typically used to train CLIP models.
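For reference, below is a minimal sketch of the symmetric contrastive (InfoNCE-style) objective commonly used to train CLIP models, assuming batches of paired image and text embeddings; the function name and shapes are illustrative and not taken from this model's actual training code.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, logit_scale):
    # Normalize embeddings to unit length so the dot product is a cosine similarity
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Similarity logits for every image-text pair in the batch, scaled by a learned temperature
    logits_per_image = logit_scale * image_embeds @ text_embeds.t()
    logits_per_text = logits_per_image.t()

    # Matching image-caption pairs lie on the diagonal
    labels = torch.arange(image_embeds.size(0), device=image_embeds.device)

    # Symmetric cross-entropy over the image-to-text and text-to-image directions
    loss_i = F.cross_entropy(logits_per_image, labels)
    loss_t = F.cross_entropy(logits_per_text, labels)
    return (loss_i + loss_t) / 2
```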

Training Data:

| Dataset | #Images | Image Size | Spatial Resolution | #Total Captions |
|---|---|---|---|---|
| NWPU-Captions | 31,500 | 256 × 256 | ∼30–0.2 m | 157,500 |
| RSICD | 10,921 | 224 × 224 | different resolutions | 54,605 |
| Sydney-Captions | 613 | 500 × 500 | 0.5 m | 3,065 |
| UCM-Captions | 2,100 | 256 × 256 | ∼0.3 m | 10,500 |
| Cap-4 | 45,134 | 224 × 224 | different resolutions | 225,670 |

Cap-4 combines the four datasets above; its image and caption counts are the sums of the individual datasets.

Bias, Risks, and Limitations

The model may inherit the biases and limitations from the original CLIP weights.

Although it was finetuned with Remote Sensing data, the model may still be limited in downstream applications of this domain due to the nature of the training data: the images are of low resolution, and the captions are short and can contain mistakes or be overly generic.

How to Get Started with the Model

Use the standard Hugging Face transformers interface for CLIP models: https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPModel
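A minimal usage sketch, assuming the weights are loaded from the Hub under the id joaodaniel/rs-clip-cap4 and that example.jpg is a placeholder path for a local remote sensing image; if the repository does not ship processor files, the processor from openai/clip-vit-large-patch14 can be used instead.

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Load the finetuned weights and the matching processor from the Hub
model = CLIPModel.from_pretrained("joaodaniel/rs-clip-cap4")
processor = CLIPProcessor.from_pretrained("joaodaniel/rs-clip-cap4")

# Placeholder image path; replace with a remote sensing image of your own
image = Image.open("example.jpg")
texts = ["an aerial photo of an airport", "an aerial photo of a forest"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores; softmax gives zero-shot label probabilities
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```

The same logits_per_image scores can also be used for text-to-image retrieval by ranking a set of candidate images against a query caption.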

Citation

This model was developed in the context of the following work. If you find it useful, please cite:

@article{silva2024large,
  title={Large language models for captioning and retrieving remote sensing images},
  author={Silva, Jo{\~a}o Daniel and Magalh{\~a}es, Jo{\~a}o and Tuia, Devis and Martins, Bruno},
  journal={arXiv preprint arXiv:2402.06475},
  year={2024}
}

Model Card Authors

João Daniel Silva - https://github.com/DannielSilva
