CLIP ViT-Large finetuned with Remote Sensing data
This is a Large-sized CLIP model finetuned with Remote Sensing data, namely image captioning datasets.
The original weights are from openai/clip-vit-large-patch14.
Model Details
Model Description
This model was finetuned with the contrastive learning objective typically used to train CLIP models.
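For reference, a minimal sketch of this symmetric contrastive (InfoNCE-style) objective is shown below. The function name and batch setup are illustrative assumptions, not the actual training code used for this model:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, logit_scale):
    """Sketch of the symmetric contrastive loss used to train CLIP-style models."""
    # Normalize embeddings so dot products become cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise similarity matrix, scaled by the learned temperature.
    logits_per_image = logit_scale * image_embeds @ text_embeds.t()
    logits_per_text = logits_per_image.t()

    # Matching image-text pairs are assumed to share the same index in the batch.
    labels = torch.arange(image_embeds.size(0), device=image_embeds.device)
    loss_image = F.cross_entropy(logits_per_image, labels)
    loss_text = F.cross_entropy(logits_per_text, labels)
    return (loss_image + loss_text) / 2
```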
Training Data:
| Dataset | #Images | Image Size | Spatial Resolution | #Total Captions |
|---|---|---|---|---|
| NWPU-Captions | 31,500 | 256 × 256 | ∼30–0.2 m | 157,500 |
| RSICD | 10,921 | 224 × 224 | various resolutions | 54,605 |
| Sydney-Captions | 613 | 500 × 500 | 0.5 m | 3,065 |
| UCM-Captions | 2,100 | 256 × 256 | ∼0.3 m | 10,500 |
| Cap-4 | 45,134 | 224 × 224 | various resolutions | 225,670 |

Cap-4 corresponds to the combination of the four datasets above.
Bias, Risks, and Limitations
The model may inherit the biases and limitations from the original CLIP weights.
While trained with Remote Sensing data, it may still be of limited use for downstream applications in this domain, due to the nature of the training data (low-resolution images, and descriptions that are too short, contain mistakes, or are overly generic).
How to Get Started with the Model
Use the typical Hugging Face interface for CLIP models: https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPModel
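A minimal usage sketch with the `transformers` library is shown below. The repository id `joaodaniel/rs-clip-cap4` is taken from this model page; the image path and candidate captions are placeholders, and if the processor configuration is not included in this repository it can be loaded from `openai/clip-vit-large-patch14` instead:

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Load the finetuned model and its processor from the Hub.
model = CLIPModel.from_pretrained("joaodaniel/rs-clip-cap4")
processor = CLIPProcessor.from_pretrained("joaodaniel/rs-clip-cap4")  # assumption: processor files are in this repo

# Placeholder image and candidate captions for zero-shot classification/retrieval.
image = Image.open("aerial_scene.png")
texts = ["an aerial photo of an airport", "a satellite image of a dense forest"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-to-text similarity scores, converted into probabilities over the captions.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```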
Citation
This model was developed in the context of the following work. If you find it useful, please cite:
@article{silva2024large,
  title={Large language models for captioning and retrieving remote sensing images},
  author={Silva, Jo{\~a}o Daniel and Magalh{\~a}es, Jo{\~a}o and Tuia, Devis and Martins, Bruno},
  journal={arXiv preprint arXiv:2402.06475},
  year={2024}
}
Model Card Authors
João Daniel Silva - https://github.com/DannielSilva