CLIP ViT-Large finetuned with Remote Sensing data
This is a Large-sized CLIP model finetuned with Remote Sensing data, namely image captioning datasets.
The original weights are from openai/clip-vit-large-patch14.
Model Details
Model Description
This model was finetuned with the contrastive learning objective typically used to train CLIP models.
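For reference, a minimal sketch of this symmetric contrastive (InfoNCE-style) objective is shown below. The function name and batch setup are illustrative assumptions, not the actual training code used for this model:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, logit_scale):
    """Sketch of the symmetric contrastive loss used to train CLIP-style models."""
    # Normalize embeddings so dot products become cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise similarity matrix, scaled by the learned temperature.
    logits_per_image = logit_scale * image_embeds @ text_embeds.t()
    logits_per_text = logits_per_image.t()

    # Matching image-text pairs are assumed to share the same index in the batch.
    labels = torch.arange(image_embeds.size(0), device=image_embeds.device)
    loss_image = F.cross_entropy(logits_per_image, labels)
    loss_text = F.cross_entropy(logits_per_text, labels)
    return (loss_image + loss_text) / 2
```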
Training Data:
| Dataset | #Images | Image Size | Spatial Resolution | #Total Captions |
|---|---|---|---|---|
| NWPU-Captions | 31,500 | 256 × 256 | ∼30–0.2 m | 157,500 |
| RSICD | 10,921 | 224 × 224 | various resolutions | 54,605 |
| Sydney-Captions | 613 | 500 × 500 | 0.5 m | 3,065 |
| UCM-Captions | 2,100 | 256 × 256 | ∼0.3 m | 10,500 |
| Cap-4 | 45,134 | 224 × 224 | various resolutions | 225,670 |

Cap-4 corresponds to the combination of the four datasets above.
Bias, Risks, and Limitations
The model may inherit the biases and limitations from the original CLIP weights.
While trained with Remote Sensing data, it may still be of limited use for downstream applications in this domain, due to the nature of the training data (low-resolution images, and descriptions that are too short, contain mistakes, or are overly generic).
How to Get Started with the Model
Use the typical Hugging Face interface for CLIP models: https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPModel
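A minimal usage sketch with the `transformers` library is shown below. The repository id `joaodaniel/rs-clip-cap4` is taken from this model page; the image path and candidate captions are placeholders, and if the processor configuration is not included in this repository it can be loaded from `openai/clip-vit-large-patch14` instead:

```python
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Load the finetuned model and its processor from the Hub.
model = CLIPModel.from_pretrained("joaodaniel/rs-clip-cap4")
processor = CLIPProcessor.from_pretrained("joaodaniel/rs-clip-cap4")  # assumption: processor files are in this repo

# Placeholder image and candidate captions for zero-shot classification/retrieval.
image = Image.open("aerial_scene.png")
texts = ["an aerial photo of an airport", "a satellite image of a dense forest"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-to-text similarity scores, converted into probabilities over the captions.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```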
Citation
This model was developed in the context of the following work. If you find it useful, please cite:
@article{silva2024large,
  title={Large language models for captioning and retrieving remote sensing images},
  author={Silva, Jo{\~a}o Daniel and Magalh{\~a}es, Jo{\~a}o and Tuia, Devis and Martins, Bruno},
  journal={arXiv preprint arXiv:2402.06475},
  year={2024}
}
Model Card Authors
João Daniel Silva - https://github.com/DannielSilva