---
library_name: transformers
language:
- en
base_model:
- openai/clip-vit-large-patch14
---
# CLIP ViT-Large finetuned with Remote Sensing data
This is a large-sized CLIP model finetuned with remote sensing data, namely image captioning datasets.
The original weights are from [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14).
## Model Details
### Model Description
<!-- Provide a longer summary of what this model is. -->
This model was finetuned with the contrastive learning objective typically used to train CLIP models.
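For reference, a minimal sketch of that objective: a symmetric cross-entropy loss over the cosine similarities of matched image-caption pairs in a batch. This is illustrative only, not the exact training code used for this model.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, logit_scale):
    # Normalize embeddings so dot products are cosine similarities.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise similarity matrix, scaled by the learned temperature.
    logits_per_image = logit_scale * image_embeds @ text_embeds.t()
    logits_per_text = logits_per_image.t()

    # Matching image-caption pairs lie on the diagonal of the batch.
    targets = torch.arange(image_embeds.size(0), device=image_embeds.device)

    # Symmetric cross-entropy over image-to-text and text-to-image directions.
    loss_i = F.cross_entropy(logits_per_image, targets)
    loss_t = F.cross_entropy(logits_per_text, targets)
    return (loss_i + loss_t) / 2
```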
Training Data:
| Dataset | #Images | Image Size | Spatial Resolution | #Total Captions |
| :---: |:---: |:---: |:---: |:---: |
| NWPU-Captions | 31,500 | 256 × 256 | ∼30-0.2m | 157,500 |
| RSICD | 10,921 | 224 × 224 | different resolutions | 54,605 |
| Sydney-Captions | 613 | 500 × 500 | 0.5m | 3,065 |
| UCM-Captions | 2,100 | 256 × 256 | ∼0.3m | 10,500 |
| Cap-4 | 45,134 | 224 × 224 | different resolutions | 225,670 |
## Bias, Risks, and Limitations
<!-- This section is meant to convey both technical and sociotechnical limitations. -->
The model may inherit biases and limitations from the original CLIP weights.
Although it was trained with remote sensing data, it can still be of limited use for downstream applications in this domain, due to the nature of the training data: the images have low resolution, and the captions are often short, overly generic, or contain mistakes.
## How to Get Started with the Model
Use the standard Hugging Face `transformers` interface for CLIP models: https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPModel
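A minimal usage sketch with the standard `CLIPModel`/`CLIPProcessor` API; the repository id and the image URL below are placeholders, so replace them with this model's Hub id and your own remote sensing image.

```python
from PIL import Image
import requests
from transformers import CLIPModel, CLIPProcessor

# Placeholder: replace with this model's repository id on the Hugging Face Hub.
model_id = "<this-model-repo-id>"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

# Example: zero-shot matching of candidate captions against an image.
url = "https://example.com/aerial_scene.jpg"  # placeholder image URL
image = Image.open(requests.get(url, stream=True).raw)
captions = ["an aerial view of an airport", "a satellite image of a forest"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # caption probabilities for the image
print(probs)
```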
## Citation
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
This model was developed in the context of the following work. If you find it useful, please cite:
```bibtex
@article{silva2024large,
  title={Large language models for captioning and retrieving remote sensing images},
  author={Silva, Jo{\~a}o Daniel and Magalh{\~a}es, Jo{\~a}o and Tuia, Devis and Martins, Bruno},
  journal={arXiv preprint arXiv:2402.06475},
  year={2024}
}
```
## Model Card Authors
João Daniel Silva - https://github.com/DannielSilva