---
library_name: transformers
language:
- en
base_model:
- openai/clip-vit-large-patch14
---

# CLIP ViT-Large finetuned with Remote Sensing data

This is a Large-sized CLIP model finetuned with Remote Sensing data, namely image captioning datasets.

The original weights are from [openai/clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14).

## Model Details

### Model Description

<!-- Provide a longer summary of what this model is. -->

This model was finetuned with the contrastive learning objective typically used to train CLIP models.
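
For reference, the sketch below illustrates this objective (the symmetric image-text InfoNCE loss used by CLIP). The function name and signature are illustrative and not part of this repository.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds: torch.Tensor,
                          text_embeds: torch.Tensor,
                          logit_scale: torch.Tensor) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # Cosine similarities scaled by the learned temperature.
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits_per_image = logit_scale * image_embeds @ text_embeds.t()
    logits_per_text = logits_per_image.t()

    # Matching image-text pairs lie on the diagonal of the similarity matrix.
    targets = torch.arange(image_embeds.size(0), device=image_embeds.device)
    loss_images = F.cross_entropy(logits_per_image, targets)
    loss_texts = F.cross_entropy(logits_per_text, targets)
    return (loss_images + loss_texts) / 2
```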

Training Data:

| Dataset | #Images | Image Size | Spatial Resolution | #Total Captions |
| :---: |:---: |:---: |:---: |:---: |
| NWPU-Captions | 31,500 | 256 × 256 | ∼30-0.2m | 157,500 |
| RSICD | 10,921 | 224 × 224 | different resolutions | 54,605 |
| Sydney-Captions | 613 | 500 × 500 | 0.5m | 3,065 |
| UCM-Captions | 2,100 | 256 × 256 | ∼0.3m | 10,500 |
| Cap-4 | 45,134 | 224 × 224 | different resolutions | 225,670 |
 

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

The model may inherit the biases and limitations of the original CLIP weights.

Although trained with Remote Sensing data, the model can still be limited for downstream applications in this domain, due to the nature of the training data (low-resolution images, and captions that are too short, contain mistakes, or are overly generic).


## How to Get Started with the Model

The model can be used through the standard Hugging Face interface for CLIP models: https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPModel
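
A minimal usage sketch is given below. The repository id is a placeholder to be replaced with the actual Hub id of this model, and the image URL and candidate captions are only illustrative.

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder: replace with the actual Hub id of this model.
model_id = "<this-model-repo-id>"

model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

# Illustrative image and candidate remote sensing captions.
url = "https://example.com/aerial_scene.jpg"  # replace with your own image
image = Image.open(requests.get(url, stream=True).raw)
texts = ["an aerial photo of an airport", "an aerial photo of a dense forest"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity logits, converted to probabilities over the captions.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```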

## Citation

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
This model was developed in the context of the following work. If you find it useful, please cite:

```bibtex
@article{silva2024large,
  title={Large language models for captioning and retrieving remote sensing images},
  author={Silva, Jo{\~a}o Daniel and Magalh{\~a}es, Jo{\~a}o and Tuia, Devis and Martins, Bruno},
  journal={arXiv preprint arXiv:2402.06475},
  year={2024}
}
```

## Model Card Authors

João Daniel Silva - https://github.com/DannielSilva