|
--- |
|
license: apache-2.0 |
|
tags: |
|
- Vision |
|
- Multi-modal
|
- Vision-Language |
|
- Remote-sensing |
|
widget: |
|
- src: >- |
|
https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png |
|
candidate_labels: playing music, playing sports |
|
example_title: Cat & Dog |
|
--- |
|
|
|
# Git-RSCLIP |
|
|
|
[Git-RSCLIP](https://arxiv.org/pdf/2501.00895) is pre-trained on the Git-10M dataset (a global-scale remote sensing image-text pair dataset consisting of 10 million image-text pairs) at a resolution of 256x256. It was first released in [this repository](https://github.com/chen-yang-liu/Text2Earth) and uses an architecture similar to [google/siglip-large-patch16-256](https://huggingface.co/google/siglip-large-patch16-256).
|
|
|
|
|
## Intended uses & limitations |
|
|
|
You can use the raw model for tasks like zero-shot image classification and image-text retrieval. |
|
|
|
|
|
### How to use |
|
|
|
#### Use Git-RSCLIP to get image features |
|
|
|
```python |
|
from PIL import Image |
|
import requests |
|
from transformers import AutoProcessor, AutoModel |
|
import torch |
|
|
|
model = AutoModel.from_pretrained("lcybuaa/Git-RSCLIP") |
|
processor = AutoProcessor.from_pretrained("lcybuaa/Git-RSCLIP") |
|
|
|
url = "https://github.com/Chen-Yang-Liu/PromptCC/blob/main/Example/B/train_000051.png?raw=true" |
|
image = Image.open(requests.get(url, stream=True).raw) |
|
|
|
inputs = processor(images=image, return_tensors="pt") |
|
|
|
with torch.no_grad(): |
|
image_features = model.get_image_features(**inputs) |
|
``` |
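

#### Use Git-RSCLIP for image-text retrieval

The image-text retrieval use case mentioned above can be built from the same image and text embeddings. The snippet below is a minimal sketch, not part of the original examples: the candidate captions and the cosine-similarity ranking are illustrative, and it assumes the model exposes `get_text_features` alongside `get_image_features` (as SigLIP-style models do).

```python
from PIL import Image
import requests
import torch
from transformers import AutoProcessor, AutoModel

model = AutoModel.from_pretrained("lcybuaa/Git-RSCLIP")
processor = AutoProcessor.from_pretrained("lcybuaa/Git-RSCLIP")

url = "https://github.com/Chen-Yang-Liu/PromptCC/blob/main/Example/B/train_000051.png?raw=true"
image = Image.open(requests.get(url, stream=True).raw)

# Illustrative candidate captions (not from the original card)
captions = [
    "a remote sensing image of river",
    "a remote sensing image of houses and roads",
    "a remote sensing image of farmland",
]

image_inputs = processor(images=image, return_tensors="pt")
text_inputs = processor(text=captions, padding="max_length", return_tensors="pt")

with torch.no_grad():
    image_features = model.get_image_features(**image_inputs)
    text_features = model.get_text_features(**text_inputs)

# L2-normalize and rank captions by cosine similarity to the image
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
similarity = image_features @ text_features.T  # shape: (1, num_captions)

best = similarity.argmax(dim=-1).item()
print(f"best matching caption: '{captions[best]}' (cosine similarity {similarity[0, best].item():.3f})")
```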
|
|
|
|
|
#### Use Git-RSCLIP for zero-shot image classification
|
|
|
```python |
|
from PIL import Image |
|
import requests |
|
from transformers import AutoProcessor, AutoModel |
|
import torch |
|
|
|
model = AutoModel.from_pretrained("lcybuaa/Git-RSCLIP") |
|
processor = AutoProcessor.from_pretrained("lcybuaa/Git-RSCLIP") |
|
|
|
url = "https://github.com/Chen-Yang-Liu/PromptCC/blob/main/Example/B/train_000051.png?raw=true" |
|
image = Image.open(requests.get(url, stream=True).raw) |
|
|
|
texts = ["a remote sensing image of river", "a remote sensing image of houses and roads"] |
|
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt") |
|
|
|
with torch.no_grad(): |
|
outputs = model(**inputs) |
|
|
|
logits_per_image = outputs.logits_per_image

probs = torch.sigmoid(logits_per_image)  # SigLIP-style sigmoid: independent probability per label

top5_indices = torch.argsort(probs, descending=True)[:, :5].cpu().numpy()

top1_index = top5_indices[0, 0]

print(f"image 0 is '{texts[top1_index]}' with probability {probs[0, top1_index].item():.1%}")
|
``` |
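
Because the logits are mapped through a sigmoid rather than a softmax, each candidate label receives an independent probability and the values do not need to sum to 1. A quick way to inspect every label, continuing from the variables in the example above:

```python
# Print the sigmoid probability for each candidate label (reuses `texts` and `probs` from above)
for label, p in zip(texts, probs[0].tolist()):
    print(f"{p:.1%} that image 0 is '{label}'")
```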
|
|
|
For more code examples, refer to the [SigLIP documentation](https://huggingface.co/docs/transformers/model_doc/siglip).
|
|
|
|
|
## Training procedure |
|
|
|
### Training data |
|
|
|
Git-RSCLIP is pre-trained on the Git-10M dataset (a global-scale remote sensing image-text pair dataset, consisting of 10 million image-text pairs) [(Liu et al., 2024)](https://github.com/chen-yang-liu/Text2Earth). |
|
|
|
### Preprocessing |
|
|
|
Images are resized/rescaled to the same resolution (256x256) and normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5). |
|
|
|
Texts are tokenized and padded to the same length (64 tokens). |
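

The `AutoProcessor` used in the examples above applies this preprocessing automatically. The short check below is illustrative only; it reuses the sample image from the earlier snippets and verifies the shapes implied by the description: 256x256 RGB pixel values and text padded to 64 tokens.

```python
from PIL import Image
import requests
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("lcybuaa/Git-RSCLIP")

url = "https://github.com/Chen-Yang-Liu/PromptCC/blob/main/Example/B/train_000051.png?raw=true"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(
    text=["a remote sensing image of river"],
    images=image,
    padding="max_length",  # pad the text to the fixed sequence length
    max_length=64,         # 64 tokens, as described above
    return_tensors="pt",
)

print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 256, 256]) -> 256x256 RGB, normalized to roughly [-1, 1]
print(inputs["input_ids"].shape)     # torch.Size([1, 64]) -> padded to 64 tokens
```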
|
|
|
|
|
## Evaluation results |
|
|
|
Evaluation of Git-RSCLIP compared to other CLIP models is shown below (taken from the paper).
|
|
|
<img src="https://github.com/Chen-Yang-Liu/Text2Earth/blob/main/images/Git-RSCLIP.png?raw=true" |
|
alt="drawing" width="1000"/> |
|
|
|
### BibTeX entry and citation info |
|
|
|
```bibtex |
|
@misc{liu2025text2earthunlockingtextdrivenremote, |
|
title={Text2Earth: Unlocking Text-driven Remote Sensing Image Generation with a Global-Scale Dataset and a Foundation Model}, |
|
author={Chenyang Liu and Keyan Chen and Rui Zhao and Zhengxia Zou and Zhenwei Shi}, |
|
year={2025}, |
|
eprint={2501.00895}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CV}, |
|
url={https://arxiv.org/abs/2501.00895}, |
|
} |
|
``` |