---
tags:
- vision
---

# Model Card: clip-rsicd

## Model Details

This model is a fine-tuned [CLIP by OpenAI](https://huggingface.co/openai/clip-vit-base-patch32). It is designed to improve zero-shot image classification, text-to-image retrieval, and image-to-image retrieval, specifically on remote sensing images.

### Model Date

July 2021

### Model Type

The base model uses a ViT-B/32 Transformer architecture as an image encoder and a masked self-attention Transformer as a text encoder. These encoders are trained to maximize the similarity of (image, text) pairs via a contrastive loss.
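
To illustrate the dual-encoder design and the retrieval use cases it enables, here is a minimal sketch (using the `flax-community/clip-rsicd-v2` checkpoint described by this card, with an illustrative query; this is not the evaluation code from the repo) that embeds a text query and an image separately and compares them with cosine similarity:

```python
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("flax-community/clip-rsicd-v2")
processor = CLIPProcessor.from_pretrained("flax-community/clip-rsicd-v2")

# Text encoder (masked self-attention Transformer) -> text embedding
text_inputs = processor(text=["an aerial photo of a stadium"], return_tensors="pt", padding=True)
text_emb = model.get_text_features(**text_inputs)

# Image encoder (ViT-B/32) -> image embedding
url = "https://raw.githubusercontent.com/arampacha/CLIP-rsicd/master/data/stadium_1.jpg"
image = Image.open(requests.get(url, stream=True).raw)
image_inputs = processor(images=image, return_tensors="pt")
image_emb = model.get_image_features(**image_inputs)

# Retrieval ranks candidate images (or texts) by this similarity score
similarity = torch.nn.functional.cosine_similarity(text_emb, image_emb)
print(similarity.item())
```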

### Model Version

We release several checkpoints for the `clip-rsicd` model. Refer to [our GitHub repo](https://github.com/arampacha/CLIP-rsicd) for the zero-shot classification performance of each of them.
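
Assuming each released checkpoint is available under its own Hub identifier (see the GitHub repo for the exact list), switching between them is just a matter of changing the name passed to `from_pretrained`; the identifier below is the checkpoint this card describes:

```python
from transformers import CLIPModel, CLIPProcessor

# Swap in any of the released checkpoint identifiers listed in the GitHub repo
checkpoint = "flax-community/clip-rsicd-v2"
model = CLIPModel.from_pretrained(checkpoint)
processor = CLIPProcessor.from_pretrained(checkpoint)
```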

### Training

To reproduce the fine-tuning procedure, one can use the released [script](https://github.com/arampacha/CLIP-rsicd/blob/master/run_clip_flax_tv.py).
The model was trained with a batch size of 1024, using the Adafactor optimizer with linear warmup and decay and a peak learning rate of 1e-4, on a single TPU v3-8.
The full log of the training run can be found on [WandB](https://wandb.ai/wandb/hf-flax-clip-rsicd/runs/2dj1exsw).
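
For orientation, the reported optimizer settings correspond roughly to the following `optax` setup. This is a simplified sketch, not the exact configuration from the training script; in particular, the warmup and total step counts are placeholders chosen for illustration.

```python
import optax

peak_lr = 1e-4        # peak learning rate reported above
warmup_steps = 500    # illustrative value, not the one used in training
total_steps = 5000    # illustrative value, not the one used in training

# Linear warmup to the peak learning rate, then linear decay back to zero
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(init_value=0.0, end_value=peak_lr, transition_steps=warmup_steps),
        optax.linear_schedule(init_value=peak_lr, end_value=0.0, transition_steps=total_steps - warmup_steps),
    ],
    boundaries=[warmup_steps],
)

# Adafactor driven by the schedule; each update consumes a batch of 1024 (image, text) pairs
optimizer = optax.adafactor(learning_rate=schedule)
```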

### Demo

Check out the model's text-to-image and image-to-image retrieval capabilities using [this demo](https://huggingface.co/spaces/sujitpal/clip-rsicd-demo).

### Documents

- [Fine-tuning CLIP on RSICD with HuggingFace and flax/jax on colab using TPU](https://colab.research.google.com/github/arampacha/CLIP-rsicd/blob/master/nbs/Fine_tuning_CLIP_with_HF_on_TPU.ipynb)

### Use with Transformers

```python
from PIL import Image
import requests

from transformers import CLIPProcessor, CLIPModel

# Load the fine-tuned model and its processor
model = CLIPModel.from_pretrained("flax-community/clip-rsicd-v2")
processor = CLIPProcessor.from_pretrained("flax-community/clip-rsicd-v2")

# Fetch an example remote sensing image
url = "https://raw.githubusercontent.com/arampacha/CLIP-rsicd/master/data/stadium_1.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Candidate labels, wrapped in a simple prompt template
labels = ["residential area", "playground", "stadium", "forest", "airport"]
inputs = processor(text=[f"a photo of a {l}" for l in labels], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  # we can take the softmax to get the label probabilities
for l, p in zip(labels, probs[0]):
    print(f"{l:<16} {p:.4f}")
```
[Try it on colab](https://colab.research.google.com/github/arampacha/CLIP-rsicd/blob/master/nbs/clip_rsicd_zero_shot.ipynb)

## Model Use

### Intended Use

The model is intended as a research output for research communities. We hope that this model will enable researchers to better understand and explore zero-shot, arbitrary image classification. We also hope it can be used for interdisciplinary studies of the potential impact of such models - the CLIP paper includes a discussion of potential downstream impacts to provide an example for this sort of analysis.

#### Primary intended uses

The primary intended users of these models are AI researchers.

We primarily imagine the model will be used by researchers to better understand robustness, generalization, and other capabilities, biases, and constraints of computer vision models.

## Data

The model was trained on the publicly available remote sensing image captioning datasets [RSICD](https://github.com/201528014227051/RSICD_optimal), [UCM](https://mega.nz/folder/wCpSzSoS#RXzIlrv--TDt3ENZdKN8JA) and [Sydney](https://mega.nz/folder/pG4yTYYA#4c4buNFLibryZnlujsrwEQ).

## Performance and Limitations

### Performance

Zero-shot classification accuracy, reported as the fraction of images for which the correct label appears among the top-k predictions:

| Model name                 | k=1       | k=3       | k=5       | k=10      |
| -------------------------- | --------- | --------- | --------- | --------- |
| original CLIP              | 0.572     | 0.745     | 0.837     | 0.939     |
| clip-rsicd-v2 (this model) | **0.883** | **0.968** | **0.982** | **0.998** |
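
As a rough sketch of how such a top-k score can be computed with this model (assuming a list of `(image, true_label)` pairs and a fixed label set; the actual evaluation code is in the linked GitHub repo), one could do:

```python
import torch

def top_k_accuracy(model, processor, samples, labels, k=3):
    """samples: list of (PIL image, true label) pairs; labels: candidate label strings."""
    prompts = [f"a photo of a {l}" for l in labels]
    hits = 0
    for image, true_label in samples:
        inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
        with torch.no_grad():
            logits = model(**inputs).logits_per_image[0]  # similarity to each prompt
        top_k_labels = [labels[i] for i in logits.topk(k).indices.tolist()]
        hits += int(true_label in top_k_labels)
    return hits / len(samples)
```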

### Limitations

The model is fine-tuned on remote sensing data, but it may retain some of the biases and limitations of the original CLIP model. Refer to the [CLIP model card](https://huggingface.co/openai/clip-vit-base-patch32#limitations) for details.