hy1111/CLIP-RS


hy1111 committed 18 days ago

Commit 3243826 (verified) · 1 Parent(s): 8936941

Upload 3 files


Files changed (4)

.gitattributes +2 -1
README.md +71 -0
figure/CLIP-RS.png +3 -0
figure/newversion.png +3 -0

.gitattributes CHANGED

@@ -33,4 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
-*.pt filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -textfigure/CLIP-RS.png filter=lfs diff=lfs merge=lfs -text
+figure/newversion.png filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,71 @@

+# CLIP-RS: Vision-Language Pre-training with Data Purification for Remote Sensing
+![CLIP-RS Logo](figure/CLIP-RS.png)
+CLIP-RS is a pre-trained model based on CLIP (Contrastive Language-Image Pre-training) and tailored to remote sensing applications. It is trained on a large-scale dataset of 10 million remote sensing image-text pairs, giving it strong perception capabilities for remote sensing image tasks.
+## Paper
+For a detailed explanation of CLIP-RS, refer to the following paper:
+[Remote Sensing Semantic Segmentation Quality Assessment based on Vision Language Model](https://arxiv.org/abs/2502.13990)
+## Introduction
+CLIP-RS builds on the CLIP framework and adapts it to remote sensing imagery. While CLIP excels at understanding general semantic content, it struggles in the remote sensing domain because high-quality training data is scarce. To address this, we constructed a large-scale dataset of 10 million remote sensing image-text pairs. The dataset is refined with a semantic-similarity strategy that eliminates low-quality captions, so that the model learns high-quality semantic features. This high-quality, large-scale pre-training improves the model’s semantic perception and contextual understanding of remote sensing images, making it a valuable tool for various geospatial analysis tasks.
+## Model Training
+The CLIP-RS construction pipeline can be summarized as follows:
+### 1. Data Collection
+The training data is sourced from two types of datasets:
+- **High-quality captions**: Approximately 1.5 million images from datasets like SkyScript, where each image is paired with a carefully generated description.
+- **Coarse semantic labels**: 8.5 million images paired with captions of varying quality, ranging from well-defined descriptions to noisy and less relevant text.
+### 2. Data Filtering
+To refine the coarse dataset, we propose a data filtering strategy based on a CLIP-based model, $\text{CLIP}_{\text{Sem}}$, which is pre-trained on the high-quality captions so that only semantically accurate image-text pairs are retained. A similarity score (SS) is computed for each image-text pair, and captions with low similarity are discarded.
+![Data Purification Process](figure/newversion.png)
+*Figure 1: Data Refinement Process of the CLIP-RS Dataset. Left: Workflow for filtering and refining low-quality captions. Right: Examples of low-quality captions and their refined versions.*
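The filtering step above can be sketched as a threshold on the cosine similarity between paired image and text embeddings. This is only an illustration under assumptions: the embeddings are taken to come from the image and text encoders of $\text{CLIP}_{\text{Sem}}$, and the function names and the threshold value are hypothetical, since the README does not state how the cutoff is chosen.

```python
import numpy as np


def cosine_similarity_scores(image_emb, text_emb):
    """Row-wise cosine similarity between paired image and text embeddings."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return np.sum(image_emb * text_emb, axis=1)


def filter_pairs(image_emb, text_emb, threshold):
    """Return a boolean keep-mask and the similarity score of every pair.

    `threshold` is a hypothetical cutoff; the source does not give its value.
    """
    scores = cosine_similarity_scores(image_emb, text_emb)
    keep = scores >= threshold
    return keep, scores


# Toy 2-D embeddings standing in for real encoder outputs.
image_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
text_emb = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.9]])

keep, scores = filter_pairs(image_emb, text_emb, threshold=0.5)
print(keep)  # the second pair is orthogonal, so it falls below the cutoff
```

Pairs whose mask entry is `False` correspond to the low-similarity captions that the pipeline treats as low quality.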
+### 3. Data Refinement
+The remaining low-quality captions are refined using a remote sensing-specific multimodal language model, GeoChat. GeoChat generates more accurate and detailed captions, ensuring that each image is described with high semantic relevance. This process significantly improves the quality of the dataset by removing noise and inaccuracies.
+### 4. Vision-Language Pre-training
+Once the data is purified, CLIP-RS is fine-tuned on this high-quality dataset using the CLIP framework. The model is continually pre-trained to specialize in remote sensing imagery, allowing it to effectively capture both visual and textual semantics.
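For readers unfamiliar with the CLIP framework, the standard CLIP objective is a symmetric contrastive loss over a batch of image-text pairs. The NumPy sketch below shows that general objective. It is an assumption that CLIP-RS uses it unchanged, and the temperature is fixed here at CLIP's initial value of 0.07, whereas CLIP learns it during training.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature
    idx = np.arange(logits.shape[0])
    # Matching pairs sit on the diagonal of the similarity matrix.
    loss_image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_image_to_text + loss_text_to_image)


# Toy batch: perfectly aligned pairs versus deliberately mismatched pairs.
embeddings = np.eye(4)
aligned_loss = clip_contrastive_loss(embeddings, embeddings)
shuffled_loss = clip_contrastive_loss(embeddings, np.roll(embeddings, 1, axis=0))
print(aligned_loss < shuffled_loss)
```

Aligned pairs give a loss near zero, while mismatched pairs are penalized heavily, which is why noisy captions in the training set hurt this objective.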
+## Model Downloads
+You can download the pre-trained CLIP-RS model from the following link:
+- [Download CLIP-RS Model](https://huggingface.co/hy1111/CLIP-RS/resolve/main/CLIP_RS.pt?download=true)
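The checkpoint can also be fetched from a script. The sketch below rebuilds the link above from its parts and downloads it with the Python standard library. The helper names are hypothetical, and the README does not say which CLIP architecture the checkpoint targets, so loading the weights is left out.

```python
from urllib.request import urlretrieve


def build_resolve_url(repo_id, filename, revision="main"):
    """Build a Hugging Face 'resolve' URL for a file in a model repository."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"


def download_checkpoint(repo_id, filename, destination):
    """Download the file to `destination` (requires network access)."""
    urlretrieve(build_resolve_url(repo_id, filename), destination)
    return destination


CHECKPOINT_URL = build_resolve_url("hy1111/CLIP-RS", "CLIP_RS.pt")
print(CHECKPOINT_URL)

# Uncomment to download the checkpoint to the current directory:
# download_checkpoint("hy1111/CLIP-RS", "CLIP_RS.pt", "CLIP_RS.pt")
```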
+## Evaluation Results
+CLIP-RS has been applied to remote sensing semantic segmentation quality assessment, achieving state-of-the-art performance. The table below reports the accuracy of predicting the best of eight candidate semantic segmentation methods on each semantic segmentation dataset:
+| Model | RS-SQED | ISPRS | LoveDA | UAVid | FloodNet |
+| ---- | ---- | ---- | ---- | ---- | ---- |
+| RemoteCLIP | 0.6906 | 0.3252 | 0.7057 | 0.6125 | 0.8302 |
+| CLIP-RS (1.5M) | 0.7276 | 0.3839 | 0.7117 | 0.8313 | 0.8333 |
+| CLIP-RS (10M) | 0.7328 | 0.3883 | 0.7748 | 0.8375 | 0.8069 |
+For more detailed results, refer to the [Evaluation Section of the Paper](https://arxiv.org/abs/2502.13990).
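One plausible reading of the accuracy metric in the table is a top-1 match: for each image, the method with the highest predicted quality score is compared against the method with the highest reference quality. The sketch below implements that reading on toy data. The function name, the array layout (images by methods), and the tie handling are assumptions, and the paper's exact protocol may differ.

```python
import numpy as np


def best_method_accuracy(predicted_scores, reference_scores):
    """Fraction of images where the predicted best method matches the reference.

    Both arrays have shape (num_images, num_methods). Ties resolve to the
    lowest method index, which is np.argmax's behavior.
    """
    predicted_best = np.argmax(predicted_scores, axis=1)
    reference_best = np.argmax(reference_scores, axis=1)
    return float(np.mean(predicted_best == reference_best))


# Toy data: 4 images, 8 candidate segmentation methods.
rows = np.arange(4)
reference = np.zeros((4, 8))
reference[rows, [2, 5, 0, 7]] = 1.0
predicted = np.zeros((4, 8))
predicted[rows, [2, 5, 0, 1]] = 1.0  # the last image is predicted wrongly

accuracy = best_method_accuracy(predicted, reference)
print(accuracy)
```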
+## Citation
+If you use CLIP-RS in your research, please cite our paper:
+```
+@misc{shi2025remotesensingsemanticsegmentation,
+      title={Remote Sensing Semantic Segmentation Quality Assessment based on Vision Language Model},
+      author={Huiying Shi and Zhihong Tan and Zhihan Zhang and Hongchen Wei and Yaosi Hu and Yingxue Zhang and Zhenzhong Chen},
+      year={2025},
+      eprint={2502.13990},
+      archivePrefix={arXiv},
+      primaryClass={eess.IV},
+      url={https://arxiv.org/abs/2502.13990},
+}
+```

figure/CLIP-RS.png ADDED

Git LFS Details

SHA256: 7be7b930c5883dd16bb333785bf4de66b69fd19c806744cee1a49df1f454d92d
Pointer size: 132 Bytes
Size of remote file: 1.14 MB

figure/newversion.png ADDED

Git LFS Details

SHA256: b5cda1742d75d6a37dea025c119ddb6ce16528965e4984a526f80c02606de232
Pointer size: 131 Bytes
Size of remote file: 431 kB