CLIP-RS: Vision-Language Pre-training with Data Purification for Remote Sensing
CLIP-RS is a pre-trained vision-language model based on CLIP (Contrastive Language-Image Pre-training) and tailored for remote sensing applications. It is trained on a large-scale dataset of 10 million remote sensing image-text pairs, providing strong perception capabilities for remote sensing image tasks.
Paper
For a detailed explanation of CLIP-RS, refer to the following paper:
Remote Sensing Semantic Segmentation Quality Assessment based on Vision Language Model
Introduction
CLIP-RS builds upon the CLIP framework with an adaptation specifically tailored to remote sensing imagery. While CLIP excels at understanding general semantic content, it struggles in the remote sensing domain due to the lack of sufficient high-quality training data. To address this, we have constructed a large-scale dataset containing 10 million remote sensing image-text pairs. This dataset is carefully refined using a semantic-similarity strategy to eliminate low-quality captions, ensuring that the model learns high-quality semantic features. This high-quality, large-scale pre-training improves the model's semantic perception and contextual understanding of remote sensing images, making it a valuable tool for various geospatial analysis tasks.
The CLIP-RS construction pipeline can be summarized as follows:
1. Data Collection
The training data is sourced from two types of datasets:
- High-quality captions: Approximately 1.5 million images from datasets like SkyScript, where each image is paired with a carefully generated description.
- Coarse semantic labels: 8.5 million images paired with captions of varying quality, ranging from well-defined descriptions to noisy and less relevant text.
2. Data Filtering
To refine the coarse dataset, we propose a data filtering strategy using a CLIP-based model, CLIP_Sem, which is pre-trained on the high-quality captions so that only semantically accurate image-text pairs are retained. The similarity score (SS) of each image-text pair is computed, and pairs whose captions score low are flagged as low quality.
Figure 1: Data Refinement Process of the CLIP-RS Dataset. Left: Workflow for filtering and refining low-quality captions. Right: Examples of low-quality captions and their refined versions.
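A minimal sketch of this filtering step is shown below, assuming CLIP_Sem has been exported as a Hugging Face CLIP-compatible checkpoint; the checkpoint path and the SS threshold are illustrative placeholders, not the values used in the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

CLIP_SEM_PATH = "path/to/clip_sem"   # hypothetical checkpoint location
SS_THRESHOLD = 0.25                  # assumed cut-off; the paper defines its own

model = CLIPModel.from_pretrained(CLIP_SEM_PATH).eval()
processor = CLIPProcessor.from_pretrained(CLIP_SEM_PATH)

@torch.no_grad()
def similarity_score(image_path: str, caption: str) -> float:
    """Cosine similarity between one image and its caption."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[caption], images=[image],
                       return_tensors="pt", padding=True, truncation=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item())

def filter_pairs(pairs):
    """Keep pairs whose similarity score (SS) clears the threshold;
    route the rest to the caption-refinement stage."""
    kept, to_refine = [], []
    for image_path, caption in pairs:
        ss = similarity_score(image_path, caption)
        (kept if ss >= SS_THRESHOLD else to_refine).append((image_path, caption, ss))
    return kept, to_refine
```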
3. Data Refinement
The captions flagged as low quality in the filtering step are refined using GeoChat, a remote-sensing-specific multimodal language model. GeoChat generates more accurate and detailed captions, ensuring that each image is described with high semantic relevance. This step significantly improves dataset quality by removing noise and inaccuracies.
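A minimal sketch of the refinement loop is given below; `geochat_caption` and the prompt are hypothetical stand-ins for actual GeoChat inference, which is not shown here.

```python
REFINE_PROMPT = "Describe this remote sensing image in detail."  # assumed prompt, not from the paper

def refine_captions(to_refine, geochat_caption):
    """Replace flagged low-quality captions with GeoChat-generated ones.

    `geochat_caption(image_path, prompt)` is a hypothetical helper wrapping
    GeoChat inference; `to_refine` holds (image_path, old_caption, ss) tuples
    produced by the filtering step above.
    """
    refined = []
    for image_path, _old_caption, _ss in to_refine:
        refined.append((image_path, geochat_caption(image_path, REFINE_PROMPT)))
    return refined
```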
4. Vision-Language Pre-training
Once the data is purified, CLIP-RS is fine-tuned on this high-quality dataset using the CLIP framework. The model is continually pre-trained to specialize in remote sensing imagery, allowing it to effectively capture both visual and textual semantics.
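A minimal continual pre-training sketch under these assumptions is shown below: the model is initialized from a general-domain CLIP ViT-L/14 checkpoint and optimized with the standard symmetric image-text contrastive loss. The `purified_loader` argument and the hyperparameters are illustrative, not the paper's training recipe.

```python
import torch
from transformers import CLIPModel

def continual_pretrain(purified_loader, epochs: int = 1, lr: float = 1e-6):
    """Continue CLIP pre-training on the purified remote sensing pairs.

    `purified_loader` is assumed to yield CLIPProcessor-style batches with
    `input_ids`, `attention_mask`, and `pixel_values` tensors.
    """
    model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")  # general-domain init
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.1)
    model.train()
    for _ in range(epochs):
        for batch in purified_loader:
            outputs = model(
                input_ids=batch["input_ids"],
                attention_mask=batch["attention_mask"],
                pixel_values=batch["pixel_values"],
                return_loss=True,   # symmetric image-text contrastive (InfoNCE) loss
            )
            outputs.loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```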
Model Downloads
You can download the pre-trained CLIP-RS ViT-L/14 model from the following link:
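A minimal usage sketch is shown below, assuming the downloaded checkpoint is compatible with the Hugging Face CLIP interface; the local path and example inputs are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("path/to/clip-rs-vit-l-14").eval()  # placeholder path
processor = CLIPProcessor.from_pretrained("path/to/clip-rs-vit-l-14")

image = Image.open("scene.tif").convert("RGB")  # example remote sensing image
texts = ["an aerial image of an airport", "a satellite image of farmland"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    logits = model(**inputs).logits_per_image   # image-text similarity logits
print(logits.softmax(dim=-1))                   # probabilities over the candidate texts
```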
Evaluation Results
CLIP-RS has been applied to remote sensing semantic segmentation quality assessment, achieving state-of-the-art performance. The table below reports the accuracy of predicting the best semantic segmentation method among eight candidate methods on several semantic segmentation datasets:
| Model | RS-SQED | ISPRS | LoveDA | UAVid | FloodNet |
|---|---|---|---|---|---|
| RemoteCLIP | 0.6906 | 0.3252 | 0.7057 | 0.6125 | 0.8302 |
| CLIP-RS (1.5M) | 0.7276 | 0.3839 | 0.7117 | 0.8313 | 0.8333 |
| CLIP-RS (10M) | 0.7328 | 0.3883 | 0.7748 | 0.8375 | 0.8069 |
For more detailed results, refer to the Evaluation Section of the Paper.
Citation
If you use CLIP-RS in your research, please cite our paper:
@misc{shi2025remotesensingsemanticsegmentation,
      title={Remote Sensing Semantic Segmentation Quality Assessment based on Vision Language Model},
      author={Huiying Shi and Zhihong Tan and Zhihan Zhang and Hongchen Wei and Yaosi Hu and Yingxue Zhang and Zhenzhong Chen},
      year={2025},
      eprint={2502.13990},
      archivePrefix={arXiv},
      primaryClass={eess.IV},
      url={https://arxiv.org/abs/2502.13990},
}