---
license: mit
library_name: transformers
pipeline_tag: image-feature-extraction
---

<h2 align="center"><strong>RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing</strong></h2>

<div align="center">
<h5>
<em>Fengxiang Wang<sup>1</sup>, Hongzhen Wang<sup>2 †</sup>, Yulin Wang<sup>2</sup>, Di Wang<sup>3</sup>, Mingshuo Chen<sup>4</sup>, Haiyan Zhao<sup>2</sup>,<br/> Yangang Sun<sup>2</sup>, Shuo Wang<sup>2</sup>, Long Lan<sup>1</sup>, Wenjing Yang<sup>1 †</sup>, Jing Zhang<sup>3 †</sup> </em>
    <br><br>
       	<sup>1</sup> National University of Defense Technology, China, <sup>2</sup> Tsinghua University, China, <br/> <sup>3</sup> Wuhan University, China, <sup>4</sup> Beijing University of Posts and Telecommunications, China
    </h5>
</div>

<h5 align="center">
<a href="https://huggingface.co/initiacms/RoMA"> <img src="https://img.shields.io/badge/🤗-Checkpoints-9C276A.svg"></a> <a href="https://arxiv.org/abs/2503.10392"> <img src="https://img.shields.io/badge/Arxiv-2503.10392-b31b1b.svg?logo=arXiv"></a>
</h5>

# 📚 Contents

- [News](#🔥news)
- [Abstract](#📄abstract)
- [Overview](#🔍overview)
- [Evaluation Results](#✅evaluation-results)
- [Scaling Behavior](#📈scaling-behavior)
- [Pretraining](#🚀pretraining)
- [Checkpoints](#🎯checkpoints)
- [Citation](#🔗citation)
- [Acknowledgement](#🤝acknowledgement)

# 🔥News

* **[2025.03.13]**  The paper is available on [arXiv](https://arxiv.org/abs/2503.10392).

# 📄Abstract

Recent advances in self-supervised learning for Vision Transformers (ViTs) have fueled breakthroughs in remote sensing (RS) foundation models. However, the quadratic complexity of self-attention poses a significant barrier to scalability, particularly for large models and high-resolution images. While the linear-complexity Mamba architecture offers a promising alternative, existing RS applications of Mamba remain limited to supervised tasks on small, domain-specific datasets. To address these challenges, we propose RoMA, a framework that enables scalable self-supervised pretraining of Mamba-based RS foundation models using large-scale, diverse, unlabeled data. 

# 🔍Overview

<figure>
<img src="assets/image-20250311170540530.png">
<figcaption align = "center"><b>Figure 1: Overview of the RoMA Pretraining Pipeline. 
 </b></figcaption>
</figure>

The input image is first divided into patches, and high-value patches are selected for random rotation by the Adaptive Rotation Encoding Strategy. The patches are then tokenized and processed by the Mamba encoder, and the encoded features are trained with autoregressive next-token prediction, with a multi-scale strategy computing losses at several scales for the gradient update. In this way, RoMA adapts the Mamba architecture to remote sensing, turning its encoder into a robust feature extractor for diverse downstream tasks.
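
To make the pipeline concrete, below is a minimal PyTorch sketch of one training step. It is illustrative only, not the authors' implementation: the variance-based patch-value score, the 25% rotation fraction, the GRU stand-in for the Mamba encoder, and the single-scale MSE loss are all simplifying assumptions; the official code implements the actual Adaptive Rotation Encoding Strategy, Mamba blocks, and multi-scale losses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def patchify(img, p=16):
    """(B, C, H, W) -> (B, N, C, p, p) non-overlapping patches."""
    B, C, H, W = img.shape
    x = img.unfold(2, p, p).unfold(3, p, p)            # (B, C, H/p, W/p, p, p)
    return x.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C, p, p)

def adaptive_rotate(patches, frac=0.25):
    """Rotate the highest-'value' patches by a random multiple of 90 degrees.
    Patch variance is a placeholder value score, not the paper's criterion."""
    B, N = patches.shape[:2]
    score = patches.flatten(2).var(dim=-1)             # (B, N)
    k = max(1, int(frac * N))
    idx = score.topk(k, dim=1).indices                 # (B, k)
    out = patches.clone()
    for b in range(B):
        for j, r in zip(idx[b].tolist(), torch.randint(1, 4, (k,)).tolist()):
            out[b, j] = torch.rot90(out[b, j], r, dims=(-2, -1))
    return out

class TinyCausalEncoder(nn.Module):
    """Stand-in for the Mamba encoder: any causal sequence model fits here."""
    def __init__(self, dim):
        super().__init__()
        self.rnn = nn.GRU(dim, dim, num_layers=2, batch_first=True)

    def forward(self, x):
        return self.rnn(x)[0]

def pretrain_step(img, embed, encoder, head, p=16):
    patches = adaptive_rotate(patchify(img, p))        # rotation augmentation
    targets = patches.flatten(2)                       # (B, N, C*p*p)
    feats = encoder(embed(targets))                    # causal features
    pred = head(feats[:, :-1])                         # next-patch prediction
    # Single-scale regression loss; RoMA additionally sums losses computed
    # at multiple scales before the gradient update.
    return F.mse_loss(pred, targets[:, 1:])

dim, p, C = 256, 16, 3
embed, head = nn.Linear(C * p * p, dim), nn.Linear(dim, C * p * p)
encoder = TinyCausalEncoder(dim)
loss = pretrain_step(torch.randn(2, C, 224, 224), embed, encoder, head, p)
loss.backward()
```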

# ✅Evaluation Results

<table border="1" cellpadding="5" cellspacing="0" style="border-collapse: collapse; width: 100%;">
  <caption style="caption-side: top; text-align: center; font-weight: bold; margin-bottom: 10px;">
    Results for scene classification, change detection, and semantic segmentation.<br>
    “TR” represents the ratio of training data.<br>
   <sup>★</sup> indicates results from MA3E and MTP.<br>
      <sup>†</sup> denotes our reproduction with their official code.
  </caption>
  <colgroup>
    <col style="width: 25%;"> <!-- Set first column width -->
    <col> <!-- Auto-width for the other columns -->
    <col>
    <col>
    <col>
    <col>
    <col>
    <col>
  </colgroup>
<thead>
  <tr>
    <th rowspan="3">Methods</th>
    <th rowspan="3">Publication</th>
    <th rowspan="3">Backbone</th>
    <th rowspan="3">Params</th>
    <th colspan="2">Scene Classification</th>
    <th colspan="1">Change Detection</th>
    <th colspan="1">Semantic Segmentation</th>
  </tr>
  <tr>
    <th>AID</th>
    <th>UCM</th>
    <th>OSCD</th>
    <th>SpaceNetv1</th>
  </tr>
  <tr>
    <th>OA(TR=50%)</th>
    <th>OA(TR=50%)</th>
    <th>F1</th>
    <th>mF1</th>
  </tr>
</thead>
  <tbody>
    <tr>
      <td colspan="8" style="text-align: left;"><em style="color: gray;">Natural Image pretraining</em></td>
    </tr>
    <tr>
      <td>MoCo v3<sup>★</sup></td>
      <td>ICCV'21</td>
      <td>ViT-B</td>
      <td>86M</td>
      <td>78.72</td>
      <td>38.34</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <td>DINO<sup>★</sup></td>
      <td>ICCV'21</td>
      <td>ViT-B</td>
      <td>86M</td>
      <td>78.51</td>
      <td>40.04</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <td>MAE<sup>★</sup></td>
      <td>CVPR'22</td>
      <td>ViT-B</td>
      <td>86M</td>
      <td>84.21</td>
      <td>52.75</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <td>SimMIM<sup>★</sup></td>
      <td>CVPR'22</td>
      <td>ViT-B</td>
      <td>86M</td>
      <td>83.19</td>
      <td>51.48</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <td>LoMaR<sup>★</sup></td>
      <td>Arxiv'22</td>
      <td>ViT-B</td>
      <td>86M</td>
      <td>82.26</td>
      <td>51.89</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <td>MixMAE<sup>★</sup></td>
      <td>CVPR'23</td>
      <td>Swin-B/W14</td>
      <td>88M</td>
      <td>81.53</td>
      <td>50.63</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <td>ARM<sup>†</sup></td>
      <td>ICLR'25</td>
      <td>Mamba-B</td>
      <td>85M</td>
      <td>81.14</td>
      <td>50.41</td>
      <td>47.28</td>
      <td>77.89</td>
    </tr>
    <tr>
      <td colspan="8" style="text-align: left;"><em style="color: gray;">RS Image pretraining</em></td>
    </tr>
    <tr>
      <td>SeCo<sup>★</sup></td>
      <td>ICCV'21</td>
      <td>ResNet-50</td>
      <td>25.6M</td>
      <td>78.26</td>
      <td>47.45</td>
      <td>47.67</td>
      <td>77.09</td>
    </tr>
    <tr>
      <td>CACo<sup>★</sup></td>
      <td>CVPR'23</td>
      <td>ResNet-50</td>
      <td>25.6M</td>
      <td>77.81</td>
      <td>40.53</td>
      <td>52.11</td>
      <td>77.94</td>
    </tr>
    <tr>
      <td>SatMAE<sup>★</sup></td>
      <td>NeurIPS'22</td>
      <td>ViT-L</td>
      <td>307M</td>
      <td>55.10</td>
      <td>34.28</td>
      <td>52.76</td>
      <td>78.07</td>
    </tr>
    <tr>
      <td>ScaleMAE<sup>★</sup></td>
      <td>ICCV'23</td>
      <td>ViT-L</td>
      <td>307M</td>
      <td>48.46</td>
      <td>28.19</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <td>GFM<sup>★</sup></td>
      <td>ICCV'23</td>
      <td>Swin-B</td>
      <td>88M</td>
      <td>80.58</td>
      <td>49.73</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <td>RVSA<sup>★</sup></td>
      <td>TGRS'23</td>
      <td>ViT-B+RVSA</td>
      <td>86M</td>
      <td>84.06</td>
      <td>50.86</td>
      <td>50.28</td>
      <td><strong>79.56</strong></td>
    </tr>
    <tr>
      <td>SatMAE++<sup>†</sup></td>
      <td>CVPR'24</td>
      <td>ViT-L</td>
      <td>307M</td>
      <td>85.98</td>
      <td>55.72</td>
      <td>53.10</td>
      <td>79.21</td>
    </tr>
    <tr>
      <td>MA3E<sup>★</sup></td>
      <td>ECCV'24</td>
      <td>ViT-B</td>
      <td>86M</td>
      <td>85.86</td>
      <td>55.69</td>
      <td>-</td>
      <td>-</td>
    </tr>
    <tr>
      <td>RoMA</td>
      <td><a href="https://pan.baidu.com/s/1e7VOvca7894hugM-f2UitQ?pwd=e1up">Baidu</a> & <a href="https://huggingface.co/initiacms/RoMA">Hugging Face</a>
</td>
      <td>Mamba-B</td>
      <td>85M</td>
      <td><strong>87.36</strong></td>
      <td><strong>59.45</strong></td>
      <td><strong>55.63</strong></td>
      <td>79.50</td>
    </tr>
  </tbody>
</table>

For the implementation of each task, please check the corresponding folder for details.

* [Scene Classification](https://github.com/MiliLab/RoMA/tree/main/Scene%20Classification) 
* [Change Detection](https://github.com/MiliLab/RoMA/tree/main/Change%20Detection)
* [Semantic Segmentation](https://github.com/MiliLab/RoMA/tree/main/Semantic%20Segmentation)

# 📈Scaling Behavior

<figure>
<img src="assets/image-20250312111728161.png">
<figcaption align = "center"><b>Figure 2: Scaling with Data Volume. 
 </b></figcaption>
</figure>

Mamba shows a clear performance boost on downstream tasks as the pretraining data volume grows. We pretrain the Mamba-Base model with RoMA across various data scales and evaluate it on the downstream tasks. As illustrated in Figure 2, larger pretraining datasets lead to significant improvements. Mamba-based RSFMs exhibit no noticeable performance bottleneck across a broad range of pretraining data scales, from 62.5K to 4M images, matching the data-scaling behavior of ViT-based RSFMs.

<figure>
<img src="assets/image-20250312112103330.png">
<figcaption align = "center"><b>Figure 3: Scaling with Model Size. 
 </b></figcaption>
</figure>

Mamba’s performance also improves with model size. We pretrain four model variants (Tiny, Small, Base, and Large) following the configurations in our code. As shown in Figure 3, larger models consistently achieve better results on downstream tasks. Although Mamba-Large surpasses Mamba-Base on the AID dataset, its gain remains limited, likely due to insufficient pretraining: 300 epochs on 4 million samples may not be adequate for a 297M-parameter model, and experimental constraints kept us from extending pretraining to 800 epochs as in MAE. The OSCD and SpaceNet experiments are ongoing, and we will update the results. These caveats do not alter our key finding: Mamba-based RSFMs pretrained with RoMA show consistent performance gains as model parameters scale.

# 🚀Pretraining

For environment setup and pretraining instructions, please refer to [RoMA/requirements.txt](https://github.com/MiliLab/RoMA/blob/main/RoMA/requirements.txt) and [RoMA/train.sh](https://github.com/MiliLab/RoMA/blob/main/RoMA/train.sh).

# 🎯Checkpoints

We provide our pretrained weights on <a href="https://pan.baidu.com/s/1e7VOvca7894hugM-f2UitQ?pwd=e1up">Baidu</a> and <a href="https://huggingface.co/initiacms/RoMA">Hugging Face</a>.
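
A minimal loading sketch is shown below. The entry points are standard Hugging Face / PyTorch patterns, not a confirmed RoMA API: whether the repo ships `transformers`-compatible modeling code, and the checkpoint filename, are assumptions to verify on the repo page.

```python
import torch
from transformers import AutoModel

# If the repo provides custom modeling code, AutoModel with
# trust_remote_code=True is the usual route for non-standard architectures
# such as Mamba (assumption: such code is present in the repo).
model = AutoModel.from_pretrained("initiacms/RoMA", trust_remote_code=True)

# Alternatively, if the release is a raw PyTorch state dict, download it and
# load it into your own Mamba-B definition ("checkpoint.pth" is a placeholder
# filename; check the repo's file list for the actual name).
# from huggingface_hub import hf_hub_download
# path = hf_hub_download("initiacms/RoMA", "checkpoint.pth")
# state = torch.load(path, map_location="cpu")
# mamba_b.load_state_dict(state, strict=False)
```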

# 🔗Citation

If you find RoMA helpful, please consider citing:

```latex
@article{wang2025roma,
  title={RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing},
  author={Fengxiang Wang and Hongzhen Wang and Yulin Wang and Di Wang and Mingshuo Chen and Haiyan Zhao and Yangang Sun and Shuo Wang and Long Lan and Wenjing Yang and Jing Zhang},
  journal={arXiv preprint arXiv:2503.10392},
  year={2025}
}
```

# 🤝Acknowledgement

* [ARM](https://github.com/OliverRensu/ARM/tree/main): Autoregressive Pretraining with Mamba in Vision.
* [MTP](https://github.com/ViTAE-Transformer/MTP): Advancing Remote Sensing Foundation Model via Multi-Task Pretraining.
* [RSP](https://github.com/ViTAE-Transformer/RSP): An Empirical Study of Remote Sensing Pretraining.
* [open-cd](https://github.com/likyoo/open-cd): An open source change detection toolbox based on a series of open source general vision task tools.
* [mmcv](https://github.com/open-mmlab/mmcv), [mmsegmentation](https://github.com/open-mmlab/mmsegmentation): OpenMMLab Toolbox.