gajeshladhar committed on
Commit a7d8629 · verified · 1 Parent(s): d2997bb

Update README.md
Files changed (1): README.md (+169 -169)

---
license: cc-by-nc-4.0
---

# 🌐 core-dino | Resolution-Agnostic Self-Supervised Learning on Satellite Imagery

[![🤗 Model Hub](https://img.shields.io/badge/HuggingFace-core--dino-blue?logo=huggingface&logoColor=white)](https://huggingface.co/gajeshladhar/core-dino)
[![🚀 Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1JvSx0AERGWoc8vAZAxOSPOOG7PAh6KqZ)
![🛰️ Domain](https://img.shields.io/badge/Domain-Earth%20Observation-green)
![🔍 Task](https://img.shields.io/badge/Task-Self--Supervised--Learning-orange)

---

## 🔭 Overview

`core-dino` is a resolution-agnostic **self-supervised model** designed for satellite imagery, trained on the [Core-Five dataset](https://huggingface.co/datasets/gajeshladhar/core-five) using a DINO-inspired setup. It handles imagery between **20 cm and 2 m** ground sample distance (GSD), learning strong spatial features without any labels.

<p>
  <a href="https://colab.research.google.com/drive/1JvSx0AERGWoc8vAZAxOSPOOG7PAh6KqZ" target="_blank">
    <b>Open Demo ▶️</b>
  </a> - Run multi-resolution inference & visualize spatial embeddings.</p>

---

## 🌀 Quickstart

```python
import torch
from ultralytics import YOLO

model = YOLO("yolo11x-obb.pt")  # any OBB, bbox, or seg model works here
ckpt_url = "https://huggingface.co/gajeshladhar/core-dino/resolve/main/checkpoints/student.pt"
ckpt = torch.hub.load_state_dict_from_url(ckpt_url, map_location="cpu")
# remap checkpoint keys ("layers.*") to the ultralytics naming ("model.*")
model.model.load_state_dict(
    {k.replace("layers.", "model."): v for k, v in ckpt.items()},
    strict=False)
```
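
Once the weights are loaded, inference goes through the usual ultralytics API; a minimal sketch, with a hypothetical image path:

```python
results = model("sample_scene.jpg")  # placeholder path; any image source ultralytics accepts
results[0].show()                    # visualize the predicted oriented boxes
```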

---

## 🧠 Architecture: DINO × YOLO × I-JEPA

We combine three ideas to build a high-performance backbone for spatial representation learning:

#### 1️⃣ **Multi-Resolution DINO Setup (instead of local-global)**
> In standard [DINO](https://arxiv.org/abs/2104.14294) / [DINOv2](https://arxiv.org/abs/2304.07193), the student sees cropped or distorted views (local), while the teacher sees global views.
> In `core-dino`, we replace this with a **clean-vs-degraded resolution contrast**:
- 👨‍🏫 **Teacher** gets clean 30 cm satellite imagery.
- 👨‍🎓 **Student** sees augmented versions of the same scene at varying resolutions (30 cm → 2 m) with photometric and spatial distortions (see the sketch below).

This setup encourages the model to learn **scale-invariant** and **semantic-aware** features across real-world EO resolution gaps.
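
As a concrete illustration, here is a minimal sketch of how a degraded student view can be produced from a clean teacher tile. The GSD range follows the description above; the exact augmentation recipe is an assumption, not the released training code:

```python
import random
import torch
import torch.nn.functional as F

def student_view(clean: torch.Tensor, clean_gsd: float = 0.3) -> torch.Tensor:
    """Degrade a clean (B, C, H, W) tile to a random coarser GSD in [0.3 m, 2 m]."""
    target_gsd = random.uniform(0.3, 2.0)  # simulated coarser sensor resolution
    scale = clean_gsd / target_gsd         # e.g. 0.3 / 2.0 -> 0.15
    x = F.interpolate(clean, scale_factor=scale, mode="bilinear", align_corners=False)
    x = x + 0.02 * torch.randn_like(x)     # stand-in for photometric distortions
    return x

view = student_view(torch.rand(1, 3, 1024, 1024))  # ~154 x 154 px at 2 m GSD
```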

#### 2️⃣ **I-JEPA-Style Patch Dropping**
We integrate ideas from [I-JEPA](https://arxiv.org/abs/2301.08243):
- Random **patch regions are dropped** from the student input.
- The objective is to align the **visible patch embeddings** with the teacher’s corresponding high-resolution ones.
- This enforces **local-global and partial-whole consistency** in the latent space (a masking sketch follows this list).
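
A minimal sketch of the masking step, with illustrative (not trained) patch size and drop ratio:

```python
import torch

def drop_patches(x: torch.Tensor, patch: int = 32, drop_ratio: float = 0.3) -> torch.Tensor:
    """Zero out random patch-aligned regions of the student input (B, C, H, W)."""
    b, _, h, w = x.shape
    keep = (torch.rand(b, 1, h // patch, w // patch) > drop_ratio).float()
    mask = keep.repeat_interleave(patch, dim=-2).repeat_interleave(patch, dim=-1)
    return x * mask  # only embeddings at kept locations are aligned with the teacher

masked = drop_patches(torch.rand(2, 3, 256, 256))
```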

#### 3️⃣ **YOLOv11-X as Encoder Backbone**
- We use **YOLOv11-X**, one of the most recent and capable YOLO variants, as the spatial encoder.
- The backbone is **truncated after 23 layers**, retaining rich spatial semantics while maintaining efficiency (see the sketch below).
- This provides strong priors from supervised detection tasks, now adapted for **self-supervised** learning.
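
A minimal sketch of one way to build such a truncated encoder with ultralytics internals. It assumes the ultralytics convention that each module stores its input routing in `.f`; this is an illustration, not the training code:

```python
import torch
from ultralytics import YOLO

yolo = YOLO("yolo11x.pt").model  # underlying DetectionModel
layers = yolo.model[:23]         # first 23 modules; the detect head is dropped

def encode(x: torch.Tensor) -> torch.Tensor:
    outputs = []                 # cache each layer's output for skip connections
    for layer in layers:
        if layer.f != -1:        # gather this layer's inputs from earlier outputs
            x = outputs[layer.f] if isinstance(layer.f, int) \
                else [x if j == -1 else outputs[j] for j in layer.f]
        x = layer(x)
        outputs.append(x)
    return x                     # spatial feature map after layer 23

feats = encode(torch.randn(1, 3, 640, 640))
```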

---

## 🧪 Training Flow: Resolution-Agnostic DINO

The training pipeline in `core-dino` follows a student-teacher design inspired by DINO, adapted for real-world satellite imagery:

#### 👨‍🏫 1. Teacher View (Clean & High-Res)
- Receives a **clean 30 cm image** without any augmentation.
- Used as the stable reference to guide the student.

#### 👨‍🎓 2. Student View (Augmented Multi-Resolution)
- Receives **randomly augmented** versions of the same image:
  - Downsampled to a GSD between **30 cm and 2 m**
  - Augmented with noise, blur, color jitter, spatial dropout, etc.
- Emulates the resolution variability common in EO imagery.

#### ⚠️ 3. Spatial Misalignment & Solution
- Since different student resolutions produce feature maps with different spatial dimensions (H × W),
  we use **bilinear interpolation** to **resize the student’s feature map** to match the teacher's spatial shape before computing the contrastive loss (see the snippet below).
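
Concretely, the resize step can look like this; the feature shapes are illustrative stand-ins for real encoder outputs:

```python
import torch
import torch.nn.functional as F

# Stand-in feature maps; the real ones come from the student/teacher encoders.
student_feat = torch.randn(1, 768, 20, 20)    # coarse student view -> small H x W
teacher_feat = torch.randn(1, 768, 128, 128)  # clean 30 cm teacher view

# Bilinearly resize the student map onto the teacher's spatial grid
student_feat = F.interpolate(student_feat, size=teacher_feat.shape[-2:],
                             mode="bilinear", align_corners=False)
assert student_feat.shape == teacher_feat.shape
```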

#### 🎯 4. Objective
- Align the spatial token embeddings of the student with the teacher — pixel-to-pixel and semantically — despite resolution gaps and augmentations (a loss sketch follows).
- Encourages **scale-invariant**, **robust** feature learning across real-world variations.
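
The training objective is the DINO contrastive loss (see Model Details below). As a minimal sketch, a DINO-style cross-entropy between per-location token distributions could look as follows; the temperatures are hypothetical and the usual centering/EMA machinery is omitted:

```python
import torch.nn.functional as F

def dino_loss(student_feat, teacher_feat, t_student=0.1, t_teacher=0.04):
    """DINO-style cross-entropy between per-pixel token distributions."""
    s = student_feat.flatten(2).transpose(1, 2)      # (B, H*W, C)
    t = teacher_feat.flatten(2).transpose(1, 2)
    log_p_s = F.log_softmax(s / t_student, dim=-1)
    p_t = F.softmax(t / t_teacher, dim=-1).detach()  # teacher provides targets only
    return -(p_t * log_p_s).sum(dim=-1).mean()

loss = dino_loss(student_feat, teacher_feat)  # tensors from the resize sketch above
```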

---

## 📈 Performance: Latent Quality & Downstream Evaluation

Despite being trained without any labels, `core-dino` demonstrates strong latent alignment and generalization — both in visual similarity and in downstream tasks.

### 🛣️ Downstream: Road Extraction (DeepGlobe Dataset)

We evaluated `core-dino` on the [DeepGlobe Road Extraction Dataset](https://competitions.codalab.org/competitions/18467#learn_the_details), using it as a frozen backbone in a simple segmentation pipeline.

- **Setup:**
  - Both the `core-dino` and **YOLOv11-X** backbones were **frozen**
  - Only a **2-layer convolutional head** was trained (see the sketch after this list)
  - Task: binary road segmentation with an IoU loss

- **Result:**
  - `core-dino` consistently outperformed the supervised **YOLOv11-X** backbone across all epochs
  - Shows superior latent representation quality, even without task-specific supervision
  - Demonstrates better **generalization** and **semantic robustness** in downstream transfer tasks
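
A minimal sketch of that probe head, with assumed channel widths (match them to the actual encoder output):

```python
import torch.nn as nn

# Frozen backbone: only this 2-layer convolutional head is trained.
head = nn.Sequential(
    nn.Conv2d(768, 256, kernel_size=3, padding=1),  # 768 = assumed feature width
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 1, kernel_size=1),               # per-pixel road logits
)

for p in layers.parameters():  # `layers` = the truncated encoder sketched earlier
    p.requires_grad = False
```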

<p style="display: inline-flex; align-items: center; gap: 8px; margin-top: 10px;">
  <span style="font-size: 16px;">📓 <strong>Reproduce this comparison in Colab:</strong></span>
  <a href="https://colab.research.google.com/drive/1JqJoboLljDc2EoqMvj40mA1Sa1vnCHry" target="_blank" style="display: inline-block; vertical-align: middle;">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
  </a>
</p>
<p align="center"><br>
  <img src="assets/downstream-deepglobe-roads.png" alt="Downstream Performance" style="width:85%;">
</p>
### 🏙️ Downstream: Building Footprint Validation

To evaluate transferability to structural segmentation tasks, we tested `core-dino` on **building footprint extraction** using high-resolution satellite imagery.

- **Setup:**
  - Compared **YOLOv11-X (original weights)** vs. **YOLOv11-X initialized with `core-dino` weights**
  - Used the same training pipeline for both

- **Result:**
  - `core-dino` initialization yielded a **+15 mAP** improvement over standard YOLOv11-X
  - Captures edge-localized and compact building structures better
  - Demonstrates strong spatial precision and high-quality feature encoding

<p style="display: inline-flex; align-items: center; gap: 8px; margin-top: 10px;">
  <span style="font-size: 16px;">📓 <strong>Reproduce this comparison in Colab:</strong></span>
  <a href="https://colab.research.google.com/drive/1uAqUNUDQt0_29Zhvopz0rWVSAzX-cZrk" target="_blank" style="display: inline-block; vertical-align: middle;">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
  </a>
</p>
<p align="center"><br>
  <img src="assets/downstream-building-footprint.png" alt="Downstream Performance" style="width:85%;">
</p>

---

## 🗂️ Model Details

| Field                 | Value                                                               |
|-----------------------|---------------------------------------------------------------------|
| Parameters            | **56.7M**                                                           |
| Backbone Architecture | **YOLOv11-X** (truncated after 23 layers)                           |
| Input Size            | **320 × 320 – 4096 × 4096**                                         |
| Patch Source          | [Core-Five](https://huggingface.co/datasets/gajeshladhar/core-five) |
| Resolutions           | 30 cm (clean) → 2 m (augmented)                                     |
| Patch Drop            | I-JEPA-style masking                                                |
| Loss                  | DINO contrastive loss                                               |
| Training Time         | ~48 h on 1×A100                                                     |

---
## 💳 License

This project is released under the **[Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/)** license, matching the `license: cc-by-nc-4.0` metadata above.

> ✅ Free to use, share, and adapt for **non-commercial research**
> ❌ **Commercial use is not permitted** without explicit permission
> 📌 Please provide appropriate credit when using this model in publications or projects.