gajeshladhar
/

core-dino


gajeshladhar committed on Jul 4

Commit

a7d8629

1 Parent(s): d2997bb

Update README.md

README.md +169 -169

@@ -1,169 +1,169 @@
----
-license: cc-by-nc-4.0
----
-# 🌐 core-dino | Resolution-Agnostic Self-Supervised Learning on Satellite Imagery
-[![🤗 Model Hub](https://img.shields.io/badge/HuggingFace-core--dino-blue?logo=huggingface&logoColor=white)](https://huggingface.co/gajeshladhar/core-dino)
-[![🚀 Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1JvSx0AERGWoc8vAZAxOSPOOG7PAh6KqZ)
-![🛰️ Domain](https://img.shields.io/badge/Domain-Earth%20Observation-green)
-![🔍 Task](https://img.shields.io/badge/Task-Self--Supervised--Learning-orange)
----
-## 🔭 Overview
-`core-dino` is a resolution-agnostic **self-supervised model** designed for satellite imagery, trained on the [Core-Five dataset](https://huggingface.co/datasets/gajeshladhar/core-five) using a DiNO-inspired setup. It handles imagery between **30 cm and 2 m**, learning strong spatial features without any labels.
-<p>
-<a href="https://colab.research.google.com/drive/1JvSx0AERGWoc8vAZAxOSPOOG7PAh6KqZ" target="_blank">
-  <b>Open Demo ▶️</b>
-</a> - Run multi-resolution inference & visualize spatial embeddings.</p>
----
-## 🌀 Quickstart
-```python
-from ultralytics import YOLO
-model = YOLO("yolo11x-obb.pt") # obb, bbox or seg any model
-ckpt = "https://huggingface.co/gajeshladhar/core-dino/resolve/main/checkpoints/student.pt"
-ckpt = torch.hub.load_state_dict_from_url(ckpt, map_location='cpu')
-model.model.load_state_dict(
-    {k.replace('layers.', 'model.'): v for k, v in ckpt.items()},
-    strict=False)
-```
----
-## 🧠 Architecture: DiNO × YOLO × I-JEPA
-We combine three ideas to build a high-performance backbone for spatial representation learning:
-#### 1️⃣ **Multi-Resolution DINO Setup (instead of local-global)**
-> In standard [DINO](https://arxiv.org/abs/2104.14294) / [DINOv2](https://arxiv.org/abs/2304.07193), the student sees cropped or distorted views (local), while the teacher sees global views.
-> In `core-dino`, we replace this with **clean vs degraded resolution contrast**:
-- 👨‍🏫 **Teacher** gets clean 30 cm satellite imagery.
-- 👨‍🎓 **Student** sees augmented versions of the same scene at varying resolutions (30 cm → 2 m) with photometric and spatial distortions.
-This setup encourages the model to learn **scale-invariant** and **semantic-aware** features across real-world EO resolution gaps.
-#### 2️⃣ **I-JEPA-Style Patch Dropping**
-We integrate ideas from [I-JEPA](https://arxiv.org/abs/2301.08243):
-- Random **patch regions are dropped** from the student input.
-- The objective is to align the **visible patch embeddings** with the teacher’s corresponding high-resolution ones.
-- This enforces **local-global and partial-whole consistency** in the latent space.
-#### 3️⃣ **YOLOv11-X as Encoder Backbone**
-- We use **YOLOv11-X**, one of the most powerful and recent YOLO variants, as the spatial encoder.
-- The backbone is **truncated after 23 layers**, retaining rich spatial semantics while maintaining efficiency.
-- This provides strong priors from supervised detection tasks, now adapted for **self-supervised** learning.
----
-## 🧪 Training Flow: Resolution-Agnostic DiNO
-The training pipeline in `core-dino` follows a student-teacher design inspired by DINO, but adapted for real-world satellite imagery:
-#### 👨‍🏫  1. Teacher View (Clean & High-Res)
-- Receives a **clean 30 cm image** without any augmentation.
-- Used as the stable reference to guide the student.
-#### 👨‍🎓 2. Student View (Augmented Multi-Resolution)
-- Receives **randomly augmented** versions of the same image:
-  - Downsampled to **30 cm to 2 m**
-  - Augmented with noise, blur, color jitter, spatial dropout, etc.
-- Emulates resolution variability common in EO imagery.
-#### ⚠️ 3. Spatial Misalignment & Solution
-- Since different student resolutions produce different spatial dimensions (H × W),
-  we use **bilinear interpolation** to **resize the student’s feature map** to match the teacher's spatial shape before computing the contrastive loss.
-#### 🎯 4. Objective
-- Align the spatial token embeddings of the student with the teacher — pixel-to-pixel and semantically — despite resolution gaps and augmentations.
-- Encourages **scale-invariant**, **robust** feature learning across real-world variations.
----
-## 📈 Performance: Latent Quality & Downstream Evaluation
-Despite being trained without any labels, `core-dino` demonstrates strong latent alignment and generalization capability — both in visual similarity and downstream tasks.
-### 🛣️ Downstream: Road Extraction (DeepGlobe Dataset)
-We evaluated `core-dino` on the [DeepGlobe Road Extraction Dataset](https://competitions.codalab.org/competitions/18467#learn_the_details), using it as a frozen backbone in a simple segmentation pipeline.
-- **Setup:**
-  - Both `core-dino` and **YOLOv11-X** backbones were **frozen**
-  - Only a **2-layer convolutional head** was trained
-  - Task: Binary road segmentation using IoU loss
-- **Result:**
-  - `core-dino` consistently outperformed the supervised **YOLOv11-X** backbone across all epochs
-  - Shows superior latent representation quality, even without task-specific supervision
-  - Demonstrates better **generalization** and **semantic robustness** in downstream transfer tasks
-<p style="display: inline-flex; align-items: center; gap: 8px; margin-top: 10px;">
-  <span style="font-size: 16px;">📓 <strong>Reproduce this comparison in Colab:</strong></span>
-  <a href="https://colab.research.google.com/drive/1JqJoboLljDc2EoqMvj40mA1Sa1vnCHry" target="_blank" style="display: inline-block; vertical-align: middle;">
-    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
-  </a>
-</p>
-<p align="center"><br>
-  <img src="assets/downstream-deepglobe-roads.png" alt="Downstream Performance" style="width:85%;">
-</p>
-### 🏙️ Downstream : Building Footprint Validation
-To evaluate transferability to structural segmentation tasks, we tested `core-dino` on **building footprint extraction** using high-resolution satellite imagery.
-- **Setup:**
-  - Compared **YOLOv11-X (original weights)** vs. **YOLOv11-X initialized with `core-dino` weights**
-  - Used same training pipeline for both
-- **Result:**
-  - `core-dino` achieved **+15 mAP** improvement over standard YOLOv11-X
-  - Captures edge-localized and compact building structures better
-  - Demonstrates strong spatial precision and high-quality feature encoding
-<p style="display: inline-flex; align-items: center; gap: 8px; margin-top: 10px;">
-  <span style="font-size: 16px;">📓 <strong>Reproduce this comparison in Colab:</strong></span>
-  <a href="https://colab.research.google.com/drive/1uAqUNUDQt0_29Zhvopz0rWVSAzX-cZrk" target="_blank" style="display: inline-block; vertical-align: middle;">
-    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
-  </a>
-</p>
-<p align="center"><br>
-  <img src="assets/downstream-building-footprint.png" alt="Downstream Performance" style="width:85%;">
-</p>
----
-## 🗂️ Model Details
-| Field              | Value                                                        |
-|--------------------|--------------------------------------------------------------|
-| Parameters         | **56.7M**                                                    |
-| Backbone Architecture | **YOLOv11 X**                        |
-| Input Size         | **320 × 320 – 4096 × 4096**                                  |
-| Patch Source       | [Core-Five](https://huggingface.co/datasets/gajeshladhar/core-five) |
-| Resolutions        | 30 cm (clean) → 2 m (augmented)                              |
-| Patch Drop         | I-JEPA-style masking                                         |
-| Loss               | DINO contrastive loss                                        |
-| Training Time      | ~48h on 1×A100                                               |
----
-## 💳 License
-This project is released under the **[Creative Commons Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0)](https://creativecommons.org/licenses/by-nc/3.0/)** license.
-> ✅ Free to use, share, and adapt for **non-commercial research**
-> ❌ **Commercial use is not permitted** without explicit permission
-> 📌 Please provide appropriate credit when using this dataset in publications or projects.

+---
+license: cc-by-nc-4.0
+---
+# 🌐 core-dino | Resolution-Agnostic Self-Supervised Learning on Satellite Imagery
+[![🤗 Model Hub](https://img.shields.io/badge/HuggingFace-core--dino-blue?logo=huggingface&logoColor=white)](https://huggingface.co/gajeshladhar/core-dino)
+[![🚀 Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1JvSx0AERGWoc8vAZAxOSPOOG7PAh6KqZ)
+![🛰️ Domain](https://img.shields.io/badge/Domain-Earth%20Observation-green)
+![🔍 Task](https://img.shields.io/badge/Task-Self--Supervised--Learning-orange)
+---
+## 🔭 Overview
+`core-dino` is a resolution-agnostic **self-supervised model** designed for satellite imagery, trained on the [Core-Five dataset](https://huggingface.co/datasets/gajeshladhar/core-five) using a DiNO-inspired setup. It handles imagery between **20 cm and 2 m**, learning strong spatial features without any labels.
+<p>
+<a href="https://colab.research.google.com/drive/1JvSx0AERGWoc8vAZAxOSPOOG7PAh6KqZ" target="_blank">
+  <b>Open Demo ▶️</b>
+</a> - Run multi-resolution inference & visualize spatial embeddings.</p>
+---
+## 🌀 Quickstart
+```python
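+import torch
+from ultralytics import YOLO
+
+model = YOLO("yolo11x-obb.pt")  # any YOLO11-X variant works: obb, bbox, or seg
+# Download the self-supervised student weights (cached by torch.hub after the first call)
+url = "https://huggingface.co/gajeshladhar/core-dino/resolve/main/checkpoints/student.pt"
+ckpt = torch.hub.load_state_dict_from_url(url, map_location='cpu')
+# Rename the checkpoint's 'layers.' prefix to the 'model.' prefix used by the YOLO module;
+# strict=False skips layers that are absent from the truncated backbone checkpoint
+model.model.load_state_dict(
+    {k.replace('layers.', 'model.'): v for k, v in ckpt.items()},
+    strict=False)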
+```
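The key renaming in the Quickstart can be tried without downloading anything. This is a minimal sketch, assuming checkpoint keys carry a `layers.` prefix as the dictionary comprehension implies; the example key names, the placeholder values, and the `remap_keys` helper are hypothetical and not taken from the checkpoint.

```python
# Hypothetical checkpoint keys; the real ones come from student.pt.
ckpt = {
    "layers.0.conv.weight": "w0",
    "layers.0.bn.bias": "b0",
    "layers.22.cv1.conv.weight": "w22",
}


def remap_keys(state_dict, old="layers.", new="model."):
    """Rename the checkpoint prefix so keys line up with the YOLO module names."""
    return {k.replace(old, new): v for k, v in state_dict.items()}


remapped = remap_keys(ckpt)
print(sorted(remapped))
```

With `strict=False`, PyTorch's `load_state_dict` does not raise on keys that exist on only one side; it returns the lists of missing and unexpected keys, which are worth inspecting to confirm the backbone weights were actually matched.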
+---
+## 🧠 Architecture: DiNO × YOLO × I-JEPA
+We combine three ideas to build a high-performance backbone for spatial representation learning:
+#### 1️⃣ **Multi-Resolution DINO Setup (instead of local-global)**
+> In standard [DINO](https://arxiv.org/abs/2104.14294) / [DINOv2](https://arxiv.org/abs/2304.07193), the student sees cropped or distorted views (local), while the teacher sees global views.
+> In `core-dino`, we replace this with **clean vs degraded resolution contrast**:
+- 👨‍🏫 **Teacher** gets clean 30 cm satellite imagery.
+- 👨‍🎓 **Student** sees augmented versions of the same scene at varying resolutions (30 cm → 2 m) with photometric and spatial distortions.
+This setup encourages the model to learn **scale-invariant** and **semantic-aware** features across real-world EO resolution gaps.
+#### 2️⃣ **I-JEPA-Style Patch Dropping**
+We integrate ideas from [I-JEPA](https://arxiv.org/abs/2301.08243):
+- Random **patch regions are dropped** from the student input.
+- The objective is to align the **visible patch embeddings** with the teacher’s corresponding high-resolution ones.
+- This enforces **local-global and partial-whole consistency** in the latent space.
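The patch dropping described above can be sketched as follows. This is an illustration only, assuming a regular grid of square patches and a uniform random drop ratio; `patch`, `drop_ratio`, `seed`, and the `drop_patches` helper are hypothetical names, and the actual masking strategy used in training is not specified in this README.

```python
import random


def drop_patches(image, patch=2, drop_ratio=0.5, seed=0):
    """Zero out randomly chosen patch x patch blocks of a 2-D image (list of rows).

    Returns the masked image and a grid of booleans (True = patch kept).
    """
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    gh, gw = h // patch, w // patch
    cells = [(i, j) for i in range(gh) for j in range(gw)]
    dropped = set(rng.sample(cells, int(len(cells) * drop_ratio)))
    keep = [[(i, j) not in dropped for j in range(gw)] for i in range(gh)]
    masked = [row[:] for row in image]  # copy so the input stays untouched
    for i, j in dropped:
        for y in range(i * patch, (i + 1) * patch):
            for x in range(j * patch, (j + 1) * patch):
                masked[y][x] = 0.0
    return masked, keep


image = [[1.0] * 8 for _ in range(8)]
masked, keep = drop_patches(image, patch=2, drop_ratio=0.25)
```

The `keep` grid is what a loss would use to restrict the alignment to visible patches.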
+#### 3️⃣ **YOLOv11-X as Encoder Backbone**
+- We use **YOLOv11-X**, the largest variant in the Ultralytics YOLO11 family, as the spatial encoder.
+- The backbone is **truncated after 23 layers**, retaining rich spatial semantics while maintaining efficiency.
+- This provides strong priors from supervised detection tasks, now adapted for **self-supervised** learning.
+---
+## 🧪 Training Flow: Resolution-Agnostic DiNO
+The training pipeline in `core-dino` follows a student-teacher design inspired by DINO, but adapted for real-world satellite imagery:
+#### 👨‍🏫  1. Teacher View (Clean & High-Res)
+- Receives a **clean 30 cm image** without any augmentation.
+- Used as the stable reference to guide the student.
+#### 👨‍🎓 2. Student View (Augmented Multi-Resolution)
+- Receives **randomly augmented** versions of the same image:
+  - Downsampled to **30 cm to 2 m**
+  - Augmented with noise, blur, color jitter, spatial dropout, etc.
+- Emulates resolution variability common in EO imagery.
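The size of a downsampled student view follows from the ratio of ground sample distances (GSD). A small sketch, assuming the source tile is at 30 cm per pixel and the result is rounded to the nearest pixel; the function name and its defaults are illustrative.

```python
def resampled_size(size_px, src_gsd_m=0.3, dst_gsd_m=2.0):
    """Side length in pixels after resampling from src_gsd_m to dst_gsd_m per pixel."""
    return max(1, round(size_px * src_gsd_m / dst_gsd_m))


# The same 1024 px tile at several target resolutions (metres per pixel).
for gsd in (0.3, 0.5, 1.0, 2.0):
    print(gsd, resampled_size(1024, dst_gsd_m=gsd))
```

A 1024 px tile at 30 cm covers about 307 m on the ground, so at 2 m per pixel it shrinks to roughly 154 px per side.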
+#### ⚠️ 3. Spatial Misalignment & Solution
+- Since different student resolutions produce different spatial dimensions (H × W),
+  we use **bilinear interpolation** to **resize the student’s feature map** to match the teacher's spatial shape before computing the contrastive loss.
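To make the resizing step concrete, here is a dependency-free sketch of bilinear interpolation on a single-channel feature map. It assumes corner-aligned sampling, which is one of several conventions; in PyTorch this step is commonly done with `torch.nn.functional.interpolate(x, size=..., mode="bilinear")`, and whether the training code uses that call or which `align_corners` setting it picks is not stated here.

```python
def bilinear_resize(fmap, out_h, out_w):
    """Resize a 2-D feature map (list of rows) with bilinear interpolation.

    Corner-aligned: the four corner values of the input are preserved.
    """
    in_h, in_w = len(fmap), len(fmap[0])
    # Scale factors that map output indices onto input coordinates.
    sy = (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
    sx = (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
    out = []
    for i in range(out_h):
        y = i * sy
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * sx
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = fmap[y0][x0] * (1 - wx) + fmap[y0][x1] * wx
            bottom = fmap[y1][x0] * (1 - wx) + fmap[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


student = [[0.0, 2.0], [4.0, 6.0]]  # coarse 2x2 student feature map
resized = bilinear_resize(student, 3, 3)  # upsampled to a 3x3 teacher grid
```

After this step every student token has a teacher token at the same grid position, so the loss can be computed position by position.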
+#### 🎯 4. Objective
+- Align the spatial token embeddings of the student with the teacher — pixel-to-pixel and semantically — despite resolution gaps and augmentations.
+- Encourages **scale-invariant**, **robust** feature learning across real-world variations.
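The objective can be sketched as a DINO-style cross-entropy applied per spatial token. This is a hedged illustration, not the training code: the temperatures (0.1 for the student, 0.04 for the teacher) and the teacher centering follow the defaults described in the DINO paper and are assumptions here, and all function names are hypothetical.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(v - m) for v in scaled))
    return [v - log_z for v in scaled]


def dino_token_loss(student_logits, teacher_logits, center,
                    student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the centered, sharpened teacher and the student."""
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(v) for v in log_softmax(centered, teacher_temp)]
    student_logp = log_softmax(student_logits, student_temp)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_logp))


def spatial_loss(student_tokens, teacher_tokens, center):
    """Average the per-token loss over spatially aligned token pairs."""
    losses = [dino_token_loss(s, t, center)
              for s, t in zip(student_tokens, teacher_tokens)]
    return sum(losses) / len(losses)


teacher = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]  # two tokens, three output dims
center = [0.0, 0.0, 0.0]
aligned = spatial_loss(teacher, teacher, center)
misaligned = spatial_loss(teacher[::-1], teacher, center)
```

The loss is small when each student token agrees with the teacher token at the same position and grows when the tokens are swapped, which is the position-wise alignment the objective asks for.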
+---
+## 📈 Performance: Latent Quality & Downstream Evaluation
+Despite being trained without any labels, `core-dino` demonstrates strong latent alignment and generalization capability — both in visual similarity and downstream tasks.
+### 🛣️ Downstream: Road Extraction (DeepGlobe Dataset)
+We evaluated `core-dino` on the [DeepGlobe Road Extraction Dataset](https://competitions.codalab.org/competitions/18467#learn_the_details), using it as a frozen backbone in a simple segmentation pipeline.
+- **Setup:**
+  - Both `core-dino` and **YOLOv11-X** backbones were **frozen**
+  - Only a **2-layer convolutional head** was trained
+  - Task: Binary road segmentation using IoU loss
+- **Result:**
+  - `core-dino` consistently outperformed the supervised **YOLOv11-X** backbone across all epochs
+  - Shows superior latent representation quality, even without task-specific supervision
+  - Demonstrates better **generalization** and **semantic robustness** in downstream transfer tasks
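The README only says "IoU loss", so the exact formulation is not known. A common differentiable variant is the soft IoU (Jaccard) loss, sketched below on flattened lists; the function name and the `eps` smoothing term are illustrative assumptions.

```python
def soft_iou_loss(probs, targets, eps=1e-6):
    """1 - soft IoU between predicted road probabilities and a binary mask."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    union = sum(probs) + sum(targets) - intersection
    # eps keeps the ratio defined when both prediction and mask are empty.
    return 1.0 - (intersection + eps) / (union + eps)


perfect = soft_iou_loss([1.0, 0.0, 1.0, 0.0], [1, 0, 1, 0])
disjoint = soft_iou_loss([1.0, 0.0], [0, 1])
partial = soft_iou_loss([0.5, 0.5], [1, 0])
```

The loss is close to 0 for a perfect prediction, close to 1 when prediction and mask do not overlap, and about 0.67 for the half-confident case above.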
+<p style="display: inline-flex; align-items: center; gap: 8px; margin-top: 10px;">
+  <span style="font-size: 16px;">📓 <strong>Reproduce this comparison in Colab:</strong></span>
+  <a href="https://colab.research.google.com/drive/1JqJoboLljDc2EoqMvj40mA1Sa1vnCHry" target="_blank" style="display: inline-block; vertical-align: middle;">
+    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
+  </a>
+</p>
+<p align="center"><br>
+  <img src="assets/downstream-deepglobe-roads.png" alt="Downstream Performance" style="width:85%;">
+</p>
+### 🏙️ Downstream: Building Footprint Validation
+To evaluate transferability to structural segmentation tasks, we tested `core-dino` on **building footprint extraction** using high-resolution satellite imagery.
+- **Setup:**
+  - Compared **YOLOv11-X (original weights)** vs. **YOLOv11-X initialized with `core-dino` weights**
+  - Used the same training pipeline for both
+- **Result:**
+  - `core-dino` achieved a **+15 mAP** improvement over standard YOLOv11-X
+  - Captures edge-localized and compact building structures better
+  - Demonstrates strong spatial precision and high-quality feature encoding
+<p style="display: inline-flex; align-items: center; gap: 8px; margin-top: 10px;">
+  <span style="font-size: 16px;">📓 <strong>Reproduce this comparison in Colab:</strong></span>
+  <a href="https://colab.research.google.com/drive/1uAqUNUDQt0_29Zhvopz0rWVSAzX-cZrk" target="_blank" style="display: inline-block; vertical-align: middle;">
+    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab">
+  </a>
+</p>
+<p align="center"><br>
+  <img src="assets/downstream-building-footprint.png" alt="Downstream Performance" style="width:85%;">
+</p>
+---
+## 🗂️ Model Details
+| Field              | Value                                                        |
+|--------------------|--------------------------------------------------------------|
+| Parameters         | **56.7M**                                                    |
+| Backbone Architecture | **YOLOv11-X**                        |
+| Input Size         | **320 × 320 – 4096 × 4096**                                  |
+| Patch Source       | [Core-Five](https://huggingface.co/datasets/gajeshladhar/core-five) |
+| Resolutions        | 30 cm (clean) → 2 m (augmented)                              |
+| Patch Drop         | I-JEPA-style masking                                         |
+| Loss               | DINO contrastive loss                                        |
+| Training Time      | ~48h on 1×A100                                               |
+---
+## 💳 License
+This project is released under the **[Creative Commons Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0)](https://creativecommons.org/licenses/by-nc/3.0/)** license.
+> ✅ Free to use, share, and adapt for **non-commercial research**
+> ❌ **Commercial use is not permitted** without explicit permission
+> 📌 Please provide appropriate credit when using this model in publications or projects.