🌍 core-dino | Resolution-Agnostic Self-Supervised Learning on Satellite Imagery
📌 Overview
`core-dino` is a resolution-agnostic self-supervised model designed for satellite imagery, trained on the Core-Five dataset using a DiNO-inspired setup. It handles imagery between 20 cm and 2 m, learning strong spatial features without any labels.
Open Demo ▶️ - Run multi-resolution inference & visualize spatial embeddings.
🚀 Quickstart
```python
import torch
from ultralytics import YOLO

model = YOLO("yolo11x-obb.pt")  # any YOLOv11 task head works (OBB, bbox, or seg)

# Download the self-supervised student checkpoint and remap its keys
# onto the YOLO backbone ("layers." -> "model.").
url = "https://huggingface.co/gajeshladhar/core-dino/resolve/main/checkpoints/student.pt"
ckpt = torch.hub.load_state_dict_from_url(url, map_location="cpu")
model.model.load_state_dict(
    {k.replace("layers.", "model."): v for k, v in ckpt.items()},
    strict=False)  # strict=False keeps the task head's original weights
```
🧠 Architecture: DiNO × YOLO × I-JEPA
We combine three ideas to build a high-performance backbone for spatial representation learning:
1️⃣ Multi-Resolution DINO Setup (instead of local-global)
In standard DINO / DINOv2, the student sees cropped or distorted views (local), while the teacher sees global views.
In `core-dino`, we replace this with a clean vs. degraded resolution contrast:
- 👨‍🏫 Teacher gets clean 30 cm satellite imagery.
- 👨‍🎓 Student sees augmented versions of the same scene at varying resolutions (30 cm → 2 m) with photometric and spatial distortions.
This setup encourages the model to learn scale-invariant and semantic-aware features across real-world EO resolution gaps.
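A minimal sketch of how such a clean/degraded pair can be formed with plain `torch` ops (the function name, tile size, and uniform GSD sampling are illustrative assumptions, not the exact training code):

```python
import torch
import torch.nn.functional as F

def degrade_resolution(img: torch.Tensor, gsd_m: float, native_gsd_m: float = 0.3) -> torch.Tensor:
    """Simulate a coarser ground-sample distance by downsampling the tile.
    A 2 m target GSD shrinks a 30 cm tile by a factor of ~6.7 per side."""
    scale = native_gsd_m / gsd_m
    return F.interpolate(img, scale_factor=scale, mode="bilinear", align_corners=False)

# Teacher: clean 30 cm tile. Student: same scene at a random GSD in [0.3, 2.0] m.
tile = torch.rand(1, 3, 640, 640)             # placeholder image batch
teacher_view = tile
gsd = float(torch.empty(1).uniform_(0.3, 2.0))
student_view = degrade_resolution(tile, gsd)  # smaller H x W at coarser GSD
```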
2️⃣ I-JEPA-Style Patch Dropping
We integrate ideas from I-JEPA:
- Random patch regions are dropped from the student input.
- The objective is to align the visible patch embeddings with the teacher's corresponding high-resolution ones.
- This enforces local-global and partial-whole consistency in the latent space (see the sketch below).
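One way to realize this masking for a convolutional student is to zero out aligned square patches and keep the mask so the loss can be restricted to visible locations; a minimal sketch (patch size and drop ratio are illustrative):

```python
import torch

def drop_patches(img, patch=32, drop_ratio=0.3):
    """Zero out a random subset of aligned square patches (I-JEPA-style
    masking for a convolutional student). Returns the masked image and
    the low-resolution keep-mask for masking the loss."""
    b, _, h, w = img.shape
    gh, gw = h // patch, w // patch
    keep = (torch.rand(b, 1, gh, gw, device=img.device) > drop_ratio).float()
    mask = keep.repeat_interleave(patch, 2).repeat_interleave(patch, 3)
    return img * mask, keep

masked, keep = drop_patches(torch.rand(2, 3, 640, 640))
```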
3️⃣ YOLOv11-X as Encoder Backbone
- We use YOLOv11-X, one of the most powerful and recent YOLO variants, as the spatial encoder.
- The backbone is truncated after 23 layers, retaining rich spatial semantics while maintaining efficiency.
- This provides strong priors from supervised detection tasks, now adapted for self-supervised learning (the sketch below shows the truncation point).
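If you want to inspect what the truncated backbone produces, a forward hook on the 23rd layer is a simple probe. This is a hedged sketch: the `yolo.model.model` indexing reflects the current Ultralytics module layout and may change between releases.

```python
import torch
from ultralytics import YOLO

yolo = YOLO("yolo11x-obb.pt")
layers = yolo.model.model        # underlying torch.nn.Sequential of YOLO layers

feats = {}
handle = layers[22].register_forward_hook(   # index 22 == the 23rd layer
    lambda module, inputs, output: feats.update(out=output))

yolo.model.eval()
with torch.no_grad():
    yolo.model(torch.rand(1, 3, 640, 640))   # full forward; hook grabs layer 23
handle.remove()

print(feats["out"].shape)                    # spatial feature map used as embedding
```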
🧪 Training Flow: Resolution-Agnostic DiNO
The training pipeline in `core-dino` follows a student-teacher design inspired by DINO, but adapted for real-world satellite imagery:
👨‍🏫 1. Teacher View (Clean & High-Res)
- Receives a clean 30 cm image without any augmentation.
- Used as the stable reference to guide the student.
👨‍🎓 2. Student View (Augmented Multi-Resolution)
- Receives randomly augmented versions of the same image:
  - Downsampled to resolutions between 30 cm and 2 m
  - Augmented with noise, blur, color jitter, spatial dropout, etc. (see the sketch below)
- Emulates the resolution variability common in EO imagery.
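A sketch of what such a student augmentation could look like with `torchvision` (the jitter, blur, and noise parameters are illustrative assumptions, not the published recipe; spatial dropout is covered by the patch-drop sketch above):

```python
import torch
from torchvision import transforms

# Illustrative photometric augmentations for the student view.
photometric = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.2),
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
])

def student_augment(img):
    img = photometric(img)                    # blur + color jitter
    img = img + 0.02 * torch.randn_like(img)  # additive sensor-like noise
    return img.clamp(0.0, 1.0)

aug = student_augment(torch.rand(1, 3, 320, 320))
```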
⚠️ 3. Spatial Misalignment & Solution
- Since different student resolutions produce different spatial dimensions (H × W), we use bilinear interpolation to resize the student's feature map to match the teacher's spatial shape before computing the contrastive loss (see the sketch below).
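In code, this alignment step is a single `F.interpolate` call; a minimal sketch with dummy shapes:

```python
import torch
import torch.nn.functional as F

def align_to_teacher(student_feats: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
    """Resize (B, C, Hs, Ws) student features to the teacher's (Ht, Wt) grid."""
    return F.interpolate(student_feats, size=teacher_feats.shape[-2:],
                         mode="bilinear", align_corners=False)

s = align_to_teacher(torch.rand(1, 256, 20, 20), torch.rand(1, 256, 40, 40))
assert s.shape[-2:] == (40, 40)
```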
🎯 4. Objective
- Align the student's spatial token embeddings with the teacher's, pixel-to-pixel and semantically, despite resolution gaps and augmentations.
- Encourages scale-invariant, robust feature learning across real-world variations (a sketch of a per-pixel loss follows below).
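A hedged sketch of a per-pixel DINO-style objective consistent with this description (the temperatures follow common DINO defaults; the centering tensor and function name are assumptions):

```python
import torch
import torch.nn.functional as F

def spatial_dino_loss(student, teacher, center, t_s=0.1, t_t=0.04):
    """DINO-style cross-entropy applied per spatial token.
    student, teacher: (B, C, H, W) on the same grid (after alignment);
    center: (1, C, 1, 1) running mean of teacher outputs."""
    p_t = F.softmax((teacher - center) / t_t, dim=1).detach()  # centered, sharpened teacher
    log_p_s = F.log_softmax(student / t_s, dim=1)
    return -(p_t * log_p_s).sum(dim=1).mean()                  # average over pixels and batch

loss = spatial_dino_loss(torch.rand(1, 256, 40, 40), torch.rand(1, 256, 40, 40),
                         center=torch.zeros(1, 256, 1, 1))
```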
📊 Performance: Latent Quality & Downstream Evaluation
Despite being trained without any labels, `core-dino` demonstrates strong latent alignment and generalization capability, both in visual similarity and in downstream tasks.
🛣️ Downstream: Road Extraction (DeepGlobe Dataset)
We evaluated `core-dino` on the DeepGlobe Road Extraction Dataset, using it as a frozen backbone in a simple segmentation pipeline.
Setup:
- Both `core-dino` and YOLOv11-X backbones were frozen
- Only a 2-layer convolutional head was trained
- Task: binary road segmentation using IoU loss (a sketch of the head and loss follows below)
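A minimal sketch of such a probe, assuming backbone features of `in_ch` channels (the head width and the soft-IoU formulation are illustrative):

```python
import torch
import torch.nn as nn

class RoadHead(nn.Module):
    """2-layer convolutional head trained on top of frozen backbone features."""
    def __init__(self, in_ch: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, 1))

    def forward(self, feats):
        return self.net(feats)  # (B, 1, H, W) road logits

def soft_iou_loss(logits, target, eps=1e-6):
    """1 - soft IoU between predicted probabilities and the binary road mask."""
    p = torch.sigmoid(logits)
    inter = (p * target).sum(dim=(1, 2, 3))
    union = (p + target - p * target).sum(dim=(1, 2, 3))
    return (1 - (inter + eps) / (union + eps)).mean()
```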
Result:
- `core-dino` consistently outperformed the supervised YOLOv11-X backbone across all epochs
- Shows superior latent representation quality, even without task-specific supervision
- Demonstrates better generalization and semantic robustness in downstream transfer tasks
🔗 Reproduce this comparison in Colab:
🏗️ Downstream: Building Footprint Validation
To evaluate transferability to structural segmentation tasks, we tested `core-dino` on building footprint extraction using high-resolution satellite imagery.
Setup:
- Compared YOLOv11-X (original weights) vs. YOLOv11-X initialized with `core-dino` weights
- Used the same training pipeline for both (see the sketch below)
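A sketch of that comparison using the Ultralytics training API (`buildings.yaml` is a placeholder dataset config; the base weights, epochs, and image size are illustrative):

```python
import torch
from ultralytics import YOLO

URL = "https://huggingface.co/gajeshladhar/core-dino/resolve/main/checkpoints/student.pt"

def finetune(init_core_dino: bool):
    model = YOLO("yolo11x.pt")  # identical pipeline for both runs
    if init_core_dino:
        ckpt = torch.hub.load_state_dict_from_url(URL, map_location="cpu")
        model.model.load_state_dict(
            {k.replace("layers.", "model."): v for k, v in ckpt.items()},
            strict=False)
    return model.train(data="buildings.yaml", epochs=50, imgsz=640)

baseline = finetune(False)
core_dino_init = finetune(True)
```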
Result:
- `core-dino` achieved a +15 mAP improvement over standard YOLOv11-X
- Captures edge-localized and compact building structures better
- Demonstrates strong spatial precision and high-quality feature encoding
🔗 Reproduce this comparison in Colab:
🗂️ Model Details

| Field | Value |
|---|---|
| Parameters | 56.7M |
| Backbone Architecture | YOLOv11-X |
| Input Size | 320 × 320 → 4096 × 4096 |
| Patch Source | Core-Five |
| Resolutions | 30 cm (clean) → 2 m (augmented) |
| Patch Drop | I-JEPA-style masking |
| Loss | DINO contrastive loss |
| Training Time | ~48 h on 1×A100 |
🌳 License
This project is released under the Creative Commons Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0) license.
✅ Free to use, share, and adapt for non-commercial research
❌ Commercial use is not permitted without explicit permission
🙏 Please provide appropriate credit when using this model in publications or projects.