🌍 core-dino | Resolution-Agnostic Self-Supervised Learning on Satellite Imagery
📌 Overview
`core-dino` is a resolution-agnostic self-supervised model designed for satellite imagery, trained on the Core-Five dataset using a DiNO-inspired setup. It handles imagery between 20 cm and 2 m, learning strong spatial features without any labels.
Open Demo ▶️ - Run multi-resolution inference & visualize spatial embeddings.
🚀 Quickstart
```python
import torch
from ultralytics import YOLO

model = YOLO("yolo11x-obb.pt")  # any YOLOv11 task head works (OBB, bbox, or seg)

# Download the self-supervised student checkpoint and remap its keys
# onto the YOLO backbone ("layers." -> "model.").
url = "https://huggingface.co/gajeshladhar/core-dino/resolve/main/checkpoints/student.pt"
ckpt = torch.hub.load_state_dict_from_url(url, map_location="cpu")
model.model.load_state_dict(
    {k.replace("layers.", "model."): v for k, v in ckpt.items()},
    strict=False)  # strict=False keeps the task head's original weights
```
🧠 Architecture: DiNO × YOLO × I-JEPA
We combine three ideas to build a high-performance backbone for spatial representation learning:
1️⃣ Multi-Resolution DINO Setup (instead of local-global)
In standard DINO / DINOv2, the student sees cropped or distorted views (local), while the teacher sees global views.
In `core-dino`, we replace this with a clean vs. degraded resolution contrast:
- 👨‍🏫 Teacher gets clean 30 cm satellite imagery.
- 👨‍🎓 Student sees augmented versions of the same scene at varying resolutions (30 cm → 2 m) with photometric and spatial distortions.
This setup encourages the model to learn scale-invariant and semantic-aware features across real-world EO resolution gaps.
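A minimal sketch of how such a clean/degraded pair can be formed with plain `torch` ops (the function name, tile size, and uniform GSD sampling are illustrative assumptions, not the exact training code):

```python
import torch
import torch.nn.functional as F

def degrade_resolution(img: torch.Tensor, gsd_m: float, native_gsd_m: float = 0.3) -> torch.Tensor:
    """Simulate a coarser ground-sample distance by downsampling the tile.
    A 2 m target GSD shrinks a 30 cm tile by a factor of ~6.7 per side."""
    scale = native_gsd_m / gsd_m
    return F.interpolate(img, scale_factor=scale, mode="bilinear", align_corners=False)

# Teacher: clean 30 cm tile. Student: same scene at a random GSD in [0.3, 2.0] m.
tile = torch.rand(1, 3, 640, 640)             # placeholder image batch
teacher_view = tile
gsd = float(torch.empty(1).uniform_(0.3, 2.0))
student_view = degrade_resolution(tile, gsd)  # smaller H x W at coarser GSD
```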
2️⃣ I-JEPA-Style Patch Dropping
We integrate ideas from I-JEPA:
- Random patch regions are dropped from the student input.
- The objective is to align the visible patch embeddings with the teacher's corresponding high-resolution ones.
- This enforces local-global and partial-whole consistency in the latent space (see the sketch below).
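One way to realize this masking for a convolutional student is to zero out aligned square patches and keep the mask so the loss can be restricted to visible locations; a minimal sketch (patch size and drop ratio are illustrative):

```python
import torch

def drop_patches(img, patch=32, drop_ratio=0.3):
    """Zero out a random subset of aligned square patches (I-JEPA-style
    masking for a convolutional student). Returns the masked image and
    the low-resolution keep-mask for masking the loss."""
    b, _, h, w = img.shape
    gh, gw = h // patch, w // patch
    keep = (torch.rand(b, 1, gh, gw, device=img.device) > drop_ratio).float()
    mask = keep.repeat_interleave(patch, 2).repeat_interleave(patch, 3)
    return img * mask, keep

masked, keep = drop_patches(torch.rand(2, 3, 640, 640))
```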
3️⃣ YOLOv11-X as Encoder Backbone
- We use YOLOv11-X, one of the most powerful and recent YOLO variants, as the spatial encoder.
- The backbone is truncated after 23 layers, retaining rich spatial semantics while maintaining efficiency.
- This provides strong priors from supervised detection tasks, now adapted for self-supervised learning (the sketch below shows the truncation point).
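If you want to inspect what the truncated backbone produces, a forward hook on the 23rd layer is a simple probe. This is a hedged sketch: the `yolo.model.model` indexing reflects the current Ultralytics module layout and may change between releases.

```python
import torch
from ultralytics import YOLO

yolo = YOLO("yolo11x-obb.pt")
layers = yolo.model.model        # underlying torch.nn.Sequential of YOLO layers

feats = {}
handle = layers[22].register_forward_hook(   # index 22 == the 23rd layer
    lambda module, inputs, output: feats.update(out=output))

yolo.model.eval()
with torch.no_grad():
    yolo.model(torch.rand(1, 3, 640, 640))   # full forward; hook grabs layer 23
handle.remove()

print(feats["out"].shape)                    # spatial feature map used as embedding
```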
🧪 Training Flow: Resolution-Agnostic DiNO
The training pipeline in `core-dino` follows a student-teacher design inspired by DINO, but adapted for real-world satellite imagery:
👨‍🏫 1. Teacher View (Clean & High-Res)
- Receives a clean 30 cm image without any augmentation.
- Used as the stable reference to guide the student.
👨‍🎓 2. Student View (Augmented Multi-Resolution)
- Receives randomly augmented versions of the same image:
  - Downsampled to resolutions between 30 cm and 2 m
  - Augmented with noise, blur, color jitter, spatial dropout, etc. (see the sketch below)
- Emulates the resolution variability common in EO imagery.
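A sketch of what such a student augmentation could look like with `torchvision` (the jitter, blur, and noise parameters are illustrative assumptions, not the published recipe; spatial dropout is covered by the patch-drop sketch above):

```python
import torch
from torchvision import transforms

# Illustrative photometric augmentations for the student view.
photometric = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.2),
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
])

def student_augment(img):
    img = photometric(img)                    # blur + color jitter
    img = img + 0.02 * torch.randn_like(img)  # additive sensor-like noise
    return img.clamp(0.0, 1.0)

aug = student_augment(torch.rand(1, 3, 320, 320))
```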
⚠️ 3. Spatial Misalignment & Solution
- Since different student resolutions produce different spatial dimensions (H × W), we use bilinear interpolation to resize the student's feature map to match the teacher's spatial shape before computing the contrastive loss (see the sketch below).
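In code, this alignment step is a single `F.interpolate` call; a minimal sketch with dummy shapes:

```python
import torch
import torch.nn.functional as F

def align_to_teacher(student_feats: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
    """Resize (B, C, Hs, Ws) student features to the teacher's (Ht, Wt) grid."""
    return F.interpolate(student_feats, size=teacher_feats.shape[-2:],
                         mode="bilinear", align_corners=False)

s = align_to_teacher(torch.rand(1, 256, 20, 20), torch.rand(1, 256, 40, 40))
assert s.shape[-2:] == (40, 40)
```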
🎯 4. Objective
- Align the student's spatial token embeddings with the teacher's, pixel-to-pixel and semantically, despite resolution gaps and augmentations.
- Encourages scale-invariant, robust feature learning across real-world variations (a sketch of a per-pixel loss follows below).
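A hedged sketch of a per-pixel DINO-style objective consistent with this description (the temperatures follow common DINO defaults; the centering tensor and function name are assumptions):

```python
import torch
import torch.nn.functional as F

def spatial_dino_loss(student, teacher, center, t_s=0.1, t_t=0.04):
    """DINO-style cross-entropy applied per spatial token.
    student, teacher: (B, C, H, W) on the same grid (after alignment);
    center: (1, C, 1, 1) running mean of teacher outputs."""
    p_t = F.softmax((teacher - center) / t_t, dim=1).detach()  # centered, sharpened teacher
    log_p_s = F.log_softmax(student / t_s, dim=1)
    return -(p_t * log_p_s).sum(dim=1).mean()                  # average over pixels and batch

loss = spatial_dino_loss(torch.rand(1, 256, 40, 40), torch.rand(1, 256, 40, 40),
                         center=torch.zeros(1, 256, 1, 1))
```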
📊 Performance: Latent Quality & Downstream Evaluation
Despite being trained without any labels, `core-dino` demonstrates strong latent alignment and generalization capability, both in visual similarity and in downstream tasks.
🛣️ Downstream: Road Extraction (DeepGlobe Dataset)
We evaluated `core-dino` on the DeepGlobe Road Extraction Dataset, using it as a frozen backbone in a simple segmentation pipeline.
Setup:
- Both `core-dino` and YOLOv11-X backbones were frozen
- Only a 2-layer convolutional head was trained
- Task: binary road segmentation using IoU loss (a sketch of the head and loss follows below)
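A minimal sketch of such a probe, assuming backbone features of `in_ch` channels (the head width and the soft-IoU formulation are illustrative):

```python
import torch
import torch.nn as nn

class RoadHead(nn.Module):
    """2-layer convolutional head trained on top of frozen backbone features."""
    def __init__(self, in_ch: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, 1))

    def forward(self, feats):
        return self.net(feats)  # (B, 1, H, W) road logits

def soft_iou_loss(logits, target, eps=1e-6):
    """1 - soft IoU between predicted probabilities and the binary road mask."""
    p = torch.sigmoid(logits)
    inter = (p * target).sum(dim=(1, 2, 3))
    union = (p + target - p * target).sum(dim=(1, 2, 3))
    return (1 - (inter + eps) / (union + eps)).mean()
```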
Result:
- `core-dino` consistently outperformed the supervised YOLOv11-X backbone across all epochs
- Shows superior latent representation quality, even without task-specific supervision
- Demonstrates better generalization and semantic robustness in downstream transfer tasks
🔗 Reproduce this comparison in Colab:
🏗️ Downstream: Building Footprint Validation
To evaluate transferability to structural segmentation tasks, we tested `core-dino` on building footprint extraction using high-resolution satellite imagery.
Setup:
- Compared YOLOv11-X (original weights) vs. YOLOv11-X initialized with `core-dino` weights
- Used the same training pipeline for both (see the sketch below)
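A sketch of that comparison using the Ultralytics training API (`buildings.yaml` is a placeholder dataset config; the base weights, epochs, and image size are illustrative):

```python
import torch
from ultralytics import YOLO

URL = "https://huggingface.co/gajeshladhar/core-dino/resolve/main/checkpoints/student.pt"

def finetune(init_core_dino: bool):
    model = YOLO("yolo11x.pt")  # identical pipeline for both runs
    if init_core_dino:
        ckpt = torch.hub.load_state_dict_from_url(URL, map_location="cpu")
        model.model.load_state_dict(
            {k.replace("layers.", "model."): v for k, v in ckpt.items()},
            strict=False)
    return model.train(data="buildings.yaml", epochs=50, imgsz=640)

baseline = finetune(False)
core_dino_init = finetune(True)
```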
Result:
- `core-dino` achieved a +15 mAP improvement over standard YOLOv11-X
- Captures edge-localized and compact building structures better
- Demonstrates strong spatial precision and high-quality feature encoding
🔗 Reproduce this comparison in Colab:
🗂️ Model Details

| Field | Value |
|---|---|
| Parameters | 56.7M |
| Backbone Architecture | YOLOv11-X |
| Input Size | 320 × 320 → 4096 × 4096 |
| Patch Source | Core-Five |
| Resolutions | 30 cm (clean) → 2 m (augmented) |
| Patch Drop | I-JEPA-style masking |
| Loss | DINO contrastive loss |
| Training Time | ~48 h on 1×A100 |
🌳 License
This project is released under the Creative Commons Attribution-NonCommercial 3.0 Unported (CC BY-NC 3.0) license.
✅ Free to use, share, and adapt for non-commercial research
❌ Commercial use is not permitted without explicit permission
🙏 Please provide appropriate credit when using this model in publications or projects.