---
title: "Duino-Idar: Interactive Indoor 3D Mapping via Mobile Video with Semantic Enrichment"
emoji: ✨
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 4.x
app_file: app.py  # Replace with your actual Gradio app file name if different
tags:
- 3d-mapping
- indoor-reconstruction
- depth-estimation
- semantic-segmentation
- vision-language-model
- mobile-video
- gradio
- point-cloud
- dpt
- paligemma
- computer-vision
- research-paper
license: mit
---

# Duino-Idar: Interactive Indoor 3D Mapping via Mobile Video with Semantic Enrichment

Welcome to the Hugging Face Space for Duino-Idar, a system that transforms mobile video into interactive, semantically annotated 3D room scans. This Space provides access to the research paper, code snippets, and a conceptual demonstration of the system's capabilities.

## Abstract

Duino-Idar presents an end-to-end pipeline for creating interactive 3D maps of indoor spaces from mobile video. It leverages state-of-the-art monocular depth estimation using DPT (Dense Prediction Transformer) models and enhances these geometric reconstructions with semantic context via a fine-tuned vision-language model, PaLiGemma. The system processes video by extracting key frames, estimating depth for each frame, constructing a 3D point cloud, and overlaying semantic labels. A user-friendly Gradio interface is designed for video upload, processing initiation, and interactive exploration of the resulting 3D scenes. This research details the system architecture, key mathematical formulations, implementation highlights, and potential applications in areas like augmented reality, indoor navigation, and automated scene understanding. Future work envisions incorporating LiDAR data for improved accuracy and real-time performance.

## Key Features

*   **Mobile Video Input:** Accepts standard mobile-recorded videos, making data capture straightforward and accessible.
*   **DPT-Based Depth Estimation:** Employs Dense Prediction Transformer (DPT) models for accurate depth inference from single video frames.
*   **PaLiGemma Semantic Enrichment:** Integrates a fine-tuned PaLiGemma vision-language model to provide semantic annotations, enriching the 3D scene with object labels and contextual understanding.
*   **Interactive 3D Point Clouds:** Generates 3D point clouds visualized with Open3D (or Plotly for web-based alternatives), allowing users to rotate, zoom, and pan through the reconstructed scene.
*   **Gradio Web Interface:** Features a user-friendly Gradio GUI for seamless video upload, processing, and interactive 3D model exploration directly in the browser.
*   **Mathematically Grounded Approach:** Based on established mathematical principles, including the pinhole camera model for 3D reconstruction and cross-entropy loss for vision-language model training.
*   **Open-Source Code Snippets:** Key code implementations for depth estimation, 3D reconstruction, and Gradio interface are provided for transparency and reproducibility.

## System Architecture

Duino-Idar operates through a modular pipeline (detailed in the full paper):

1.  **Video Processing & Frame Extraction:** Extracts representative key frames from the input video using OpenCV (a minimal extraction sketch appears at the end of this section).
2.  **Depth Estimation (DPT):** Utilizes a pre-trained DPT model from Hugging Face Transformers to predict a depth map for each extracted frame.
3.  **3D Reconstruction (Pinhole Model):** Converts depth maps into 3D point clouds using a pinhole camera model approximation.
4.  **Semantic Enrichment (PaLiGemma):** Employs a fine-tuned PaLiGemma model to generate semantic labels and scene descriptions for key frames.
5.  **Interactive Visualization (Open3D/Gradio):** Visualizes the resulting semantically enhanced 3D point cloud within an interactive Gradio interface, using Open3D for rendering.

For a comprehensive understanding of the system architecture, refer to Section 3 of the full research paper linked below.
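As a rough illustration of step 1, the sketch below samples evenly spaced key frames from a video with OpenCV. The uniform-sampling strategy and the `num_frames` parameter are simplifying assumptions for this README, not necessarily the key-frame selection criterion used in the full pipeline.

```python
import cv2

def extract_key_frames(video_path, num_frames=8):
    """Minimal key-frame extraction sketch: uniformly sample frames across the clip."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Indices of evenly spaced frames over the whole video
    indices = [int(i * (total - 1) / max(num_frames - 1, 1)) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if ok:
            # OpenCV returns BGR; convert to RGB for PIL/Transformers downstream
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames
```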

## Mathematical Foundations

Duino-Idar's core components are underpinned by the following mathematical principles:

**1. Depth Estimation & Normalization:**

The depth map $D$ is predicted by the DPT model $f$:

$D = f(I; \theta)$

Where $I$ is the input image and $\theta$ represents the model parameters. The depth map is then normalized:

$D_{\text{norm}}(u,v) = \frac{D(u,v)}{\displaystyle \max_{(u,v)} D(u,v)}$

And optionally scaled to an 8-bit range for visualization:

$D_{\text{scaled}}(u,v) = D_{\text{norm}}(u,v) \times 255$

**2. 3D Reconstruction with Pinhole Camera Model:**

Using intrinsic camera parameters (focal lengths $f_x, f_y$ and principal point $c_x, c_y$), and depth $z(u,v)$ for a pixel $(u,v)$, the 3D coordinates $(x, y, z)$ are calculated as:

$x = \frac{(u - c_x) \cdot z(u,v)}{f_x}$

$y = \frac{(v - c_y) \cdot z(u,v)}{f_y}$

$z = z(u,v)$

In matrix form:

$\begin{pmatrix} x \\ y \\ z \end{pmatrix} = z(u,v) \cdot K^{-1} \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}$

Where $K$ is the intrinsic matrix:

$K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}$
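As a worked sketch of this back-projection, the snippet below applies $K^{-1}$ to every pixel in a vectorized way with NumPy. The intrinsic values are placeholders here; in practice they come from calibration or from the approximation used in the code snippets further down.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Back-project a depth map of shape (H, W) into an (H*W, 3) array of 3D points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))              # pixel coordinate grids
    K_inv = np.linalg.inv(np.array([[fx, 0.0, cx],
                                    [0.0, fy, cy],
                                    [0.0, 0.0, 1.0]]))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x N homogeneous pixels
    points = (K_inv @ pixels) * depth.reshape(1, -1)             # scale each ray by its depth z(u, v)
    return points.T                                              # N x 3 points (x, y, z)
```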

**3. Point Cloud Aggregation:**

Point clouds $P_i$ from individual frames are aggregated into a final point cloud $P$:

$P = \bigcup_{i=1}^{M} P_i$
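A minimal sketch of this aggregation, assuming each frame's points and colors are already available as per-frame NumPy arrays:

```python
import numpy as np
import open3d as o3d

def aggregate_point_clouds(points_list, colors_list):
    """Stack per-frame (N_i x 3) point/color arrays into one cloud, realizing P = union of P_i."""
    merged = o3d.geometry.PointCloud()
    merged.points = o3d.utility.Vector3dVector(np.vstack(points_list))
    merged.colors = o3d.utility.Vector3dVector(np.vstack(colors_list))
    return merged
```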

**4. PaLiGemma Fine-Tuning Loss:**

The PaLiGemma model is fine-tuned to minimize the cross-entropy loss $\mathcal{L}$ for predicting caption tokens $c = (c_1, ..., c_T)$ given an image $I$:

$\mathcal{L} = -\sum_{t=1}^{T} \log P(c_t \mid c_{<t}, I)$
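To make the loss concrete, the sketch below shows how this teacher-forced cross-entropy is typically computed in PyTorch for an autoregressive captioner. Here `logits` and `caption_ids` are hypothetical tensors standing in for the model outputs and the tokenized caption; this is not an actual PaLiGemma training call.

```python
import torch
import torch.nn.functional as F

def caption_cross_entropy(logits: torch.Tensor, caption_ids: torch.Tensor) -> torch.Tensor:
    """logits: (T, V) vocabulary scores conditioned on the image and previous tokens.
    caption_ids: (T,) ground-truth token ids c_1..c_T."""
    # Summing over positions matches -sum_t log P(c_t | c_<t, I) above
    return F.cross_entropy(logits, caption_ids, reduction="sum")
```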

For a more detailed mathematical treatment, please refer to Section 4.1 of the full research paper.

## Code Snippets

Here are key Python code snippets illustrating the core functionalities of Duino-Idar:

**1. Depth Estimation with DPT (using Hugging Face Transformers):**

```python
import torch
from transformers import DPTFeatureExtractor, DPTForDepthEstimation
from PIL import Image
import numpy as np

dpt_model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")
feature_extractor = DPTFeatureExtractor.from_pretrained("Intel/dpt-large")

def estimate_depth(image):
    # Prepare the image for the DPT model (resizing/normalization handled by the extractor)
    inputs = feature_extractor(images=image, return_tensors="pt")
    with torch.no_grad():
        depth_map = dpt_model(**inputs).predicted_depth.squeeze().numpy()
    # Normalize to the 0-255 (8-bit) range for visualization and downstream scaling
    depth_map = (depth_map / np.max(depth_map) * 255).astype(np.uint8)
    return depth_map

# Example usage:
image = Image.open("example_frame.jpg").convert("RGB")  # Replace with your image path
depth_map = estimate_depth(image)
# depth_map is an (H', W') uint8 NumPy array at the model's output resolution
```
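Note that the DPT head predicts depth at the model's own output resolution, which generally differs from the input frame size. If a depth map aligned with the original frame is needed, one common option is to upsample the raw prediction before normalizing, as in the hedged variant below (it reuses `dpt_model` and `feature_extractor` from the snippet above).

```python
import torch
import torch.nn.functional as F
import numpy as np

def estimate_depth_full_res(image):
    """Variant of estimate_depth that upsamples the prediction to the input image size."""
    inputs = feature_extractor(images=image, return_tensors="pt")
    with torch.no_grad():
        predicted = dpt_model(**inputs).predicted_depth        # (1, H', W') at model resolution
    depth = F.interpolate(
        predicted.unsqueeze(1),                                # (1, 1, H', W')
        size=image.size[::-1],                                 # PIL size is (W, H); interpolate expects (H, W)
        mode="bicubic",
        align_corners=False,
    ).squeeze().numpy()
    return (depth / depth.max() * 255).astype(np.uint8)
```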

**2. 3D Point Cloud Reconstruction (using Open3D):**

```python
import open3d as o3d
import numpy as np

def reconstruct_3d(depth_map, image):
    h, w = depth_map.shape
    fx = fy = max(h, w) / 2.0  # Approximate intrinsics (no calibration assumed)
    cx, cy = w / 2.0, h / 2.0
    points = []
    colors = []
    # Resize the RGB frame to the depth-map resolution so colors and depths align
    image_np = np.array(image.resize((w, h))) / 255.0

    for v in range(h):
        for u in range(w):
            z = depth_map[v, u] / 255.0 * 5.0  # Map 8-bit depth to an arbitrary 0-5 range
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append([x, y, z])
            colors.append(image_np[v, u])

    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(np.array(points))
    pcd.colors = o3d.utility.Vector3dVector(np.array(colors))
    return pcd

# Example usage:
# point_cloud = reconstruct_3d(depth_map, image)
# o3d.io.write_point_cloud("output.ply", point_cloud)
```

**3. Basic Gradio Interface (for demonstration - full interface in paper):**

```python
import gradio as gr
import open3d as o3d

def visualize_3d_model(ply_file):
    # Opens a native Open3D window; intended for local use, not a hosted Space
    pcd = o3d.io.read_point_cloud(ply_file)
    o3d.visualization.draw_geometries([pcd])

with gr.Blocks() as demo:
    gr.Markdown("### Duino-Idar 3D Mapping Demo")
    video_input = gr.Video(label="Upload Video")  # Gradio passes the uploaded video as a file path
    ply_output = gr.File(label="Generated Point Cloud (.ply)")
    process_btn = gr.Button("Process & Visualize")
    # ... (Integration of processing functions would go here in a full app) ...

    # Placeholder: a full app would run frame extraction, depth estimation and reconstruction here
    process_btn.click(fn=lambda video_path: "output.ply", inputs=video_input, outputs=ply_output)

demo.launch()
```
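Since `o3d.visualization.draw_geometries` opens a native window, it will not work inside a hosted Space. One in-browser alternative is Gradio's `gr.Model3D` component, as in the minimal sketch below; whether it accepts `.ply` files depends on your Gradio version, so treat that as an assumption to verify (`.obj`/`.glb` are more broadly supported).

```python
import gradio as gr

# Minimal in-browser alternative: serve the generated point cloud through gr.Model3D.
with gr.Blocks() as demo:
    video_input = gr.Video(label="Upload Video")
    model_output = gr.Model3D(label="Reconstructed Point Cloud")
    gr.Button("Process & Visualize").click(
        fn=lambda video_path: "output.ply",  # placeholder: run the full pipeline here
        inputs=video_input,
        outputs=model_output,
    )

demo.launch()
```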

These snippets illustrate the core implementation of Duino-Idar. For the complete implementation and Gradio application, please refer to the code examples in the full research paper (Section 4.3).

## Interactive Demo

[**Conceptual Gradio Demo Space (Under Development)**](https://huggingface.co/spaces/Duino/duino-idar)  **(Replace with your actual Space URL once created)**

While a fully interactive demo is currently under development, the linked Space will eventually host a Gradio application allowing you to upload your own mobile videos and visualize the generated 3D point clouds. Please check back for updates!

## Future Work

Future development of Duino-Idar will focus on:

*   **Enhanced Semantic Integration:**  Implementing robust semantic label overlay directly onto the point cloud geometry.
*   **Multi-Frame Fusion & SLAM:** Incorporating SLAM or multi-view stereo techniques for improved reconstruction accuracy and handling camera motion.
*   **LiDAR Integration (Duino-*Idar* Vision):** Exploring the fusion of LiDAR data to complement video-based depth estimation for greater precision and robustness.
*   **Real-Time Performance Optimization:** Optimizing the pipeline for real-time or near-real-time 3D mapping on mobile platforms.
*   **Advanced User Interface:** Developing a more immersive and feature-rich user interface for interactive 3D scene exploration and manipulation.

## Citation

If you utilize Duino-Idar in your research, please cite the following:

```bibtex
@misc{duino-idar-2024,
  author = {Jalal Mansour (Jalal Duino)},
  title = {Duino-Idar: Interactive Indoor 3D Mapping via Mobile Video with Semantic Enrichment},
  year = {2024},
  publisher = {Hugging Face Space},
  howpublished = {Online},
  url = {https://huggingface.co/spaces/Duino/duino-idar}
}
```

## Contact

For inquiries, collaborations, or further information, please contact:

Jalal Mansour (Jalal Duino) - [[email protected]](mailto:[email protected])

## Full Research Paper

[**Link to Full Research Paper (PDF)**](Link to your full paper PDF here - e.g., Google Drive, Dropbox, personal website)  **(Replace with actual link to your paper)**

This README.md provides a comprehensive overview of the Duino-Idar project. We encourage you to read the full research paper for in-depth details and stay tuned for updates on the interactive demo Space!