---
title: "Duino-Idar: Interactive Indoor 3D Mapping via Mobile Video with Semantic Enrichment"
emoji: ✨
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 4.x
app_file: app.py  # Replace with your actual Gradio app file name if different
tags:
  - 3d-mapping
  - indoor-reconstruction
  - depth-estimation
  - semantic-segmentation
  - vision-language-model
  - mobile-video
  - gradio
  - point-cloud
  - dpt
  - paligemma
  - computer-vision
  - research-paper
license: mit
---

# Duino-Idar: Interactive Indoor 3D Mapping via Mobile Video with Semantic Enrichment

Welcome to the Hugging Face Space for Duino-Idar, an innovative system transforming mobile video into interactive 3D room scans with semantic understanding. This Space provides access to the research paper, code snippets, and a conceptual demonstration outlining the system's capabilities.

## Abstract

Duino-Idar presents an end-to-end pipeline for creating interactive 3D maps of indoor spaces from mobile video. It leverages state-of-the-art monocular depth estimation using DPT (Dense Prediction Transformer) models and enhances these geometric reconstructions with semantic context via a fine-tuned vision-language model, PaLiGemma. The system processes video by extracting key frames, estimating depth for each frame, constructing a 3D point cloud, and overlaying semantic labels. A user-friendly Gradio interface is designed for video upload, processing initiation, and interactive exploration of the resulting 3D scenes. This research details the system architecture, key mathematical formulations, implementation highlights, and potential applications in areas like augmented reality, indoor navigation, and automated scene understanding. Future work envisions incorporating LiDAR data for improved accuracy and real-time performance.

## Key Features

* **Mobile Video Input:** Accepts standard mobile-recorded videos, making data capture straightforward and accessible.
* **DPT-Based Depth Estimation:** Employs Dense Prediction Transformer (DPT) models for accurate depth inference from single video frames.
* **PaLiGemma Semantic Enrichment:** Integrates a fine-tuned PaLiGemma vision-language model to provide semantic annotations, enriching the 3D scene with object labels and contextual understanding.
* **Interactive 3D Point Clouds:** Generates 3D point clouds visualized with Open3D (or Plotly for web-based alternatives), allowing users to rotate, zoom, and pan through the reconstructed scene.
* **Gradio Web Interface:** Features a user-friendly Gradio GUI for seamless video upload, processing, and interactive 3D model exploration directly in the browser.
* **Mathematically Grounded Approach:** Based on established mathematical principles, including the pinhole camera model for 3D reconstruction and cross-entropy loss for vision-language model training.
* **Open-Source Code Snippets:** Key code implementations for depth estimation, 3D reconstruction, and the Gradio interface are provided for transparency and reproducibility.

## System Architecture

Duino-Idar operates through a modular pipeline (detailed in the full paper):

1. **Video Processing & Frame Extraction:** Extracts representative key frames from the input video using OpenCV.
2. **Depth Estimation (DPT):** Utilizes a pre-trained DPT model from Hugging Face Transformers to predict depth maps from each extracted frame.
3. **3D Reconstruction (Pinhole Model):** Converts depth maps into 3D point clouds using a pinhole camera model approximation.
4. **Semantic Enrichment (PaLiGemma):** Employs a fine-tuned PaLiGemma model to generate semantic labels and scene descriptions for key frames.
5. **Interactive Visualization (Open3D/Gradio):** Visualizes the resulting semantically enhanced 3D point cloud within an interactive Gradio interface using Open3D for rendering.

For a comprehensive understanding of the system architecture, refer to Section 3 of the full research paper linked below.
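As a companion to stages 1 and 2 above, the snippet below is a minimal sketch of key-frame extraction with OpenCV followed by DPT depth inference via Hugging Face Transformers. The fixed sampling interval, the `Intel/dpt-large` checkpoint, and the helper names `extract_key_frames` / `estimate_depth` are illustrative assumptions, not the exact configuration used by Duino-Idar.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from transformers import DPTImageProcessor, DPTForDepthEstimation

# Illustrative checkpoint; the paper's exact DPT variant may differ.
processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large")
model.eval()

def extract_key_frames(video_path: str, every_n: int = 30) -> list[Image.Image]:
    """Stage 1: sample every n-th frame from the video with OpenCV."""
    cap = cv2.VideoCapture(video_path)
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            # OpenCV decodes BGR; convert to RGB for the DPT processor.
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        idx += 1
    cap.release()
    return frames

def estimate_depth(image: Image.Image) -> np.ndarray:
    """Stage 2: predict a depth map D = f(I; theta) with DPT and normalize it to [0, 1]."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Resize the predicted depth back to the input resolution.
    depth = torch.nn.functional.interpolate(
        outputs.predicted_depth.unsqueeze(1),
        size=image.size[::-1],  # PIL size is (width, height); interpolate wants (height, width)
        mode="bicubic",
        align_corners=False,
    ).squeeze().cpu().numpy()
    return depth / depth.max()  # D_norm as defined in the paper
```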
## Mathematical Foundations

Duino-Idar's core components are underpinned by the following mathematical principles.

**1. Depth Estimation & Normalization:** The depth map $D$ is predicted by the DPT model $f$:

$$D = f(I; \theta)$$

where $I$ is the input image and $\theta$ represents the model parameters. The depth map is then normalized:

$$D_{\text{norm}}(u,v) = \frac{D(u,v)}{\displaystyle \max_{(u,v)} D(u,v)}$$

and optionally scaled to an 8-bit range for visualization:

$$D_{\text{scaled}}(u,v) = D_{\text{norm}}(u,v) \times 255$$

**2. 3D Reconstruction with Pinhole Camera Model:** Using intrinsic camera parameters (focal lengths $f_x, f_y$ and principal point $c_x, c_y$) and depth $z(u,v)$ for a pixel $(u,v)$, the 3D coordinates $(x, y, z)$ are calculated as:

$$x = \frac{(u - c_x) \cdot z(u,v)}{f_x}, \qquad y = \frac{(v - c_y) \cdot z(u,v)}{f_y}, \qquad z = z(u,v)$$

In matrix form:

$$\begin{pmatrix} x \\ y \\ z \end{pmatrix} = z(u,v) \cdot K^{-1} \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}$$

where $K$ is the intrinsic matrix:

$$K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}$$

**3. Point Cloud Aggregation:** Point clouds $P_i$ from individual frames are aggregated into a final point cloud $P$:

$$P = \bigcup_{i=1}^{M} P_i$$

**4. PaLiGemma Fine-Tuning Loss:** The PaLiGemma model is fine-tuned to minimize the cross-entropy loss $\mathcal{L}$ for predicting caption tokens $c = (c_1, \dots, c_T)$ given an image $I$:

$$\mathcal{L} = -\sum_{t=1}^{T} \log P(c_t \mid c_{<t}, I)$$

where $c_{<t}$ denotes the tokens generated before step $t$.
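To make the reconstruction equations concrete, here is a minimal NumPy/Open3D sketch of stage 3: back-projecting a depth map through $K^{-1}$ into a point cloud and aggregating per-frame clouds. The intrinsic values, voxel size, and helper names are placeholder assumptions for illustration, not calibrated values from the paper.

```python
import numpy as np
import open3d as o3d

def depth_to_point_cloud(depth: np.ndarray, rgb: np.ndarray,
                         fx: float, fy: float,
                         cx: float, cy: float) -> o3d.geometry.PointCloud:
    """Back-project a depth map into 3D points via the pinhole model:
    x = (u - cx) * z / fx,  y = (v - cy) * z / fy,  z = z(u, v)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3) / 255.0  # rgb: H x W x 3 uint8 array

    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    pcd.colors = o3d.utility.Vector3dVector(colors)
    return pcd

def aggregate_point_clouds(clouds):
    """P = union of the per-frame clouds P_i, followed by voxel downsampling."""
    merged = o3d.geometry.PointCloud()
    for pcd in clouds:
        merged += pcd  # Open3D overloads += to concatenate point clouds
    return merged.voxel_down_sample(voxel_size=0.02)

# Example usage with placeholder intrinsics (fx, fy, cx, cy are assumptions):
# clouds = [depth_to_point_cloud(d, np.asarray(f), fx=525.0, fy=525.0, cx=319.5, cy=239.5)
#           for d, f in zip(depth_maps, frames)]
# scene = aggregate_point_clouds(clouds)
# o3d.visualization.draw_geometries([scene])
```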
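Finally, as a hedged sketch of what the `app.py` entry point referenced in the front matter might look like, the snippet below wires a placeholder pipeline driver to a Gradio interface with a video input and a 3D viewer output. The `process_video` helper, the `.ply` output path, and the component choices are illustrative assumptions rather than the actual application code.

```python
import gradio as gr

def process_video(video_path: str) -> str:
    """Placeholder pipeline driver: extract key frames, estimate depth,
    reconstruct the scene, and return a path to the saved point cloud."""
    # frames = extract_key_frames(video_path)
    # depths = [estimate_depth(f) for f in frames]
    # scene = aggregate_point_clouds(...)
    # o3d.io.write_point_cloud("scene.ply", scene)
    return "scene.ply"  # assumes the installed Gradio Model3D component accepts .ply files

demo = gr.Interface(
    fn=process_video,
    inputs=gr.Video(label="Indoor walkthrough video"),
    outputs=gr.Model3D(label="Reconstructed point cloud"),
    title="Duino-Idar: Interactive Indoor 3D Mapping",
    description="Upload a mobile video to generate a semantically enriched 3D point cloud.",
)

if __name__ == "__main__":
    demo.launch()
```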