---
title: "Duino-Idar: Interactive Indoor 3D Mapping via Mobile Video with Semantic Enrichment"
emoji: ✨
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 4.x
app_file: app.py # Replace with your actual Gradio app file if you have one
tags:
  - 3d-mapping
  - indoor-reconstruction
  - depth-estimation
  - semantic-segmentation
  - vision-language-model
  - mobile-video
  - gradio
  - point-cloud
  - dpt
  - paligemma
  - computer-vision
  - research-paper
author: Jalal Mansour (Jalal Duino)
date: 2025-02-18
email: Jalalmansour663@gmail.com
hf_space: Duino # Assuming 'Duino' is your HF username/space name
license: mit
---

# Duino-Idar: Interactive Indoor 3D Mapping via Mobile Video with Semantic Enrichment

**Author:** Jalal Mansour (Jalal Duino)
**Date Created:** 2025-02-18
**Email:** [Jalalmansour663@gmail.com](mailto:Jalalmansour663@gmail.com)
**Hugging Face Space:** [https://huggingface.co/Duino](https://huggingface.co/Duino)
**License:** MIT License

---

## Abstract

This paper introduces Duino-Idar, a novel end-to-end system for generating interactive 3D maps of indoor environments from mobile video. Leveraging state-of-the-art monocular depth estimation, specifically DPT (Dense Prediction Transformer)-based models, and semantic understanding via a fine-tuned vision-language model (PaLiGemma), Duino-Idar offers a comprehensive solution for indoor scene reconstruction. The system extracts key frames from video input, computes depth maps, constructs a 3D point cloud, and enriches it with semantic labels. A user-friendly Gradio-based graphical user interface (GUI) facilitates video upload, processing, and interactive 3D scene exploration. This paper details the system's architecture, implementation, and potential applications in areas such as indoor navigation, augmented reality, and automated scene understanding, setting the stage for future enhancements including LiDAR integration for improved accuracy and robustness.

**Keywords:** 3D Mapping, Indoor Reconstruction, Mobile Video, Depth Estimation, Semantic Segmentation, Vision-Language Models, DPT, PaLiGemma, Point Cloud, Gradio, Interactive Visualization.

---

## 1. Introduction

Recent advances in computer vision and deep learning have significantly propelled 3D scene reconstruction from 2D imagery. Mobile devices, now ubiquitous and equipped with high-quality cameras, provide a readily available source of video data for spatial mapping. While monocular depth estimation has matured considerably, enabling real-time applications, many existing 3D reconstruction approaches lack a crucial component: semantic understanding of the scene. This semantic context is vital for truly interactive and context-aware applications such as augmented reality (AR) navigation, object recognition, and scene understanding for robotic systems. To address this gap, we present Duino-Idar, a system that integrates a robust depth estimation pipeline with a fine-tuned vision-language model, PaLiGemma, to enhance indoor 3D mapping. The system's name reflects the vision of combining accessible technology ("Duino," referencing approachability and user-centric design) with advanced spatial sensing ("Idar," hinting at the potential for LiDAR integration in future iterations, although the current prototype focuses on vision-based depth).
This synergistic combination not only achieves geometric reconstruction but also provides semantic enrichment, significantly enhancing both visualization and user interaction. This paper details the architecture, implementation, and potential of Duino-Idar, highlighting its contribution to accessible and semantically rich indoor 3D mapping.

---

## 2. Related Work

Our work builds upon and integrates several key areas of research.

### 2.1 Monocular Depth Estimation

The foundation of our geometric reconstruction lies in monocular depth estimation. Models such as MiDaS [1] and DPT [2] have demonstrated remarkable capabilities in inferring depth from single images. DPT, in particular, leverages transformer architectures to capture global contextual information, leading to improved depth accuracy compared to earlier convolutional neural network (CNN)-based methods. Equations (2)–(4) in Section 4 describe the depth normalization used in DPT-like models to scale the predicted depth map to a usable range.

### 2.2 3D Reconstruction Techniques

Generating 3D point clouds or meshes from 2D inputs is a well-established field, encompassing techniques from photogrammetry [3] and Simultaneous Localization and Mapping (SLAM) [4]. Our approach uses depth maps derived from DPT to construct a point cloud, offering a simpler yet effective method for 3D scene representation, particularly suitable for indoor environments where texture and feature richness support monocular depth estimation. The transformation from 2D pixel coordinates to 3D space is described by the pinhole camera model, as shown in Equations (5)–(9).

### 2.3 Vision-Language Models for Semantic Understanding

Vision-language models (VLMs) have emerged as powerful tools for bridging visual and textual understanding. PaLiGemma [5] is a state-of-the-art multimodal model that integrates image understanding with natural language processing. Fine-tuning such models on domain-specific data, such as indoor scenes, allows the generation of semantic annotations and descriptions that can be overlaid on reconstructed 3D models, enriching them with contextual information. The fine-tuning objective for PaLiGemma, which minimizes the token prediction loss, is formalized in Equation (11).

### 2.4 Interactive 3D Visualization

Effective visualization is crucial for user interaction with 3D data. Libraries like Open3D [6] and Plotly [7] provide tools for interactive exploration of 3D point clouds and meshes. Open3D, in particular, offers robust functionality for point cloud manipulation, rendering, and visualization, making it an ideal choice for desktop-based interactive 3D scene exploration. For web-based interaction, Plotly offers excellent capabilities for embedding interactive 3D visualizations within web applications.

---

## 3. System Architecture: Duino-Idar Pipeline

### 3.1 Overview

The Duino-Idar system is structured into three primary modules, as illustrated in Figure 1:

1. **Video Processing and Frame Extraction:** This module ingests mobile video input and extracts representative key frames at configurable intervals to reduce computational redundancy while still capturing scene changes.
2. **Depth Estimation and 3D Reconstruction:** Each extracted frame is processed by a DPT-based depth estimator to generate a depth map. These depth maps are then converted into 3D point clouds using a pinhole camera model, transforming 2D pixel coordinates into 3D spatial positions.
3. **Semantic Enrichment and Visualization:** A fine-tuned PaLiGemma model provides semantic annotations for the extracted key frames, enriching the 3D reconstruction with object labels and scene descriptions. A Gradio-based GUI ties these modules together, providing a user-friendly interface for video upload, processing, interactive 3D visualization, and exploration of the semantically enhanced 3D scene.

```mermaid
graph LR
    A[Mobile Video Input] --> B("Video Processing & Frame Extraction");
    B --> C("Depth Estimation (DPT)");
    C --> D("3D Reconstruction (Pinhole Model)");
    B --> E("Semantic Enrichment (PaLiGemma)");
    D --> F(Point Cloud);
    E --> G(Semantic Labels);
    F & G --> H(Semantic Integration);
    H --> I("Interactive 3D Viewer (Open3D/Plotly)");
    I --> J[Gradio GUI & User Interaction];
    style B fill:#f9f,stroke:#333,stroke-width:2px
    style C fill:#ccf,stroke:#333,stroke-width:2px
    style D fill:#fcc,stroke:#333,stroke-width:2px
    style E fill:#cfc,stroke:#333,stroke-width:2px
    style H fill:#eee,stroke:#333,stroke-width:2px
    style I fill:#ace,stroke:#333,stroke-width:2px
```

*Figure 1: Duino-Idar system architecture. The diagram illustrates the flow of data through the system modules, from video input to interactive 3D visualization with semantic enrichment.*

### 3.2 Detailed Pipeline

The Duino-Idar pipeline operates through the following steps:

1. **Input Module:**
   * **Video Upload:** Users initiate the process by uploading a mobile-recorded video via the Gradio web interface.
   * **Frame Extraction:** OpenCV extracts frames from the uploaded video at regular, user-configurable intervals. This interval determines the density of key frames used for reconstruction, balancing computational cost against scene detail.
2. **Depth Estimation Module:**
   * **Preprocessing:** Each extracted frame is resized and normalized to match the input requirements of the DPT model, ensuring consistent input dimensions and value ranges for the depth estimation network.
   * **Depth Prediction:** The preprocessed frame is fed into the DPT model, which generates a depth map representing the estimated distance of each pixel from the camera.
   * **Normalization and Scaling:** The raw depth map is normalized to a standard range (e.g., 0–1 or 0–255) for subsequent 3D reconstruction and visualization. Equations (2)–(4) detail this step.
3. **3D Reconstruction Module:**
   * **Point Cloud Generation:** A pinhole camera model converts the depth map and corresponding pixel coordinates into 3D coordinates in camera space. Color information from the original frame is associated with each 3D point to create a colored point cloud. Equations (6)–(9) formalize this transformation; a minimal code sketch is given at the end of Section 4.1.
   * **Point Cloud Aggregation:** To build a comprehensive 3D model, point clouds generated from multiple key frames are aggregated. In this initial implementation, we assume a static camera or negligible inter-frame motion for simplicity. More advanced implementations could incorporate camera pose estimation and point cloud registration for improved accuracy, especially in dynamic scenes. The aggregation is expressed by Equation (10).
4. **Semantic Enhancement Module:**
   * **Vision-Language Processing:** The fine-tuned PaLiGemma model processes the key frames to generate scene descriptions and semantic labels. The model is prompted to identify objects and provide contextual information relevant to indoor scenes.
   * **Semantic Data Integration:** Semantic labels generated by PaLiGemma are overlaid onto the reconstructed point cloud. This integration can be achieved in several ways, such as associating semantic labels with clusters of points or generating bounding boxes around semantically labeled objects within the 3D scene.
5. **Visualization and User Interface Module:**
   * **Interactive 3D Viewer:** The final semantically enriched 3D model is visualized with Open3D (or Plotly for web-based deployments). Users can rotate, zoom, and pan to explore the reconstructed environment.
   * **Gradio GUI:** A user-friendly Gradio web interface provides a seamless experience, allowing users to upload videos, launch the processing pipeline, and interactively navigate the resulting 3D scene. The GUI also exposes controls for parameters such as the frame extraction interval and, optionally, the display of semantic labels.

---

## 4. Mathematical Foundations and Implementation Details

### 4.1 Mathematical Framework

The Duino-Idar system relies on several core mathematical principles.

**1. Depth Estimation via Deep Network:**

Let $I \in \mathbb{R}^{H \times W \times 3}$ represent the input image of height $H$ and width $W$. The DPT model, denoted $f$ with learnable parameters $\theta$, estimates the depth map $D$:

**(1)** $D = f(I; \theta)$

The depth map $D$ is normalized to obtain $D_{\text{norm}}$:

**(2)** $D_{\text{norm}}(u,v) = \frac{D(u,v)}{\displaystyle \max_{(u,v)} D(u,v)}$

If a maximum physical depth $Z_{\max}$ is assumed, the scaled depth $z(u,v)$ is:

**(3)** $z(u,v) = D_{\text{norm}}(u,v) \times Z_{\max}$

For practical implementation and visualization, the depth is often scaled to an 8-bit range:

**(4)** $D_{\text{scaled}}(u,v) = \frac{D(u,v)}{\displaystyle \max_{(u,v)} D(u,v)} \times 255$

**2. 3D Reconstruction with Pinhole Camera Model:**

Assuming a pinhole camera model with focal lengths $(f_x, f_y)$ and principal point $(c_x, c_y)$, the intrinsic matrix $K$ is:

**(5)** $K = \begin{pmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{pmatrix}$

Given a pixel $(u, v)$ with depth value $z(u,v)$, its 3D coordinates $(x, y, z)$ in the camera coordinate system are:

**(6)** $x = \frac{(u - c_x) \cdot z(u,v)}{f_x}$

**(7)** $y = \frac{(v - c_y) \cdot z(u,v)}{f_y}$

**(8)** $z = z(u,v)$

or, in matrix form:

**(9)** $\begin{pmatrix} x \\ y \\ z \end{pmatrix} = z(u,v) \cdot K^{-1} \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}$

**3. Aggregation of Multiple Frames:**

Let $P_i = \{(x_{i,j}, y_{i,j}, z_{i,j}) \mid j = 1, 2, \ldots, N_i\}$ be the point cloud from the $i^{\text{th}}$ frame. The overall point cloud $P$ is the union

**(10)** $P = \bigcup_{i=1}^{M} P_i$

where $M$ is the number of frames.

**4. Fine-Tuning PaLiGemma Loss:**

For fine-tuning PaLiGemma, given an image $I$ and caption tokens $c = (c_1, c_2, \ldots, c_T)$, the cross-entropy loss $\mathcal{L}$ is minimized:

**(11)** $\mathcal{L} = -\sum_{t=1}^{T} \log P(c_t \mid c_{<t}, I; \theta)$

where $c_{<t}$ denotes the caption tokens preceding $c_t$.
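
To make the geometric pipeline concrete, the sketch below shows one way Equations (2)–(3), (6)–(9), and (10) could be implemented with NumPy and Open3D. It is a minimal illustration under simplifying assumptions, not the Duino-Idar reference implementation: the intrinsics `FX`, `FY`, `CX`, `CY`, the depth scale `Z_MAX`, and the function names are placeholders chosen for readability, and real intrinsic values would come from camera calibration.

```python
import numpy as np
import open3d as o3d

# Assumed pinhole intrinsics (Equation (5)) and depth scale; illustrative constants only.
FX, FY = 500.0, 500.0   # focal lengths f_x, f_y in pixels
CX, CY = 320.0, 240.0   # principal point c_x, c_y in pixels
Z_MAX = 5.0             # assumed maximum indoor depth in metres (Equation (3))


def normalize_depth(raw_depth: np.ndarray) -> np.ndarray:
    """Equations (2)-(3): scale a raw depth map to the metric range [0, Z_MAX]."""
    d_norm = raw_depth / raw_depth.max()
    return d_norm * Z_MAX


def backproject(depth: np.ndarray, rgb: np.ndarray) -> o3d.geometry.PointCloud:
    """Equations (6)-(9): lift every pixel (u, v) with depth z(u, v) into camera space."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinate grids, shape (h, w)
    x = (u - CX) * depth / FX
    y = (v - CY) * depth / FY
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    colors = rgb.reshape(-1, 3) / 255.0             # H x W x 3 uint8 RGB -> [0, 1] floats

    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points)
    pcd.colors = o3d.utility.Vector3dVector(colors)
    return pcd


def aggregate(frames, raw_depths) -> o3d.geometry.PointCloud:
    """Equation (10): union of per-frame point clouds under the static-camera assumption."""
    merged = o3d.geometry.PointCloud()
    for rgb, raw in zip(frames, raw_depths):
        merged += backproject(normalize_depth(raw), rgb)
    return merged
```

Here `+=` simply concatenates Open3D point clouds, mirroring the set union in Equation (10); as noted in Section 3.2, a more robust system would estimate camera poses and register the per-frame clouds before merging. The fused cloud can then be handed to `o3d.visualization.draw_geometries([...])` for interactive inspection.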