arxiv:2002.06289

3D Dynamic Scene Graphs: Actionable Spatial Perception with Places, Objects, and Humans

Published on Feb 15, 2020
Abstract

We present a unified representation for actionable spatial perception: 3D Dynamic Scene Graphs. Scene graphs are directed graphs where nodes represent entities in the scene (e.g. objects, walls, rooms), and edges represent relations (e.g. inclusion, adjacency) among nodes. Dynamic scene graphs (DSGs) extend this notion to represent dynamic scenes with moving agents (e.g. humans, robots), and to include actionable information that supports planning and decision-making (e.g. spatio-temporal relations, topology at different levels of abstraction). Our second contribution is to provide the first fully automatic Spatial PerceptIon eNgine (SPIN) to build a DSG from visual-inertial data. We integrate state-of-the-art techniques for object and human detection and pose estimation, and we describe how to robustly infer object, robot, and human nodes in crowded scenes. To the best of our knowledge, this is the first paper that reconciles visual-inertial SLAM and dense human mesh tracking. Moreover, we provide algorithms to obtain hierarchical representations of indoor environments (e.g. places, structures, rooms) and their relations. Our third contribution is to demonstrate the proposed spatial perception engine in a photo-realistic Unity-based simulator, where we assess its robustness and expressiveness. Finally, we discuss the implications of our proposal on modern robotics applications. 3D Dynamic Scene Graphs can have a profound impact on planning and decision-making, human-robot interaction, long-term autonomy, and scene prediction. A video abstract is available at https://youtu.be/SWbofjhyPzI

Community

  • Introduces 3D Dynamic Scene Graphs (DSGs) for actionable spatial perception. Scene-graph nodes are scene entities (objects, walls, rooms) and edges are relations (inclusion, adjacency), organized into a hierarchy; a DSG additionally models moving agents (humans, robots), and its relations carry actionable information for planning and decision-making. Proposes a Spatial Perception Engine (SPIN) that builds the DSG from visual-inertial (VI: stereo + IMU) data, giving the SLAM and navigation stack deeper knowledge of the environment; it also tracks dense human meshes via the Skinned Multi-Person Linear (SMPL) model.
  • DSG layers: Layer 1 is the metric-semantic mesh of all static elements (nodes are 3D points with position, normal, and panoptic-segmentation label). Layer 2 holds objects and agents: object nodes (chair, table, etc.) carry pose, bounding box, and semantic class, with edges encoding relations; agent nodes (dynamic entities such as robots and humans) carry a 3D pose graph (a collection of time-stamped 3D poses), a mesh model, and a semantic class. Layer 3 holds places (positions in the free space of a room, each linkable to nearby objects; edge connectivity yields a topological map for planning) and structures (structural elements and separators). Layer 4 holds rooms (pose, bounding box, and semantic class per node; edges encode adjacency). Layer 5 is the building, with edges connecting it to all rooms. The hierarchy is hand-designed to answer planning queries, and the composition of layers can be modified.
  • SPIN: Kimera provides the semantically annotated mesh and a Euclidean Signed Distance Field (ESDF, based on Voxblox). The robot node comes from Kimera-VIO (an IMU-aware feature tracker with 2-point RANSAC for geometric verification), with robust pose-graph optimization via Kimera-RPGO. Human nodes: a graph CNN regresses SMPL mesh vertices, PnP recovers the full 3D pose, and temporal-consistency checks filter the tracks; dynamic elements are masked out of the mesh reconstruction. Objects of unknown shape: Euclidean clustering with PCL on the semantically segmented cloud yields instances, from which centroids and canonical axis-aligned bounding boxes (AABBs) are computed. Objects of known shape (CAD model available): 3D Harris feature-based matching followed by TEASER++ for pose estimation/alignment, which is robust to high outlier rates. Places: a topological graph extracted from the ESDF. Structures: direct segmentation (AABBs). Room detection: take a 2D ESDF slice just below the ceiling, truncate it, partition it to label the places it contains (with majority voting at boundaries), and add edges between places and rooms.
  • Experiments in a photo-realistic Unity-based simulator with SMPL-driven simulated human models. The enhanced VIO performs better in dynamic scenarios and on par in static ones; dynamic masking yields cleaner meshes and lower pose (localization) errors. The representation can support bounding-volume-hierarchy (BVH) style queries, high-level object search, traversal, and planning; cross-referencing CAD models and controlling the granularity of the map gives storage benefits for long-term mapping.
  • A dynamic hierarchical mapping and scene-understanding framework from MIT (LIDS, SPARK Lab, Luca Carlone).
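The layered structure described above can be sketched as a small Python data model: one node per entity, inclusion edges between layers, and a hierarchy query that walks from a room down to its objects. All names and fields below are illustrative and are not the actual Kimera-DSG API.

```python
from dataclasses import dataclass, field

# Layer indices as described in the notes (1 = mesh ... 5 = building).
MESH, OBJECTS_AGENTS, PLACES_STRUCTURES, ROOMS, BUILDING = 1, 2, 3, 4, 5

@dataclass
class Node:
    node_id: str
    layer: int
    semantic_class: str
    attrs: dict = field(default_factory=dict)  # pose, bounding box, etc.

@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)
    children: dict = field(default_factory=dict)  # inclusion edges (parent -> children)

    def add(self, node, parent_id=None):
        self.nodes[node.node_id] = node
        self.children.setdefault(node.node_id, [])
        if parent_id is not None:
            self.children[parent_id].append(node.node_id)

    def descendants(self, node_id, layer):
        """All descendants of `node_id` on `layer` (e.g. objects in a room)."""
        out = []
        for child in self.children.get(node_id, []):
            if self.nodes[child].layer == layer:
                out.append(child)
            out.extend(self.descendants(child, layer))
        return out

# Build a toy DSG: building -> room -> place -> objects.
g = SceneGraph()
g.add(Node("building_0", BUILDING, "building"))
g.add(Node("room_kitchen", ROOMS, "kitchen"), parent_id="building_0")
g.add(Node("place_3", PLACES_STRUCTURES, "place"), parent_id="room_kitchen")
g.add(Node("obj_table", OBJECTS_AGENTS, "table"), parent_id="place_3")
g.add(Node("obj_chair", OBJECTS_AGENTS, "chair"), parent_id="place_3")

# A planning-style query: "which chairs are in the kitchen?"
objs = g.descendants("room_kitchen", OBJECTS_AGENTS)
chairs = [o for o in objs if g.nodes[o].semantic_class == "chair"]
print(chairs)  # ['obj_chair']
```

This kind of top-down traversal is what the hand-designed layering is meant to make cheap: a query scoped to a room never touches the rest of the building.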
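The unknown-object step in the SPIN bullet (cluster, centroid, axis-aligned bounding box) reduces to a few lines once clustering is done; here is a minimal sketch with NumPy standing in for the output of PCL's Euclidean cluster extraction.

```python
import numpy as np

def object_node_from_cluster(points):
    """Given one Euclidean cluster of 3D points (N x 3), return the centroid
    and the canonical axis-aligned bounding box (min/max corners), as used
    for object nodes whose shape is unknown."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    aabb_min = points.min(axis=0)
    aabb_max = points.max(axis=0)
    return centroid, (aabb_min, aabb_max)

# Toy cluster: the eight corners of a unit cube.
cube = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)])
c, (lo, hi) = object_node_from_cluster(cube)
print(c)        # [0.5 0.5 0.5]
print(lo, hi)   # [0. 0. 0.] [1. 1. 1.]
```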
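The room-detection recipe (truncate a 2D ESDF slice, then partition it) can be approximated with connected-component labeling; this is a simplified sketch, omitting the place assignment and majority voting at boundaries, and the `truncation` threshold is an illustrative value.

```python
import numpy as np
from scipy import ndimage

def label_rooms(esdf_slice, truncation=0.5):
    """Room-detection sketch: truncate a 2D ESDF slice (taken just below
    the ceiling) so only interior free space survives, then label connected
    components -- each component is a room candidate."""
    interior = esdf_slice > truncation          # keep cells far from obstacles
    labels, n_rooms = ndimage.label(interior)   # connected-component labeling
    return labels, n_rooms

# Toy map: two rooms separated by a wall (the column of zeros).
esdf = np.zeros((5, 7))
esdf[1:4, 1:3] = 1.0   # room A interior
esdf[1:4, 4:6] = 1.0   # room B interior
labels, n = label_rooms(esdf)
print(n)  # 2
```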

Links: MIT Blog, YouTube, PapersWithCode (method), GitHub (part of Kimera)
