arxiv:2210.02493

Depth Is All You Need for Monocular 3D Detection

Published on Oct 5, 2022

Abstract

A key contributor to recent progress in 3D detection from single images is monocular depth estimation. Existing methods focus on how to leverage depth explicitly, by generating pseudo-pointclouds or providing attention cues for image features. More recent works leverage depth prediction as a pretraining task and fine-tune the depth representation while training it for 3D detection. However, the adaptation is insufficient and is limited in scale by manual labels. In this work, we propose to further align the depth representation with the target domain in an unsupervised fashion. Our methods leverage commonly available LiDAR or RGB videos during training time to fine-tune the depth representation, which leads to improved 3D detectors. Especially when using RGB videos, we show that our two-stage training, which first generates pseudo-depth labels, is critical because of the inconsistency in loss distribution between the two tasks. With either type of reference data, our multi-task learning approach improves over the state of the art on both KITTI and NuScenes, while matching the test-time complexity of its single-task sub-network.
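The multi-task setup described in the abstract can be summarized in a short sketch: a shared image backbone feeds both a 3D-detection head and a depth head, and the depth head is supervised by reference depth from LiDAR or by pseudo-depth labels generated in a first training stage from RGB video. This is a minimal sketch under assumptions: the module names (`MonoDepth3DDetector`, `multitask_loss`), the loss choices, and the weighting factor `lambda_depth` are illustrative, not the paper's actual implementation.

```python
# Minimal sketch of a multi-task objective: shared backbone, a 3D-detection
# head, and an auxiliary depth head supervised by LiDAR or pseudo-depth labels.
# All module and hyperparameter choices below are assumptions for illustration.
import torch
import torch.nn as nn

class MonoDepth3DDetector(nn.Module):
    """Shared image backbone feeding a detection head and a depth head."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(              # stand-in image encoder
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
        )
        self.det_head = nn.Conv2d(feat_dim, 8, 1)   # toy 3D-box regression head
        self.depth_head = nn.Conv2d(feat_dim, 1, 1) # per-pixel depth prediction

    def forward(self, images):
        feats = self.backbone(images)
        return self.det_head(feats), self.depth_head(feats)

def multitask_loss(det_pred, det_target, depth_pred, depth_ref, lambda_depth=0.5):
    """Detection loss plus an auxiliary depth loss.

    `depth_ref` is either sparse LiDAR depth or a pseudo-depth map
    precomputed from RGB video in a first training stage.
    """
    det_loss = nn.functional.smooth_l1_loss(det_pred, det_target)
    valid = depth_ref > 0                            # supervise only where reference depth exists
    depth_loss = nn.functional.l1_loss(depth_pred[valid], depth_ref[valid])
    return det_loss + lambda_depth * depth_loss

# Usage with random tensors standing in for a real batch.
model = MonoDepth3DDetector()
images = torch.randn(2, 3, 96, 320)
det_target = torch.randn(2, 8, 96, 320)
depth_ref = torch.rand(2, 1, 96, 320)                # pretend LiDAR / pseudo-depth map
det_pred, depth_pred = model(images)
loss = multitask_loss(det_pred, det_target, depth_pred, depth_ref)
loss.backward()
```

Because the depth head serves only as auxiliary supervision during training, it can be dropped at inference, which is how the approach matches the test-time complexity of the single-task detection sub-network.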
