Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
Abstract
In this paper, we present an open-set object detector, called Grounding DINO, built by marrying the Transformer-based detector DINO with grounded pre-training; it can detect arbitrary objects given human inputs such as category names or referring expressions. The key to open-set object detection is introducing language into a closed-set detector for open-set concept generalization. To effectively fuse the language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection module, and a cross-modality decoder for cross-modality fusion. While previous works mainly evaluate open-set object detection on novel categories, we propose to also evaluate referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well in all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves 52.5 AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. It sets a new record on the ODinW zero-shot benchmark with a mean of 26.1 AP. Code will be available at https://github.com/IDEA-Research/GroundingDINO.
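The language-guided query selection the abstract mentions can be illustrated with a toy sketch. This is not the authors' implementation: the function name, the use of plain dot-product similarity, and the toy feature vectors are all assumptions for illustration. The idea it shows is scoring each image token by its best alignment with any text token and keeping the top-k image tokens to initialize decoder queries.

```python
# Hypothetical sketch of language-guided query selection (illustrative only,
# not the paper's actual code or feature dimensions).

def dot(a, b):
    # Dot product of two equal-length feature vectors.
    return sum(x * y for x, y in zip(a, b))

def language_guided_query_selection(image_feats, text_feats, num_queries):
    # Score each image token by its maximum similarity to any text token,
    # so tokens relevant to the text prompt rank highest.
    scores = [max(dot(img, txt) for txt in text_feats) for img in image_feats]
    # Return the indices of the top-scoring image tokens; in the detector
    # these would seed the cross-modality decoder's queries.
    ranked = sorted(range(len(image_feats)), key=lambda i: -scores[i])
    return ranked[:num_queries]

# Toy example: three image tokens, one text token aligned with token 1.
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
text_feats = [[0.0, 1.0]]
print(language_guided_query_selection(image_feats, text_feats, 2))  # → [1, 2]
```

The point of selecting queries this way, rather than using fixed learned queries, is that the decoder starts from image regions already deemed relevant to the text prompt.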