arxiv:2306.06289

SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers

Published on Jun 9, 2023

Abstract

This paper investigates the capability of plain Vision Transformers (ViTs) for semantic segmentation within the encoder-decoder framework and introduces SegViTv2. We introduce a novel Attention-to-Mask (ATM) module to build a lightweight decoder that is effective for plain ViTs: ATM converts the global attention map into semantic masks, yielding high-quality segmentation results. Our decoder outperforms the popular UPerNet decoder across various ViT backbones while consuming only about 5% of its computational cost. For the encoder, we address the relatively high computational cost of ViT-based encoders by proposing a Shrunk++ structure that incorporates edge-aware query-based downsampling (EQD) and query-based upsampling (QU) modules. The Shrunk++ structure reduces the encoder's computational cost by up to 50% while maintaining competitive performance. Furthermore, we adapt SegViT to continual semantic segmentation, demonstrating nearly zero forgetting of previously learned knowledge. Experiments show that SegViTv2 surpasses recent segmentation methods on three popular benchmarks: ADE20K, COCO-Stuff-10K, and PASCAL-Context. The code is available at https://github.com/zbwxp/SegVit.
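To make the Attention-to-Mask idea concrete, below is a minimal PyTorch sketch of the core mechanism the abstract describes: learnable class queries cross-attend to ViT patch tokens, and the resulting attention map is reused directly as per-class segmentation masks. This is an illustration under assumptions, not the paper's implementation: the module name `ATMSketch`, the single-head attention, and the dimensions (`embed_dim`, `num_classes`, the patch grid) are placeholders; the authors' actual code is in the linked repository.

```python
# Minimal, illustrative sketch of the Attention-to-Mask (ATM) idea.
# NOT the authors' implementation; all names and sizes are assumptions.
import torch
import torch.nn as nn

class ATMSketch(nn.Module):
    def __init__(self, embed_dim: int = 256, num_classes: int = 150):
        super().__init__()
        # One learnable query per semantic class.
        self.class_queries = nn.Parameter(torch.randn(num_classes, embed_dim))
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.scale = embed_dim ** -0.5

    def forward(self, patch_tokens: torch.Tensor, hw: tuple) -> torch.Tensor:
        # patch_tokens: (B, N, C) plain-ViT features, N = H * W patches.
        B, N, C = patch_tokens.shape
        q = self.q_proj(self.class_queries).unsqueeze(0).expand(B, -1, -1)  # (B, K, C)
        k = self.k_proj(patch_tokens)                                       # (B, N, C)
        # Cross-attention similarities between class queries and patches.
        attn = (q @ k.transpose(1, 2)) * self.scale                         # (B, K, N)
        # Key step: the attention map itself becomes the mask logits;
        # a sigmoid turns each query's similarity map into a soft mask.
        masks = attn.sigmoid().reshape(B, -1, *hw)                          # (B, K, H, W)
        return masks

# Usage: a 32x32 patch grid of 256-d tokens -> per-class soft masks.
tokens = torch.randn(2, 32 * 32, 256)
masks = ATMSketch()(tokens, (32, 32))
print(masks.shape)  # torch.Size([2, 150, 32, 32])
```

Because the masks fall out of attention that the decoder computes anyway, the decoder adds little on top of the backbone, which is consistent with the roughly 5% computational cost reported relative to UPerNet.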
