ViTMatte: Boosting Image Matting with Pretrained Plain Vision Transformers
Abstract
Recently, plain vision Transformers (ViTs) have shown impressive performance on various computer vision tasks, thanks to their strong modeling capacity and large-scale pretraining. However, they have not yet conquered the problem of image matting. We hypothesize that image matting can also be boosted by ViTs and present ViTMatte, an efficient and robust ViT-based matting system. Our method utilizes (i) a hybrid attention mechanism combined with a convolutional neck to help ViTs achieve an excellent performance-computation trade-off in matting tasks, and (ii) a detail capture module consisting only of simple lightweight convolutions to complement the detailed information required by matting. To the best of our knowledge, ViTMatte is the first work to unleash the potential of ViTs for image matting with concise adaptation. It inherits many superior properties from ViTs, including various pretraining strategies, concise architecture design, and flexible inference strategies. We evaluate ViTMatte on Composition-1k and Distinctions-646, the most commonly used benchmarks for image matting, where our method achieves state-of-the-art performance and outperforms prior matting works by a large margin.
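The abstract describes two components: a hybrid attention mechanism with a convolutional neck on the ViT backbone, and a purely convolutional detail capture module whose features are fused with the ViT features to recover fine detail. The sketch below is a minimal, hypothetical PyTorch illustration of the second idea only (a lightweight convolution stream fused with coarse ViT features to predict an alpha matte); the class name, channel sizes, and fusion layout are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DetailCaptureSketch(nn.Module):
    """Hypothetical sketch of a detail capture module: a stack of lightweight
    convolutions extracts multi-scale detail features from the image + trimap
    and fuses them with upsampled ViT backbone features. All layer sizes are
    illustrative assumptions, not the paper's configuration."""

    def __init__(self, in_ch=4, vit_ch=384, detail_chs=(32, 64, 128)):
        super().__init__()
        chs = [in_ch, *detail_chs]
        # Lightweight detail stream: strided 3x3 convs at 1/2, 1/4, 1/8 scale.
        self.detail_convs = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(chs[i], chs[i + 1], 3, stride=2, padding=1),
                nn.BatchNorm2d(chs[i + 1]),
                nn.ReLU(inplace=True),
            )
            for i in range(len(detail_chs))
        )
        # Fusion blocks: upsample the coarse features and merge with the
        # matching-resolution detail features, coarsest to finest.
        fuse_in = [vit_ch + detail_chs[2], 256 + detail_chs[1], 128 + detail_chs[0]]
        fuse_out = [256, 128, 64]
        self.fuse_convs = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(ci, co, 3, padding=1),
                nn.BatchNorm2d(co),
                nn.ReLU(inplace=True),
            )
            for ci, co in zip(fuse_in, fuse_out)
        )
        self.head = nn.Conv2d(64, 1, 3, padding=1)  # predicts the alpha matte

    def forward(self, image_trimap, vit_feat):
        # Collect detail features at 1/2, 1/4, 1/8 of the input resolution.
        details = []
        x = image_trimap
        for conv in self.detail_convs:
            x = conv(x)
            details.append(x)
        # Fuse coarse-to-fine: upsample, concatenate, convolve.
        y = vit_feat  # assumed to be at 1/16 resolution
        for fuse, d in zip(self.fuse_convs, reversed(details)):
            y = F.interpolate(y, size=d.shape[-2:], mode="bilinear", align_corners=False)
            y = fuse(torch.cat([y, d], dim=1))
        y = F.interpolate(y, scale_factor=2, mode="bilinear", align_corners=False)
        return torch.sigmoid(self.head(y))


# Example usage with dummy tensors (RGB image + trimap channel, ViT features).
model = DetailCaptureSketch()
image_trimap = torch.randn(1, 4, 512, 512)
vit_feat = torch.randn(1, 384, 32, 32)
alpha = model(image_trimap, vit_feat)  # -> (1, 1, 512, 512) alpha matte
```

The point of this design, as the abstract frames it, is that the heavy pretrained ViT handles semantics at low resolution while a cheap convolutional stream supplies the high-frequency boundary detail that matting requires.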