CostFormer: Cost Transformer for Cost Aggregation in Multi-view Stereo
Abstract
The core of Multi-view Stereo (MVS) is the matching process between reference and source pixels. Cost aggregation plays a significant role in this process, and previous methods handle it with CNNs. This may inherit the natural limitation of CNNs, which fail to discriminate repetitive or incorrect matches due to their limited local receptive fields. To handle this issue, we aim to bring the Transformer into cost aggregation. However, another problem may then occur: the computational complexity of the Transformer grows quadratically, causing memory overflow and inference latency. In this paper, we overcome these limits with an efficient Transformer-based cost aggregation network, namely CostFormer. The Residual Depth-Aware Cost Transformer (RDACT) is proposed to aggregate long-range features on the cost volume via self-attention mechanisms along the depth and spatial dimensions. Furthermore, the Residual Regression Transformer (RRT) is proposed to enhance spatial attention. The proposed method is a universal plug-in that improves learning-based MVS methods.
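To make the depth-then-spatial attention idea behind RDACT concrete, below is a minimal sketch of residual self-attention applied first along the depth axis and then along the spatial axes of a cost volume. The tensor layout, channel count, and module names are illustrative assumptions, not the paper's code, and the paper's efficiency techniques for avoiding quadratic cost (e.g., windowed spatial attention) are omitted here for clarity.

```python
# Hedged sketch: depth-wise then spatial self-attention over a cost volume,
# illustrating the attention pattern described in the abstract. Shapes and
# names are hypothetical; plain full attention is used instead of the
# efficient variants an actual MVS network would need.
import torch
import torch.nn as nn

class DepthSpatialAttention(nn.Module):
    """Residual self-attention along depth, then along space,
    on a cost volume shaped [B, D, H, W, C]."""
    def __init__(self, channels: int = 8, heads: int = 1):
        super().__init__()
        self.depth_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, cost: torch.Tensor) -> torch.Tensor:
        b, d, h, w, c = cost.shape
        # Depth attention: each pixel attends across its D depth hypotheses.
        x = cost.permute(0, 2, 3, 1, 4).reshape(b * h * w, d, c)
        x = x + self.depth_attn(x, x, x, need_weights=False)[0]  # residual
        # Spatial attention: each depth slice attends across pixel positions.
        x = x.reshape(b, h * w, d, c).permute(0, 2, 1, 3).reshape(b * d, h * w, c)
        x = x + self.spatial_attn(x, x, x, need_weights=False)[0]  # residual
        return x.reshape(b, d, h, w, c)

# Usage: a toy cost volume with 48 depth hypotheses at 32x40 resolution.
vol = torch.randn(1, 48, 32, 40, 8)
out = DepthSpatialAttention(channels=8)(vol)
assert out.shape == vol.shape
```

Keeping both attention passes residual lets such a block be dropped into an existing cost-volume pipeline without disturbing its initial behavior, which is consistent with the abstract's framing of CostFormer as a universal plug-in.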