Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios
Abstract
Due to the complex attention mechanisms and model design, most existing vision Transformers (ViTs) cannot perform as efficiently as convolutional neural networks (CNNs) in realistic industrial deployment scenarios, e.g., TensorRT and CoreML. This poses a distinct challenge: can a visual neural network be designed to infer as fast as CNNs and perform as powerfully as ViTs? Recent works have tried to design CNN-Transformer hybrid architectures to address this issue, yet their overall performance is far from satisfactory. To this end, we propose a next-generation vision Transformer for efficient deployment in realistic industrial scenarios, namely Next-ViT, which dominates both CNNs and ViTs from the perspective of the latency/accuracy trade-off. In this work, the Next Convolution Block (NCB) and Next Transformer Block (NTB) are developed to capture local and global information, respectively, with deployment-friendly mechanisms. Then, the Next Hybrid Strategy (NHS) is designed to stack NCB and NTB in an efficient hybrid paradigm, which boosts performance on various downstream tasks. Extensive experiments show that Next-ViT significantly outperforms existing CNNs, ViTs, and CNN-Transformer hybrid architectures with respect to the latency/accuracy trade-off across various vision tasks. On TensorRT, Next-ViT surpasses ResNet by 5.5 mAP (from 40.4 to 45.9) on COCO detection and 7.7% mIoU (from 38.8% to 46.5%) on ADE20K segmentation under similar latency. Meanwhile, it achieves comparable performance to CSWin, while the inference speed is accelerated by 3.6x. On CoreML, Next-ViT surpasses EfficientFormer by 4.6 mAP (from 42.6 to 47.2) on COCO detection and 3.5% mIoU (from 45.1% to 48.6%) on ADE20K segmentation under similar latency. Our code and models are publicly available at: https://github.com/bytedance/Next-ViT
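The abstract only names the building blocks, so the following is a minimal, hypothetical PyTorch sketch of the hybrid stacking idea, assuming each stage runs several NCB-style convolution blocks followed by one NTB-style attention block. The class names (NCBSketch, NTBSketch, HybridStageSketch) and all block internals are illustrative stand-ins, not the authors' implementation; see the official repository at https://github.com/bytedance/Next-ViT for the real code.

```python
# Hypothetical sketch of a CNN-Transformer hybrid stage (not the authors' code).
import torch
import torch.nn as nn


class NCBSketch(nn.Module):
    """Stand-in for the Next Convolution Block: local features via convolution."""

    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm = nn.BatchNorm2d(dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim * 2, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim * 2, dim, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.norm(self.dwconv(x))   # local token mixing
        return x + self.mlp(x)              # channel mixing


class NTBSketch(nn.Module):
    """Stand-in for the Next Transformer Block: global features via self-attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)               # (B, H*W, C)
        normed = self.norm(seq)
        seq = seq + self.attn(normed, normed, normed)[0]  # global token mixing
        return seq.transpose(1, 2).reshape(b, c, h, w)


class HybridStageSketch(nn.Module):
    """One stage of the assumed hybrid pattern: several NCBs, then one NTB."""

    def __init__(self, dim: int, num_ncb: int):
        super().__init__()
        blocks = [NCBSketch(dim) for _ in range(num_ncb)]
        blocks.append(NTBSketch(dim))
        self.blocks = nn.Sequential(*blocks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.blocks(x)


if __name__ == "__main__":
    stage = HybridStageSketch(dim=64, num_ncb=3)
    out = stage(torch.randn(1, 64, 28, 28))
    print(out.shape)  # torch.Size([1, 64, 28, 28])
```

The intent of such a pattern is that cheap convolution blocks handle most of the local feature extraction, while a single attention block per stage injects global context at a deployment-friendly cost.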