# Audio-Visual Instance Segmentation

[arXiv](https://arxiv.org/abs/2412.03069) | [Project Page](https://ruohaoguo.github.io/avis/) | [Dataset](https://1drv.ms/u/c/3c9af704fb61931d/EVOs609SGMxLsbvVzVJHAa4Bmnu4GVZGjqYHQxDz0NKTew?e=WQU2Uf)

Ruohao Guo, Xianghua Ying*, Yaru Chen, Dantong Niu, Guangyao Li, Liao Qu, Yanyu Qi, Jinxing Zhou, Bowei Xing, Wenzhen Yue, Ji Shi, Qixun Wang, Peiliang Zhang, Buwen Liang

## 📰 News

- 🔥 **2025.03.01**: Code and checkpoints are released!
- 🔥 **2025.02.27**: AVIS was accepted to **CVPR 2025**! 🎉🎉🎉
- 🔥 **2024.11.12**: Our [project page](https://ruohaoguo.github.io/avis/) is now available!
- 🔥 **2024.11.11**: The AVISeg dataset has been uploaded to [OneDrive](https://1drv.ms/u/c/3c9af704fb61931d/EVOs609SGMxLsbvVzVJHAa4Bmnu4GVZGjqYHQxDz0NKTew?e=WQU2Uf); welcome to download and use it!

## 🌿 Introduction

In this paper, we propose a new multi-modal task, termed audio-visual instance segmentation (AVIS), which aims to simultaneously identify, segment, and track individual sounding object instances in audible videos. To facilitate this research, we introduce a high-quality benchmark named AVISeg, containing over 90K instance masks from 26 semantic categories in 926 long videos. We also propose a strong baseline model for this task. The model first localizes sound sources within each frame and condenses object-specific contexts into concise tokens. It then builds long-range audio-visual dependencies between these tokens using window-based attention, and tracks sounding objects across the entire video sequence.
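To make the token-based design above concrete, here is a minimal, self-contained PyTorch sketch of window-based attention over per-frame object tokens. All shapes, hyper-parameters, and the class name `WindowedTokenAttention` are illustrative assumptions, not the paper's actual architecture or configuration, and the audio conditioning is omitted for brevity.

```python
import torch
import torch.nn as nn


class WindowedTokenAttention(nn.Module):
    """Toy sketch: self-attention over per-frame object tokens inside a temporal window.

    Shapes and hyper-parameters are assumptions for illustration only.
    """

    def __init__(self, dim: int = 256, num_heads: int = 8, window: int = 6):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (T, Q, C) = frames x object tokens per frame x channels, one video.
        T, Q, C = tokens.shape
        out = torch.zeros_like(tokens)
        for start in range(0, T, self.window):
            end = min(start + self.window, T)
            # Flatten the window's frame tokens into one sequence so attention
            # can relate object tokens across the frames in this window.
            win = tokens[start:end].reshape(1, -1, C)          # (1, w*Q, C)
            attended, _ = self.attn(win, win, win)
            out[start:end] = attended.reshape(end - start, Q, C)
        return out


# Usage with assumed sizes: 24 frames, 10 object tokens per frame, 256-dim features.
frames = torch.randn(24, 10, 256)
print(WindowedTokenAttention()(frames).shape)  # torch.Size([24, 10, 256])
```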
## 📦 Model Zoo

| Backbone | Pre-trained Datasets | FSLA | HOTA | mAP | Model Weight |
| --- | --- | --- | --- | --- | --- |
| ResNet-50 | ImageNet | 42.78 | 61.73 | 40.57 | AVISM_R50_IN.pth |
| ResNet-50 | ImageNet & COCO | 44.42 | 64.52 | 45.04 | AVISM_R50_COCO.pth |
| Swin-L | ImageNet | 49.15 | 68.81 | 49.06 | AVISM_SwinL_IN.pth |
| Swin-L | ImageNet & COCO | 52.49 | 71.13 | 53.46 | AVISM_SwinL_COCO.pth |
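After downloading one of the checkpoints listed above, you can quickly inspect it with plain PyTorch. This is a minimal sketch, assuming the file is saved locally; the path and the `"model"` state-dict key are assumptions (Detectron2-style checkpoints commonly use that key), not a guaranteed layout of these files.

```python
import torch

# Hypothetical local path to one of the released weight files listed above.
ckpt_path = "checkpoints/AVISM_SwinL_COCO.pth"
checkpoint = torch.load(ckpt_path, map_location="cpu")

# Assumption: Detectron2-style checkpoints often nest weights under "model";
# fall back to treating the whole object as the state dict otherwise.
state_dict = checkpoint.get("model", checkpoint) if isinstance(checkpoint, dict) else checkpoint

print(f"Loaded {len(state_dict)} parameter tensors from {ckpt_path}")
```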