Abstract
A feed-forward multi-view 3D point tracker, combining a transformer with k-nearest-neighbors correlation, achieves robust tracking with far fewer cameras and without the per-sequence optimization that existing methods require.
We introduce the first data-driven multi-view 3D point tracker, designed to track arbitrary points in dynamic scenes using multiple camera views. Unlike existing monocular trackers, which struggle with depth ambiguities and occlusion, or prior multi-camera methods that require over 20 cameras and tedious per-sequence optimization, our feed-forward model directly predicts 3D correspondences using a practical number of cameras (e.g., four), enabling robust and accurate online tracking. Given known camera poses and either sensor-based or estimated multi-view depth, our tracker fuses multi-view features into a unified point cloud and applies k-nearest-neighbors correlation alongside a transformer-based update to reliably estimate long-range 3D correspondences, even under occlusion. We train on 5K synthetic multi-view Kubric sequences and evaluate on two real-world benchmarks: Panoptic Studio and DexYCB, achieving median trajectory errors of 3.1 cm and 2.0 cm, respectively. Our method generalizes well to diverse camera setups of 1-8 views with varying vantage points and video lengths of 24-150 frames. By releasing our tracker alongside training and evaluation datasets, we aim to set a new standard for multi-view 3D tracking research and provide a practical tool for real-world applications. Project page available at https://ethz-vlg.github.io/mvtracker.
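The abstract describes fusing multi-view features into a unified point cloud and applying k-nearest-neighbors correlation before a transformer-based update. The sketch below illustrates just the kNN-correlation idea on a fused cloud, with made-up shapes and random data; it is a conceptual sketch under those assumptions, not the authors' released implementation (see the project page for that).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: per-view depth maps back-projected into a shared world
# frame (camera poses are assumed known, as in the paper). Here we fake four
# views, each contributing points with a per-point feature vector.
num_views, pts_per_view, feat_dim = 4, 256, 32
points = rng.normal(size=(num_views * pts_per_view, 3))   # fused point cloud
feats = rng.normal(size=(num_views * pts_per_view, feat_dim))

def knn_correlation(query_xyz, query_feat, points, feats, k=16):
    """For one tracked query point, gather its k nearest neighbors in the
    fused cloud and correlate features: dot products between the query's
    feature and each neighbor's, plus relative 3D offsets. A transformer
    update (not shown) would consume such correlation vectors to refine
    the estimated 3D track over time."""
    d2 = np.sum((points - query_xyz) ** 2, axis=1)  # squared distances
    idx = np.argpartition(d2, k)[:k]                # k nearest (unordered)
    offsets = points[idx] - query_xyz               # (k, 3) relative geometry
    corr = feats[idx] @ query_feat                  # (k,) feature similarity
    return idx, offsets, corr

query_xyz = rng.normal(size=3)
query_feat = rng.normal(size=feat_dim)
idx, offsets, corr = knn_correlation(query_xyz, query_feat, points, feats)
print(idx.shape, offsets.shape, corr.shape)  # (16,) (16, 3) (16,)
```

Because the correlation is computed in 3D rather than per-image, neighbors from any camera contribute equally, which is what lets a small, varying number of views (e.g., four) still constrain the track under occlusion.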
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- SpatialTrackerV2: 3D Point Tracking Made Easy (2025)
- DELTAv2: Accelerating Dense 3D Tracking (2025)
- Pseudo Depth Meets Gaussian: A Feed-forward RGB SLAM Baseline (2025)
- MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion (2025)
- MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second (2025)
- Outdoor Monocular SLAM with Global Scale-Consistent 3D Gaussian Pointmaps (2025)
- Surf3R: Rapid Surface Reconstruction from Sparse RGB Views in Seconds (2025)