SpatialVID
AI & ML interests
None defined yet.
Recent Activity
SpatialVID: A Large Scale Video Dataset with Spatial Annotations
Abstract
Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for dynamic scenes with realistic camera motion. To address this gap, we collect a large corpus of raw video with natural camera movement, providing the foundation for constructing a dataset with unique scale and diversity. In this work, we introduce SpatialVID, a large-scale dynamic spatial dataset explicitly designed to provide expressive annotations for this purpose. Through a hierarchical filtering pipeline, we process more than 21,000 hours of collected raw video into 2.7 million clips, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and labels for camera motion and scene composition.
Demonstration
![]() |
![]() |
![]() |
Shot Immersion: The camera moves smoothly forward through a shadowed alley, its path tilting slightly downward as it weaves between weathered stone walls. As it emerges into a sun-drenched courtyard, the scene unfolds—stone steps lined with blooming flowers, soft light dancing on ancient stonework, and a quiet, timeless charm enveloping the space. |
![]() |
![]() |
![]() |
Shot Immersion: The camera drifts forward through the airy, sun-drenched room, gliding past sleek sofas and a polished coffee table. As it moves left, the expansive space unfolds, highlighting the elegant design and warm ambiance of the luxurious living area. |
![]() |
![]() |
![]() |
Shot Immersion: The camera surges forward through the air, descending along the rocky ridge as mist curls below. A gentle shift to the right reveals the sheer drop of the valley, the soft light casting long shadows across the barren slopes, capturing the raw beauty of the untamed landscape. |
![]() |
![]() |
![]() |
Shot Immersion: The camera smoothly drifts right along a rain-slicked street, its path illuminated by glowing neon signs. Pedestrians with umbrellas move past brightly lit shops, their reflections shimmering on the wet pavement as the scene pulses with quiet urban energy. |
Dataset Statistics
Curation Pipeline
For more details about the dataset curation pipeline, please refer to our GitHub Code.
License of SpatialVID
SpatialVID is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC-BY-NC-SA-4.0). Users must attribute the original source, use the resource only for non-commercial purposes, and release any modified/derived works under the same license. For the full license text, visit https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode.
Citation
If you find this project useful for your research, please cite our paper.