Organization Card

SpatialVID: A Large Scale Video Dataset with Spatial Annotations

Jiahao Wang¹ Yufeng Yuan¹ Rujie Zheng¹ Youtian Lin¹ Yi Zhang¹ Yajie Bao¹ Lin-Zhuo Chen¹

Yanxi Zhou¹ Xiaoxiao Long¹ Hao Zhu¹ Zhaoxiang Zhang² Xun Cao¹ Yao Yao¹†

¹Nanjing University ²Institute of Automation, Chinese Academy of Sciences

Abstract

Significant progress has been made in spatial intelligence, spanning both spatial reconstruction and world exploration. However, the scalability and real-world fidelity of current models remain severely constrained by the scarcity of large-scale, high-quality training data. While several datasets provide camera pose information, they are typically limited in scale, diversity, and annotation richness, particularly for dynamic scenes with realistic camera motion. To address this gap, we collect a large corpus of raw video with natural camera movement, providing the foundation for constructing a dataset with unique scale and diversity. In this work, we introduce SpatialVID, a large-scale dynamic spatial dataset explicitly designed to provide expressive annotations for this purpose. Through a hierarchical filtering pipeline, we process more than 21,000 hours of collected raw video into 2.7 million clips, totaling 7,089 hours of dynamic content. A subsequent annotation pipeline enriches these clips with detailed spatial and semantic information, including camera poses, depth maps, dynamic masks, structured captions, and labels for camera motion and scene composition.
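As a rough sanity check on the figures quoted in the abstract, the short sketch below derives the implied mean clip length and the fraction of raw footage that survives filtering. It uses only the three numbers stated above; the constant and function names are illustrative and not part of any SpatialVID tooling. Because the abstract says "more than 21,000 hours" and rounds the clip count to 2.7 million, both results are approximations, and the retained fraction is an upper bound.

```python
# Figures taken from the abstract; treated as approximate.
RAW_HOURS = 21_000        # "more than 21,000 hours" of raw video (lower bound)
CLIP_HOURS = 7_089        # total duration of the filtered clips
NUM_CLIPS = 2_700_000     # "2.7 million clips" (rounded)


def mean_clip_seconds(total_hours: float, num_clips: int) -> float:
    """Average clip duration in seconds, given total hours and clip count."""
    return total_hours * 3600 / num_clips


def retention_ratio(kept_hours: float, raw_hours: float) -> float:
    """Fraction of raw footage (by duration) kept after filtering."""
    return kept_hours / raw_hours


avg = mean_clip_seconds(CLIP_HOURS, NUM_CLIPS)
ret = retention_ratio(CLIP_HOURS, RAW_HOURS)

print(f"mean clip length: about {avg:.1f} s")
print(f"retained fraction: at most about {ret:.1%}")
```

Under these assumptions, the clips average a little under ten seconds each, and roughly one third of the collected footage passes the hierarchical filtering.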

Demonstration

Shot Immersion: The camera moves smoothly forward through a shadowed alley, its path tilting slightly downward as it weaves between weathered stone walls. As it emerges into a sun-drenched courtyard, the scene unfolds—stone steps lined with blooming flowers, soft light dancing on ancient stonework, and a quiet, timeless charm enveloping the space.
Shot Immersion: The camera drifts forward through the airy, sun-drenched room, gliding past sleek sofas and a polished coffee table. As it moves left, the expansive space unfolds, highlighting the elegant design and warm ambiance of the luxurious living area.
Shot Immersion: The camera surges forward through the air, descending along the rocky ridge as mist curls below. A gentle shift to the right reveals the sheer drop of the valley, the soft light casting long shadows across the barren slopes, capturing the raw beauty of the untamed landscape.
Shot Immersion: The camera smoothly drifts right along a rain-slicked street, its path illuminated by glowing neon signs. Pedestrians with umbrellas move past brightly lit shops, their reflections shimmering on the wet pavement as the scene pulses with quiet urban energy.

Dataset Statistics

Curation Pipeline

For more details about the dataset curation pipeline, please refer to our GitHub code repository.

License of SpatialVID

SpatialVID is released under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC-BY-NC-SA-4.0). Users must attribute the original source, use the resource only for non-commercial purposes, and release any modified/derived works under the same license. For the full license text, visit https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode.

Citation

If you find this project useful for your research, please cite our paper.

models 0

None public yet

datasets 1

SpatialVID/SpatialVID-HQ
