VD-IT model
This is the pre-trained checkpoint for our paper Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation.
We use a video diffusion model (ModelScopeT2V) as our base model, applying prompt tuning to adapt it as a visual backbone for downstream video understanding tasks.
Model training
We first pre-train the model on Ref-COCO and then fine-tune it on Ref-YouTube-VOS. Training uses two NVIDIA A100 GPUs and processes 5 frames per clip for 9 epochs. The initial learning rate is 5e-5 and is reduced by a factor of 10 at the 6th and 8th epochs.
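The learning-rate schedule described above can be sketched in plain Python. This is a minimal illustration, not the training code used for this checkpoint: the function name `learning_rate` and its parameters are hypothetical, and it assumes epochs are counted from 1 and that each reduction takes effect at the start of the named epoch, which the description does not specify.

```python
def learning_rate(epoch, base_lr=5e-5, milestones=(6, 8), gamma=0.1):
    """Return the learning rate for a 1-indexed epoch under a step schedule.

    Assumption: the rate drops at the start of each milestone epoch.
    """
    # Count how many milestone epochs have been reached so far.
    drops = sum(1 for m in milestones if epoch >= m)
    # Each milestone multiplies the rate by gamma (a factor-of-10 reduction).
    return base_lr * gamma ** drops


# Learning rate for each of the 9 training epochs.
schedule = {epoch: learning_rate(epoch) for epoch in range(1, 10)}

for epoch, lr in schedule.items():
    print(f"epoch {epoch}: lr = {lr:.1e}")
```

Under these assumptions, epochs 1 to 5 run at the initial rate, epochs 6 and 7 at one tenth of it, and epochs 8 and 9 at one hundredth.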