VD-IT model

This is the pre-trained checkpoint for our paper, Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation.

We use a video diffusion model (ModelScopeT2V) as our base model, applying prompt tuning to adapt it as a visual backbone for downstream video understanding tasks.

Model training

We first pre-train our model on Ref-COCO and then fine-tune it on Ref-YouTube-VOS. Training uses two NVIDIA A100 GPUs, processes 5 frames per clip, and runs for 9 epochs. The initial learning rate is 5e-5 and is reduced by a factor of 10 at the 6th and 8th epochs.
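The learning-rate schedule described above can be sketched in a few lines of plain Python. This is an illustration only, not the training code from the paper: the function name `learning_rate` is hypothetical, and it assumes epochs are counted from 1 and that each reduction takes effect at the start of the named epoch.

```python
def learning_rate(epoch, base_lr=5e-5, milestones=(6, 8), gamma=0.1):
    """Return the step-decayed learning rate for a 1-indexed epoch.

    Assumption: the drop applies from the start of each milestone epoch.
    """
    # Count how many milestones have been reached by this epoch.
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# Learning rate for each of the 9 training epochs.
schedule = {epoch: learning_rate(epoch) for epoch in range(1, 10)}

for epoch, lr in schedule.items():
    print(f"epoch {epoch}: lr = {lr:.1e}")
```

Under these assumptions, epochs 1 to 5 run at 5e-5, epochs 6 and 7 at 5e-6, and epochs 8 and 9 at 5e-7. In a PyTorch training loop, the same shape is commonly obtained with `torch.optim.lr_scheduler.MultiStepLR`, keeping in mind that its milestones are counted in scheduler steps starting from 0.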
