
VideoLLaMA-7B-ActivityNet-VTune Model

Model details

We trained VideoLLaMA with VTune, an instruction-tuning method we developed specifically to account for consistency in temporal comprehension.

For tuning, we used 10K training videos from ActivityNet-Captions with 205K automatically generated annotations.
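
To download the checkpoint files, one option is the standard huggingface_hub client. This is a minimal sketch: the repository has no library tag, so the weights are loaded through the accompanying codebase rather than a one-line from_pretrained call.

```python
from huggingface_hub import snapshot_download

# Download all files from the model repository to a local cache directory.
# The repo id below is taken from this model card.
local_dir = snapshot_download(repo_id="mjjung/VideoLLaMA-7B-ActivityNet-VTune")
print(local_dir)  # local path containing the checkpoint files
```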

Evaluation

We evaluated the model on ActivityNet-CON and ActivityNet-Captions.

  • ActivityNet-CON

    | Metric   | Value       |
    | -------- | ----------- |
    | Ground   | 33.0        |
    | R-Ground | 24.7 (74.8) |
    | S-Ground | 10.0 (30.2) |
    | H-Verify | 20.2 (61.1) |
    | C-Verify | 17.7 (53.7) |
  • ActivityNet-Captions

    | Metric      | Value |
    | ----------- | ----- |
    | R@1 IoU=0.3 | 51.58 |
    | R@1 IoU=0.5 | 34.38 |
    | R@1 IoU=0.7 | 19.18 |
    | mIoU        | 36.16 |
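
For reference, the ActivityNet-Captions numbers are standard temporal grounding metrics: R@1 at an IoU threshold is the fraction of queries whose top-1 predicted segment overlaps the ground-truth segment with at least that IoU, and mIoU averages the top-1 IoU over all queries. Below is a minimal sketch of how these metrics are computed (segments as (start, end) pairs in seconds); the function names and toy data are illustrative, not the exact evaluation script.

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two time spans (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])  # hull == union when spans overlap
    return inter / union if union > 0 else 0.0

def recall_at_1(preds, gts, threshold):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    return sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts)) / len(gts)

def mean_iou(preds, gts):
    """Average top-1 IoU over all queries (mIoU)."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(gts)

# Toy example: one top-1 predicted segment per query vs. its ground truth.
preds = [(5.0, 20.0), (0.0, 10.0)]
gts = [(6.0, 22.0), (30.0, 40.0)]
for t in (0.3, 0.5, 0.7):
    print(f"R@1 IoU={t}: {recall_at_1(preds, gts, t):.2f}")
print(f"mIoU: {mean_iou(preds, gts):.2f}")
```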

For more information, see our Paper and Code.

Citation

If you find our research and code useful, please consider starring our repository and citing our paper:

@article{jung2024consistency,
  title={On the Consistency of Video Large Language Models in Temporal Comprehension},
  author={Jung, Minjoon and Xiao, Junbin and Zhang, Byoung-Tak and Yao, Angela},
  journal={arXiv preprint arXiv:2411.12951},
  year={2024}
}