Safetensors
llava
k-m-irfan committed on
Commit 614053a · verified · 1 Parent(s): cd071c7

Update README.md

Files changed (1)
  1. README.md +30 -3
README.md CHANGED
@@ -1,8 +1,28 @@
  ---
  license: cc-by-sa-4.0
  ---

- # Model Card: VIMUL

  ## Requires

@@ -102,7 +122,14 @@ if __name__ == "__main__":
  ```

  ## Citation
-
  ```
-
  ```
 
  ---
  license: cc-by-sa-4.0
  ---
+ # ViMUL: A Culturally-diverse Multilingual Multimodal Video Model

+ [![🤗 Hugging Face](https://img.shields.io/badge/🤗%20Hugging%20Face-Model-blue)](https://huggingface.co/MBZUAI/ViMUL)
+ [![📄 Paper](https://img.shields.io/badge/📄-Paper-red)](https://huggingface.co/papers/2506.07032)
+ [![🌐 Project Page](https://img.shields.io/badge/🌐-Project%20Page-green)](https://mbzuai-oryx.github.io/ViMUL/)
+ [![📊 Benchmark](https://img.shields.io/badge/📊-ViMUL--Bench-orange)](https://huggingface.co/datasets/MBZUAI/ViMUL-Bench)
+
+ ## Overview
+ ViMUL is a multilingual video Large Multimodal Model (LMM) designed to provide better tradeoffs between high and low-resource languages for video understanding. The model is trained on a machine-translated multilingual video training set comprising 1.2 million samples and demonstrates improved performance across culturally diverse video content in multiple languages.
+
+ ## Key Features
+ - **🌍 Multilingual Support:** Optimized for 14 languages including both high and low-resource languages
+ - **🎥 Video Understanding:** Specialized for multimodal video analysis and description
+ - **🎭 Cultural Awareness:** Enhanced understanding of culturally diverse content
+ - **⚖️ Balanced Performance:** Better tradeoff between high and low-resource language performance
+
+ ## Model Details
+ - **Base Architecture:** LLaVA-NeXT with Qwen backbone
+ - **Training Data:** 1.2M machine-translated multilingual video samples
+ - **Supported Languages:** English, Chinese, Spanish, French, German, Hindi, Arabic, Russian, Bengali, Urdu, Sinhala, Tamil, Swedish, Japanese
+ - **Input Modalities:** Video | Image | Text
+ - **Output:** Text descriptions and analysis

  ## Requires

 
  ```

  ## Citation
  ```
+ @misc{shafique2025culturallydiversemultilingualmultimodalvideo,
+ title={A Culturally-diverse Multilingual Multimodal Video Benchmark & Model},
+ author={Bhuiyan Sanjid Shafique and Ashmal Vayani and Muhammad Maaz and Hanoona Abdul Rasheed and Dinura Dissanayake and Mohammed Irfan Kurpath and Yahya Hmaiti and Go Inoue and Jean Lahoud and Md. Safirur Rashid and Shadid Intisar Quasem and Maheen Fatima and Franco Vidal and Mykola Maslych and Ketan Pravin More and Sanoojan Baliah and Hasindri Watawana and Yuhao Li and Fabian Farestam and Leon Schaller and Roman Tymtsiv and Simon Weber and Hisham Cholakkal and Ivan Laptev and Shin'ichi Satoh and Michael Felsberg and Mubarak Shah and Salman Khan and Fahad Shahbaz Khan},
+ year={2025},
+ eprint={2506.07032},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL},
+ url={https://arxiv.org/abs/2506.07032},
+ }
  ```