	---
	datasets: none
	library_name: transformers
	license: apache-2.0
	language:
	- en
	base_model:
	- Qwen/Qwen2.5-VL-7B-Instruct
	pipeline_tag: video-text-to-text
	tags:
	- Video-LLMs
	- Long-Reasoning-Video-Model
	- Video-R1
	---
	<a href="https://arxiv.org/abs/2505.19000" target="_blank">
	<img alt="arXiv" src="https://img.shields.io/badge/arXiv-VerIPO-red?logo=arxiv" height="20" />
	</a>
	<a href="https://github.com/HITsz-TMG/VerIPO" style="display: inline-block; margin-right: 10px;">
	<img alt="GitHub Code" src="https://img.shields.io/badge/Code-VerIPO-white?&logo=github&logoColor=white" />
	</a>

	# VerIPO: Long Reasoning Video-R1 Model with Iterative Policy Optimization

	VerIPO is a long-reasoning Video-LLM fine-tuned from [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL) (base checkpoint: `Qwen/Qwen2.5-VL-7B-Instruct`).
	It was trained with [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF).

	## Quick start

	```python
	import torch
	from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
	from qwen_vl_utils import process_vision_info

	# Load the model in bfloat16; flash_attention_2 requires the flash-attn package.
	model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
	    "Uni-MoE/VerIPO-7B-v1.0",
	    torch_dtype=torch.bfloat16,
	    attn_implementation="flash_attention_2",
	    device_map="auto",
	)
	processor = AutoProcessor.from_pretrained("Uni-MoE/VerIPO-7B-v1.0")

	# One user turn containing a video and a text prompt.
	messages = [
	    {
	        "role": "user",
	        "content": [
	            {
	                "type": "video",
	                "video": "file:///path/to/video1.mp4",
	                "max_pixels": 128 * 28 * 28,
	                "max_frames": 128,
	                "fps": 2.0,
	            },
	            {"type": "text", "text": "Describe this video."},
	        ],
	    }
	]

	# Build the chat prompt and extract the sampled video frames.
	text = processor.apply_chat_template(
	    messages, tokenize=False, add_generation_prompt=True
	)
	image_inputs, video_inputs, video_kwargs = process_vision_info(
	    messages, return_video_kwargs=True
	)
	inputs = processor(
	    text=[text],
	    images=image_inputs,
	    videos=video_inputs,
	    padding=True,
	    return_tensors="pt",
	    **video_kwargs,
	)
	inputs = inputs.to(model.device)

	# Inference: generate the output, then strip the prompt tokens before decoding.
	generated_ids = model.generate(
	    **inputs, max_new_tokens=4096, temperature=1e-6, repetition_penalty=1.05
	)
	generated_ids_trimmed = [
	    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
	]
	output_text = processor.batch_decode(
	    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
	)
	print(output_text)
	```
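
The `fps` and `max_frames` settings in the video entry above jointly bound how many frames reach the model, which in turn drives memory use and latency. The sketch below gives a rough estimate of that frame count for a clip of a given length. The helper `estimate_sampled_frames` is hypothetical (it is not part of `qwen_vl_utils`), and the rounding down to a multiple of 2 is an assumption about how the library groups frames; treat the result as an approximation rather than the library's exact sampling logic.

```python
import math


def estimate_sampled_frames(
    duration_s: float,
    fps: float = 2.0,
    max_frames: int = 128,
    frame_factor: int = 2,  # assumption: frame counts are kept to a multiple of 2
) -> int:
    """Roughly estimate how many frames are sampled from a clip."""
    wanted = duration_s * fps          # frames requested by the sampling rate
    capped = min(wanted, max_frames)   # never exceed the max_frames cap
    # Round down to a multiple of frame_factor, keeping at least one group.
    return max(frame_factor, math.floor(capped / frame_factor) * frame_factor)


for duration in (10, 60, 300):
    print(f"{duration:>4} s -> {estimate_sampled_frames(duration)} frames")
```

Under these assumptions, clips longer than about 64 seconds hit the 128-frame cap at `fps=2.0`, so raising `max_frames` (or lowering `fps`) is the lever for covering longer videos.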

	## Experimental Results

	| Model | Params | VSI-Bench | Video-MMMU | MMVU (mc) | TOMATO | LVBench | Video-MME (w/o sub) |
	|---|---|---|---|---|---|---|---|
	| GPT-4o [64] | - | 34.0 | 61.2 | - | 37.7 | 48.9 | 71.9 |
	| Gemini 1.5 Pro [59] | - | 45.4 | 53.8 | - | 36.1 | 33.1 | 75.0 |
	| mPLUG-Owl3 [83] | 7B | - | 42.0 | - | - | 43.5 | 53.5 |
	| LongVA [89] | 7B | 29.2 | 23.9 | - | - | - | 52.6 |
	| LLaVA-Video [91] | 7B | 35.6 | 36.1 | - | - | - | 63.3 |
	| LLaVA-OneVision [24] | 7B | 32.4 | 33.8 | 49.2 | - | - | 58.2 |
	| VideoLLaMA2 [9] | 7B | - | - | 44.8 | - | - | 47.9 |
	| VideoLLaMA3 [86] | 7B | - | 47.0 | - | - | 45.3 | 66.2 |
	| VILA-1.5 [33] | 8B | 28.9 | 20.8 | - | - | - | - |
	| VILA-1.5 [33] | 40B | 31.2 | 34.0 | - | - | - | 60.1 |
	| InternVL2 [63] | 8B | 34.6 | 37.4 | 39.0 | 21.7 | - | 54.0 |
	| InternVL2 [63] | 40B | 36.0 | - | - | 29.0 | 39.6 | 61.2 |
	| InternVL2.5 [8] | 8B | - | - | - | - | - | 64.2 |
	| InternVL2.5 [8] | 26B | - | - | - | - | - | 66.9 |
	| InternVideo2.5 [70] | 8B | - | 43.0 | - | - | 46.4 | 65.1 |
	| Llama-3.2-Vision [62] | 11B | 20.6 | 41.8 | - | 21.5 | - | 46.0 |
	| Gemma-3-IT [60] | 12B | 32.4 | _57.2_ | - | 28.1 | - | 58.2 |
	| Kimi-VL [61] | 16B (A3B) | 37.4 | 52.6 | - | _31.7_ | - | 67.8 |
	| DeepSeek-VL2 [77] | 28B (A4B) | 21.7 | - | - | 27.2 | - | - |
	| Qwen2.5-VL [2] | 7B | 37.5 | 54.3 | 67.2 | 29.3 | 42.8 | 66.2 |
	| TinyLLaVA-Video-R1 [90] | 3B | - | - | 46.9 | - | - | 46.6 |
	| Qwen2.5-VL (thinking) [2] | 7B | _23.8_ | 46.8 | 63.0 | 25.8 | _35.2_ | 60.4 |
	| Video-R1 [18] | 7B | 35.8 | _52.3_ | 64.3 | - | - | 59.3 |
	| Kimi-VL-Thinking [61] | 16B (A3B) | 32.2 | - | 56.8 | 20.6 | 30.0 | - |
	| VerIPO (Iteration 1) | 7B | 41.8 | 56.2 | 65.9 | _31.6_ | _41.5_ | 67.2 |
	| VerIPO (Iteration 2) | 7B | 41.0 | 57.9 | 66.9 | 31.5 | 41.7 | 67.6 |
	| VerIPO (Iteration 3) | 7B | _41.3_ | 56.8 | _66.7_ | 32.2 | 41.7 | 67.2 |
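
To make the comparison with the base model easier to read, the short sketch below computes the per-benchmark difference between the VerIPO Iteration 3 row and the Qwen2.5-VL 7B row. The scores are copied from the table above; the variable names are illustrative only.

```python
# Scores copied from the results table, in column order.
BENCHMARKS = [
    "VSI-Bench", "Video-MMMU", "MMVU (mc)",
    "TOMATO", "LVBench", "Video-MME (w/o sub)",
]
qwen25_vl_7b = [37.5, 54.3, 67.2, 29.3, 42.8, 66.2]
veripo_iter3 = [41.3, 56.8, 66.7, 32.2, 41.7, 67.2]

# Positive values mean VerIPO (Iteration 3) scores higher than the base model.
deltas = {
    name: round(ours - base, 1)
    for name, ours, base in zip(BENCHMARKS, veripo_iter3, qwen25_vl_7b)
}

for name, delta in deltas.items():
    print(f"{name:<22}{delta:+.1f}")
```

On these numbers, Iteration 3 is ahead of Qwen2.5-VL 7B on four of the six benchmarks and slightly behind on MMVU (mc) and LVBench.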


	## Citation

	```bibtex
	@article{li2025veripo,
	title={VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization},
	author={Li, Yunxin and Chen, Xinyu and Li, Zitao and Liu, Zhenyu and Wang, Longyue and Luo, Wenhan and Hu, Baotian and Zhang, Min},
	journal={arXiv preprint arXiv:2505.19000},
	year={2025}
	}
	```