Jeckmu/Qwen2-VL-2B-Instruct-GPTQ-Int4-lora-SurveillanceVideo-250210


	---
	library_name: peft
	license: other
	base_model: Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4
	tags:
	- llama-factory
	- lora
	- generated_from_trainer
	model-index:
	- name: Qwen2-VL-2B-Instruct-GPTQ-Int4-LoRA-SurveillanceVideo-Classification-250210
	results: []
	pipeline_tag: video-classification
	---


	# Qwen2-VL-2B-Instruct-GPTQ-Int4-LoRA-SurveillanceVideo-Classification-250210

	This model is a fine-tuned version of [Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4) on the Surveillance Video Classification dataset.

	## Model description

	This model takes a video as input and classifies it into one of the following six classes:
	[1. loitering, 2. breaking and entering, 3. abandonment, 4. falling down, 5. fighting, 6. arson]
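Because the model is prompted to answer with a single digit, the generated text has to be mapped back to a class name. The following is a minimal sketch of that step; the helper name `parse_prediction` and the choice to take the first digit between 1 and 6 are illustrative assumptions, not part of the released code.

```python
import re

# Class names copied from the list above, keyed by the digit the model outputs.
CLASSES = {
    1: "loitering",
    2: "breaking and entering",
    3: "abandonment",
    4: "falling down",
    5: "fighting",
    6: "arson",
}


def parse_prediction(generated_text):
    """Return (class_id, class_name) for the first digit 1-6 found, or None."""
    match = re.search(r"[1-6]", generated_text)
    if match is None:
        return None  # the model did not produce a valid class digit
    class_id = int(match.group())
    return class_id, CLASSES[class_id]
```

Returning `None` instead of raising lets the caller decide how to treat an output that contains no valid digit.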

	The model was trained with LLaMA-Factory using the hyperparameters listed below.

	## Intended uses & limitations

	This model was fine-tuned with the prompt below.
	Use the same prompt when running inference.
	```python
	# Placeholder: set this to the path of the video to classify.
	video_path = "path/to/video.mp4"

	messages = [
	    {
	        "role": "user",
	        "content": [
	            {
	                "type": "video",
	                "video": video_path,
	                "max_pixels": 640 * 360,
	                # "fps": 1.0  # maybe default fps = 1.0
	            },
	            {
	                "type": "text",
	                # The prompt text is kept exactly as it was used for fine-tuning.
	                "text": (
	                    "<video>\nWatch the video and choose the six behaviours that apply to you. "
	                    "[1. loitering, 2. breaking and entering, 3. abandonment, 4. falling down, 5. fighting, 6. arson]. "
	                    "Your answer must be a single digit, the number of the behaviour."
	                ),
	            },
	        ],
	    }
	]
	```
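A `messages` list like the one above could be passed to the model roughly as follows. This is a hedged sketch, not the author's inference script: it follows the usual Qwen2-VL usage pattern from `transformers`, `peft`, and `qwen_vl_utils`, assumes those packages plus a GPTQ backend are installed, and assumes the adapter repository id matches this model card's page. The function name `classify_video` and the `max_new_tokens` value are illustrative.

```python
BASE_ID = "Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4"
# Assumption: adapter repo id taken from this model card's page.
ADAPTER_ID = "Jeckmu/Qwen2-VL-2B-Instruct-GPTQ-Int4-lora-SurveillanceVideo-250210"


def classify_video(messages, base_id=BASE_ID, adapter_id=ADAPTER_ID, max_new_tokens=8):
    """Run one generation pass and return the decoded answer text."""
    # Imported lazily so the heavy dependencies are only needed when this runs.
    from peft import PeftModel
    from qwen_vl_utils import process_vision_info
    from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

    processor = AutoProcessor.from_pretrained(base_id)
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        base_id, torch_dtype="auto", device_map="auto"
    )
    # Attach the LoRA adapter on top of the quantized base model.
    model = PeftModel.from_pretrained(model, adapter_id)
    model.eval()

    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated answer is decoded.
    new_tokens = output_ids[:, inputs.input_ids.shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0].strip()
```

Since the expected answer is a single digit, a small `max_new_tokens` budget should be sufficient.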

	## Training and evaluation data

	The training data was sampled from the original video dataset so that the classes are balanced: 100 videos per class,
	except for class 6 (arson), for which only 65 videos were used.

	Each video was preprocessed at a resolution of 640x360 with the option fps=3.0.
	A 10-second segment in which the behavior occurs, located using the metadata, was cut out and used for training.
	This gives about 30 frames per video (10 s × 3 fps).
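The preprocessing described above could be reproduced along these lines. This is only a sketch under stated assumptions: the card does not say which tool was used, so `ffmpeg` is a stand-in, and `build_ffmpeg_command` and `start_seconds` (the segment start read from the metadata) are hypothetical names. The command only cuts and rescales; fps=3.0 is treated here as a frame-sampling option rather than a re-encoding step.

```python
FPS = 3.0            # frame-sampling rate stated in the card
CLIP_SECONDS = 10    # length of the segment that contains the behavior
WIDTH, HEIGHT = 640, 360


def build_ffmpeg_command(src, dst, start_seconds):
    """Build an ffmpeg argument list that cuts a 10 s clip and scales it to 640x360."""
    return [
        "ffmpeg", "-y",
        "-ss", str(start_seconds),   # seek to the start of the behavior
        "-t", str(CLIP_SECONDS),     # keep 10 seconds
        "-i", src,
        "-vf", f"scale={WIDTH}:{HEIGHT}",
        "-an",                       # drop audio, which the model does not use
        dst,
    ]


# Frames seen per clip at the stated sampling rate: 10 s * 3 fps.
frames_per_clip = round(FPS * CLIP_SECONDS)
```

The list can be passed to `subprocess.run(cmd, check=True)` once `ffmpeg` is available on the system.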

	At inference time, use the same prompt as above.
	For training, the same prompt format was used, with the class number added as the answer.
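A training record in this format might look like the sketch below. It assumes LLaMA-Factory's multimodal sharegpt-style layout with `messages` and `videos` keys; the author's actual dataset file is not shown in this card, and `make_training_sample` is a hypothetical helper.

```python
import json

# Prompt text copied verbatim from the prompt in this card.
PROMPT = (
    "<video>\nWatch the video and choose the six behaviours that apply to you. "
    "[1. loitering, 2. breaking and entering, 3. abandonment, 4. falling down, 5. fighting, 6. arson]. "
    "Your answer must be a single digit, the number of the behaviour."
)


def make_training_sample(video_path, class_id):
    """Build one record: the user prompt plus the class digit as the assistant answer."""
    if class_id not in range(1, 7):
        raise ValueError("class_id must be between 1 and 6")
    return {
        "messages": [
            {"role": "user", "content": PROMPT},
            {"role": "assistant", "content": str(class_id)},
        ],
        "videos": [video_path],
    }


# Example record; the video path is a placeholder.
sample = make_training_sample("clips/example.mp4", 5)
sample_json = json.dumps(sample, ensure_ascii=False)
```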

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 5e-05
	- train_batch_size: 2
	- eval_batch_size: 8
	- seed: 42
	- gradient_accumulation_steps: 8
	- total_train_batch_size: 16
	- optimizer: adamw_torch with betas=(0.9,0.999) and epsilon=1e-08 (no additional optimizer arguments)
	- lr_scheduler_type: cosine
	- num_epochs: 3.0
	- mixed_precision_training: Native AMP
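Two of the values above follow from the others and can be checked with a short calculation. The sketch assumes a single training device, which is what 2 × 8 = 16 implies, and no warmup, since none is listed. `cosine_lr` is an illustrative helper showing the shape of a cosine decay, not the Trainer's own scheduler code.

```python
import math


def total_train_batch_size(per_device, grad_accum, num_devices=1):
    """Samples consumed per optimizer step."""
    return per_device * grad_accum * num_devices


def cosine_lr(progress, peak_lr=5e-5):
    """Learning rate at a training progress in [0, 1], cosine decay without warmup."""
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


effective_batch = total_train_batch_size(per_device=2, grad_accum=8)
halfway_lr = cosine_lr(0.5)  # half of the peak learning rate
```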

	### Training results



	### Framework versions

	- PEFT 0.12.0
	- Transformers 4.48.2
	- Pytorch 2.6.0+cu124
	- Datasets 3.2.0
	- Tokenizers 0.21.0