---
library_name: peft
license: other
base_model: Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4
tags:
- llama-factory
- lora
- generated_from_trainer
model-index:
- name: Qwen2-VL-2B-Instruct-GPTQ-Int4-LoRA-SurveillanceVideo-Classification-250210
  results: []
pipeline_tag: video-classification
---


# Qwen2-VL-2B-Instruct-GPTQ-Int4-LoRA-SurveillanceVideo-Classification-250210

This model is a fine-tuned version of [Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4) on the Surveillance Video Classification dataset.

## Model description

This model takes a video as input and classifies it into one of the following six classes:
[1. loitering, 2. breaking and entering, 3. abandonment, 4. falling down, 5. fighting, 6. arson].

Training was done with LLaMA-Factory, using the hyperparameters described below.
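
For convenience, the model's single-digit answer can be mapped back to the class label listed above. A minimal sketch (the dictionary and variable names are illustrative, not part of the training setup):

```python
# Maps the model's single-digit answer to the corresponding class label.
CLASS_LABELS = {
    "1": "loitering",
    "2": "breaking and entering",
    "3": "abandonment",
    "4": "falling down",
    "5": "fighting",
    "6": "arson",
}

prediction = "5"  # e.g. the raw text returned by the model
print(CLASS_LABELS.get(prediction.strip(), "unknown"))  # -> "fighting"
```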

## Intended uses & limitations

This model was fine-tuned with the prompt below, and the same prompt should be used when running inference.
```python
# video_path: path to the input video clip (e.g. a local .mp4 file)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "max_pixels": 640 * 360,
                # "fps": 1.0   # maybe default fps = 1.0
            },
            {
                "type": "text",
                "text": (
                    "<video>\nWatch the video and choose the six behaviours that apply to you. "
                    "[1. loitering, 2. breaking and entering, 3. abandonment, 4. falling down, 5. fighting, 6. arson]. "
                    "Your answer must be a single digit, the number of the behaviour."
                )
            }
        ]
    }
]
```
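
For a complete inference run, the sketch below follows the standard Qwen2-VL + `qwen_vl_utils` recipe and attaches the LoRA adapter with PEFT. The adapter repository id and `video_path` are placeholders, and the overall workflow is an assumption based on the base model's usual usage, not something specified in this card:

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from qwen_vl_utils import process_vision_info

base_id = "Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4"
adapter_id = "your-username/Qwen2-VL-2B-Instruct-GPTQ-Int4-LoRA-SurveillanceVideo-Classification-250210"  # placeholder

# Load the GPTQ base model and attach the LoRA adapter.
model = Qwen2VLForConditionalGeneration.from_pretrained(base_id, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(model, adapter_id)
processor = AutoProcessor.from_pretrained(base_id)

video_path = "example.mp4"  # placeholder path to a surveillance clip
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": video_path, "max_pixels": 640 * 360},
            {"type": "text", "text": (
                "<video>\nWatch the video and choose the six behaviours that apply to you. "
                "[1. loitering, 2. breaking and entering, 3. abandonment, 4. falling down, 5. fighting, 6. arson]. "
                "Your answer must be a single digit, the number of the behaviour."
            )},
        ],
    }
]

# Standard Qwen2-VL preprocessing: chat template + video frame extraction.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

# The expected answer is a single digit, so only a few new tokens are needed.
generated_ids = model.generate(**inputs, max_new_tokens=4)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```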

## Training and evaluation data

The training data was sampled from the original video dataset in a class-balanced way, using 100 videos per class
(except for the arson class, which had 65 videos).

Each video was preprocessed to a resolution of 640x360 with fps=3.0,
and the 10-second segment in which the behavior occurs (according to the metadata) was cut out and used for training,
so roughly 30 frames were used per video.
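
The card does not state which tool performed this preprocessing; the sketch below is one way to reproduce the described 640x360, fps=3.0, 10-second cut. The use of ffmpeg, the helper name, and the file paths are assumptions:

```python
import subprocess

def cut_clip(src, dst, start_sec, duration_sec=10, width=640, height=360, fps=3.0):
    """Cut the annotated 10-second segment and resample it to 640x360 at 3 fps (hypothetical helper)."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-ss", str(start_sec),                        # behavior start time from the metadata
            "-i", src,
            "-t", str(duration_sec),                      # keep a 10-second window
            "-vf", f"scale={width}:{height},fps={fps}",   # 640x360 at 3 fps -> ~30 frames
            "-an",                                        # audio is not used
            dst,
        ],
        check=True,
    )

cut_clip("raw/arson_0001.mp4", "clips/arson_0001.mp4", start_sec=42.0)  # hypothetical paths
```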

For inference, the same prompt as above can be used.
For training, the same prompt format was used, with the class number added as the answer (see the sketch below).
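
The training file itself is not included in this card; the record below is a hedged sketch of one training sample in a LLaMA-Factory-style multimodal conversation format, with the class number as the assistant's answer. Field names and the example path are illustrative:

```python
# One training record: user turn = the prompt above, assistant turn = the class number.
sample = {
    "messages": [
        {
            "role": "user",
            "content": (
                "<video>\nWatch the video and choose the six behaviours that apply to you. "
                "[1. loitering, 2. breaking and entering, 3. abandonment, 4. falling down, "
                "5. fighting, 6. arson]. Your answer must be a single digit, the number of the behaviour."
            ),
        },
        {"role": "assistant", "content": "5"},   # this clip was labeled "fighting"
    ],
    "videos": ["clips/fighting_0001.mp4"],       # hypothetical path to the 10-second clip
}
```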

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 2
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 16
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: cosine
- num_epochs: 3.0
- mixed_precision_training: Native AMP

### Training results



### Framework versions

- PEFT 0.12.0
- Transformers 4.48.2
- Pytorch 2.6.0+cu124
- Datasets 3.2.0
- Tokenizers 0.21.0