Jeckmu/Qwen2-VL-2B-Instruct-GPTQ-Int4-lora-SurveillanceVideo-250210

@@ -7,28 +7,64 @@ tags:
 - lora
 - generated_from_trainer
 model-index:
-- name: 250210_Abroad_LoRA_2B_INT4
   results: []
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment. -->
-# 250210_Abroad_LoRA_2B_INT4
-This model is a fine-tuned version of [Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4) on the qwen2_vl_dora dataset.
 ## Model description
-More information needed
 ## Intended uses & limitations
-More information needed
 ## Training and evaluation data
-More information needed
 ## Training procedure

 - lora
 - generated_from_trainer
 model-index:
+- name: Qwen2-VL-2B-Instruct-GPTQ-Int4-LoRA-SurveillanceVideo-Classification-250205
   results: []
+pipeline_tag: video-classification
 ---
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment. -->
+# Qwen2-VL-2B-Instruct-GPTQ-Int4-LoRA-SurveillanceVideo-Classification-250205
+This model is a fine-tuned version of [Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct-GPTQ-Int4) on the Surveillance Video Classification dataset.
 ## Model description
+This model takes a video as input and classifies it into one of the following six classes:
+[1. loitering, 2. breaking and entering, 3. abandonment, 4. falling down, 5. fighting, 6. arson].
+LLaMA-Factory was used for training, with the hyperparameters listed under "Training procedure" below.
 ## Intended uses & limitations
+This model was fine-tuned with the prompt below.
+Use the same prompt when running inference.
+```python
+messages = [
+            {
+                "role": "user",
+                "content": [
+                    {
+                        "type": "video",
+                        "video": video_path,
+                        "max_pixels": 640 * 360,
+                        # "fps": 1.0   # maybe default fps = 1.0
+                    },
+                    {
+                        "type": "text",
+                        "text": (
+                            "<video>\nWatch the video and choose the six behaviours that apply to you. "
+                            "[1. loitering, 2. breaking and entering, 3. abandonment, 4. falling down, 5. fighting, 6. arson]. "
+                            "Your answer must be a single digit, the number of the behaviour."
+                        )
+                    }
+                ]
+            }
+        ]
+```
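Because the prompt asks for a single-digit answer, the decoded model output still has to be mapped back to a class name. The following is a minimal sketch, assuming the generated text has already been decoded to a string; `CLASS_NAMES` and `parse_prediction` are hypothetical helper names and are not part of the original training or inference code.

```python
import re
from typing import Optional

# Class labels in the order listed in the prompt above.
CLASS_NAMES = {
    1: "loitering",
    2: "breaking and entering",
    3: "abandonment",
    4: "falling down",
    5: "fighting",
    6: "arson",
}


def parse_prediction(generated_text: str) -> Optional[str]:
    """Return the class name for the first digit 1-6 in the output, or None if absent."""
    match = re.search(r"[1-6]", generated_text)
    if match is None:
        return None
    return CLASS_NAMES[int(match.group())]
```

Returning `None` instead of raising lets the caller decide how to handle outputs that do not follow the expected format.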
 ## Training and evaluation data
+The training data was sampled from the original video dataset with balanced classes: 100 videos per class
+(except for class 6, arson, which had 65 videos).
+Each video was preprocessed to a resolution of 640x360 with fps=3.0,
+and the 10-second segment in which the behavior occurs, according to the metadata, was cut out and used for training
+(so each clip contributes about 30 frames).
+At inference time, you can use the same prompt as above.
+For training, we used the same prompt format, with the class added as the answer.
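A single training sample in this setup might look like the sketch below. It rests on several assumptions: the ShareGPT-style `messages`/`videos` layout follows LLaMA-Factory's multimodal demo data, the answer is stored as the class digit, and `build_training_record`, `PROMPT`, and the clip path are illustrative names rather than the author's actual script.

```python
import json

# Prompt text copied from the inference example in this card.
PROMPT = (
    "<video>\nWatch the video and choose the six behaviours that apply to you. "
    "[1. loitering, 2. breaking and entering, 3. abandonment, 4. falling down, 5. fighting, 6. arson]. "
    "Your answer must be a single digit, the number of the behaviour."
)

# Preprocessing settings stated in this card.
CLIP_SECONDS = 10
FPS = 3.0
FRAMES_PER_CLIP = int(CLIP_SECONDS * FPS)  # 10 s * 3 fps = 30 frames


def build_training_record(video_path: str, class_id: int) -> dict:
    """Pair the fixed prompt with the class digit as the assistant answer (assumed format)."""
    if not 1 <= class_id <= 6:
        raise ValueError(f"class_id must be in 1..6, got {class_id}")
    return {
        "messages": [
            {"role": "user", "content": PROMPT},
            {"role": "assistant", "content": str(class_id)},
        ],
        "videos": [video_path],
    }


# Hypothetical clip path, used only for illustration.
record = build_training_record("clips/example_fighting.mp4", 5)
print(json.dumps(record, indent=2))
```

Keeping the prompt in one constant makes it harder for the training and inference prompts to drift apart, which matters here because the card states that both must match.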
 ## Training procedure