ParsaKhaz committed on
Commit 45ac42f · verified · 1 Parent(s): b1ccfe2

Upload folder using huggingface_hub

Files changed (6)
  1. .gitignore +51 -0
  2. README.md +187 -7
  3. app.py +159 -0
  4. main.py +578 -0
  5. packages.txt +2 -0
  6. requirements.txt +11 -0
.gitignore ADDED
@@ -0,0 +1,51 @@
1
+ # Python
2
+ __pycache__/
3
+ *.py[cod]
4
+ *$py.class
5
+ *.so
6
+ .Python
7
+ build/
8
+ develop-eggs/
9
+ dist/
10
+ downloads/
11
+ eggs/
12
+ .eggs/
13
+ lib/
14
+ lib64/
15
+ parts/
16
+ sdist/
17
+ var/
18
+ wheels/
19
+ *.egg-info/
20
+ .installed.cfg
21
+ *.egg
22
+
23
+ # Virtual Environment
24
+ venv/
25
+ env/
26
+ ENV/
27
+ .venv/
28
+
29
+ # IDE
30
+ .idea/
31
+ .vscode/
32
+ *.swp
33
+ *.swo
34
+
35
+ # Project specific
36
+ inputs/*
37
+ outputs/*
38
+ !inputs/.gitkeep
39
+ !outputs/.gitkeep
42
+
43
+ # Model files
44
+ *.pth
45
+ *.onnx
46
+ *.pt
47
+
48
+ # Logs
49
+ *.log
50
+
51
+ certificate.pem
README.md CHANGED
@@ -1,12 +1,192 @@
1
  ---
2
- title: Redact Video Demo
3
- emoji: 💻
4
- colorFrom: red
5
- colorTo: purple
6
  sdk: gradio
7
  sdk_version: 5.13.2
8
- app_file: app.py
9
- pinned: false
10
  ---
11
 
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
  ---
2
+ title: redact-video-demo
3
+ app_file: app.py
 
 
4
  sdk: gradio
5
  sdk_version: 5.13.2
 
 
6
  ---
7
+ # Video Object Detection with Moondream
8
+
9
+ This tool uses Moondream2, a powerful yet lightweight vision-language model, to detect and visualize objects in videos. Moondream can recognize a wide variety of objects, people, text, and more with high accuracy while being much smaller than traditional models.
10
+
11
+ ## About Moondream
12
+
13
+ Moondream is a tiny yet powerful vision-language model that can analyze images and answer questions about them. It's designed to be lightweight and efficient while maintaining high accuracy. Some key features:
14
+
15
+ - Only 2B parameters
16
+ - Fast inference with minimal resource requirements
17
+ - Supports CPU and GPU execution
18
+ - Open source and free to use
19
+ - Can detect almost anything you can describe in natural language
20
+
21
+ Links:
22
+ - [GitHub Repository](https://github.com/vikhyat/moondream)
23
+ - [Hugging Face Space](https://huggingface.co/vikhyatk/moondream2)
24
+ - [Python Package](https://pypi.org/project/moondream/)
25
+
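+ For a single image, detection looks roughly like the snippet below. This is a minimal sketch based on `load_moondream` and `model.detect` in `main.py`; the response shape (`{"objects": [...]}` with normalized `x_min`/`y_min`/`x_max`/`y_max`) is what this repo's code assumes, so treat it as illustrative rather than official API documentation.
+
+ ```python
+ from PIL import Image
+ from transformers import AutoModelForCausalLM
+
+ # Same call as load_moondream() in main.py
+ model = AutoModelForCausalLM.from_pretrained(
+     "vikhyatk/moondream2",
+     trust_remote_code=True,
+     device_map={"": "cuda"},  # use "cpu" if no GPU is available
+ )
+
+ image = Image.open("example.jpg")  # hypothetical input image
+ response = model.detect(image, "face")
+
+ # Each object carries normalized coordinates in the 0-1 range
+ for obj in response.get("objects", []):
+     print(obj["x_min"], obj["y_min"], obj["x_max"], obj["y_max"])
+ ```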
26
+ ## Features
27
+
28
+ - Real-time object detection in videos using Moondream2
29
+ - Multiple visualization styles:
30
+ - Censor: Black boxes over detected objects
31
+ - YOLO: Traditional bounding boxes with labels
32
+ - Hitmarker: Call of Duty style crosshair markers
33
+ - Optional grid-based detection for improved accuracy
34
+ - Flexible object type detection using natural language
35
+ - Frame-by-frame processing with IoU-based merging
36
+ - Batch processing of multiple videos
37
+ - Web-compatible output format
38
+ - User-friendly web interface
39
+ - Command-line interface for automation
40
+
41
+ ## Requirements
42
+
43
+ - Python 3.8+
44
+ - OpenCV (cv2)
45
+ - PyTorch
46
+ - Transformers
47
+ - Pillow (PIL)
48
+ - tqdm
49
+ - ffmpeg
50
+ - numpy
51
+ - gradio (for web interface)
52
+
53
+ ## Installation
54
+
55
+ 1. Clone this repository and create a new virtual environment
56
+ ~~~bash
57
+ git clone https://github.com/parsakhaz/object-detect-video.git
+ cd object-detect-video
58
+ python -m venv .venv
59
+ source .venv/bin/activate
60
+ ~~~
61
+ 2. Install the required packages:
62
+ ~~~bash
63
+ pip install -r requirements.txt
64
+ ~~~
65
+ 3. Install ffmpeg (and libvips, which the `pyvips` dependency needs):
66
+ - On Ubuntu/Debian: `sudo apt-get install ffmpeg libvips`
67
+ - On macOS: `brew install ffmpeg`
68
+ - On Windows: Download from [ffmpeg.org](https://ffmpeg.org/download.html)
69
+
70
+ ## Usage
71
+
72
+ ### Web Interface
73
+
74
+ 1. Start the web interface:
75
+ ```bash
76
+ python app.py
77
+ ```
78
+
79
+ 2. Open the provided URL in your browser
80
+
81
+ 3. Use the interface to:
82
+ - Upload your video
83
+ - Specify what to censor (e.g., face, logo, text)
84
+ - Adjust processing speed and quality
85
+ - Configure grid size for detection
86
+ - Process and download the censored video
87
+
88
+ ### Command Line Interface
89
+
90
+ 1. Create an `inputs` directory in the same folder as the script:
91
+ ~~~bash
92
+ mkdir inputs
93
+ ~~~
94
+
95
+ 2. Place your video files in the `inputs` directory. Supported formats:
96
+ - .mp4
97
+ - .avi
98
+ - .mov
99
+ - .mkv
100
+ - .webm
101
+
102
+ 3. Run the script:
103
+ ~~~bash
104
+ python main.py
105
+ ~~~
106
+
107
+ ### Optional Arguments:
108
+ - `--test`: Process only first 3 seconds of each video (useful for testing detection settings)
109
+ ~~~bash
110
+ python main.py --test
111
+ ~~~
112
+
113
+ - `--preset`: Choose FFmpeg encoding preset (affects output quality vs. speed)
114
+ ~~~bash
115
+ python main.py --preset ultrafast # Fastest, lower quality
116
+ python main.py --preset veryslow # Slowest, highest quality
117
+ ~~~
118
+
119
+ - `--detect`: Specify what object type to detect (using natural language)
120
+ ```bash
121
+ python main.py --detect person # Detect people
122
+ python main.py --detect "red car" # Detect red cars
123
+ python main.py --detect "person wearing a hat" # Detect people with hats
124
+ ```
125
+
126
+ - `--box-style`: Choose visualization style
127
+ ```bash
128
+ python main.py --box-style censor # Black boxes (default)
129
+ python main.py --box-style yolo # YOLO-style boxes with labels
130
+ python main.py --box-style hitmarker # COD-style hitmarkers
131
+ ```
132
+
133
+ - `--rows` and `--cols`: Enable grid-based detection by splitting each frame into tiles; overlapping tile detections are merged by IoU (see the sketch after this list)
134
+ ~~~bash
135
+ python main.py --rows 2 --cols 2 # Split each frame into 2x2 grid
136
+ python main.py --rows 3 --cols 3 # Split each frame into 3x3 grid
137
+ ~~~
138
+
139
+ You can combine arguments:
140
+ ```bash
141
+ python main.py --detect "person wearing sunglasses" --box-style yolo --test --preset "fast" --rows 2 --cols 2
142
+ ```
143
+
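+ Grid mode runs detection on each tile separately and then merges overlapping boxes across tiles. The merge in `merge_tile_detections` is NMS-like: boxes are sorted by area, and a box is dropped when its IoU with an already-kept box exceeds the threshold (0.5 by default). A simplified sketch of the IoU computation on normalized `[x1, y1, x2, y2]` boxes:
+
+ ```python
+ def iou(box_a, box_b):
+     """Intersection-over-union of two normalized [x1, y1, x2, y2] boxes."""
+     x1 = max(box_a[0], box_b[0])
+     y1 = max(box_a[1], box_b[1])
+     x2 = min(box_a[2], box_b[2])
+     y2 = min(box_a[3], box_b[3])
+     inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
+     area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
+     area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
+     return inter / (area_a + area_b - inter)
+
+ print(iou([0.0, 0.0, 0.5, 0.5], [0.25, 0.25, 0.75, 0.75]))  # ~0.14
+ ```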
144
+ ### Visualization Styles
145
+
146
+ The tool supports three different visualization styles for detected objects:
147
+
148
+ 1. **Censor** (default)
149
+ - Places solid black rectangles over detected objects
150
+ - Best for privacy and content moderation
151
+ - Completely obscures the detected region
152
+
153
+ 2. **YOLO**
154
+ - Traditional object detection style
155
+ - Red bounding box around detected objects
156
+ - Label showing object type above the box
157
+ - Good for analysis and debugging
158
+
159
+ 3. **Hitmarker**
160
+ - Call of Duty inspired visualization
161
+ - White crosshair marker at center of detected objects
162
+ - Small label above the marker
163
+ - Stylistic choice for gaming-inspired visualization
164
+
165
+ Choose the style that best fits your use case using the `--box-style` argument.
166
+
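+ As a rough picture of what each style does to a frame, here is a simplified sketch of the drawing logic in `draw_ad_boxes` (`main.py`), using pixel coordinates `x1, y1, x2, y2` for one detection; the real hitmarker is drawn line by line with a gap in the middle, approximated here with `cv2.drawMarker`:
+
+ ```python
+ import cv2
+
+ def draw_style(frame, x1, y1, x2, y2, label, box_style="censor"):
+     """Minimal illustration of the three --box-style options."""
+     if box_style == "censor":
+         # Solid black rectangle fully covers the detection
+         cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 0, 0), -1)
+     elif box_style == "yolo":
+         # Red bounding box with the keyword drawn above it
+         cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 0, 255), 3)
+         cv2.putText(frame, label, (x1, y1 - 6), cv2.FONT_HERSHEY_SIMPLEX,
+                     0.7, (255, 255, 255), 2, cv2.LINE_AA)
+     elif box_style == "hitmarker":
+         # White crosshair at the box center
+         cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
+         cv2.drawMarker(frame, (cx, cy), (255, 255, 255),
+                        markerType=cv2.MARKER_TILTED_CROSS, markerSize=20, thickness=2)
+     return frame
+ ```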
167
+ ## Output
168
+
169
+ Processed videos will be saved in the `outputs` directory with the format:
170
+ `[style]_[object_type]_[original_filename].mp4`
171
+
172
+ For example:
173
+ - `censor_face_video.mp4`
174
+ - `yolo_person_video.mp4`
175
+ - `hitmarker_car_video.mp4`
176
+
177
+ The output videos will include:
178
+ - Original video content
179
+ - Selected visualization style for detected objects
180
+ - Web-compatible H.264 encoding (see the FFmpeg sketch below)
181
+
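+ The web-friendly encode is the last step of `create_detection_video`: the OpenCV-written temp file is re-encoded with FFmpeg. The call boils down to something like this sketch (paths are illustrative; the preset comes from `--preset`):
+
+ ```python
+ import subprocess
+
+ # Re-encode the mp4v temp file to H.264 with faststart for web playback
+ subprocess.run([
+     "ffmpeg", "-y",
+     "-i", "outputs/censor_face_video_temp.mp4",
+     "-c:v", "libx264",
+     "-preset", "medium",
+     "-crf", "23",
+     "-movflags", "+faststart",   # move metadata up front for streaming
+     "-loglevel", "error",
+     "outputs/censor_face_video.mp4",
+ ], check=True)
+ ```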
182
+ ## Notes
183
 
184
+ - Processing time depends on video length, grid size, and GPU availability
185
+ - GPU is strongly recommended for faster processing
186
+ - Requires sufficient disk space for temporary files
187
+ - Detection quality may vary based on object type and video quality
188
+ - Detection accuracy depends on Moondream2's ability to recognize the specified object type
189
+ - Grid-based detection should only be used when necessary due to significant performance impact
190
+ - Web interface provides real-time progress updates and error messages
191
+ - Different visualization styles may be more suitable for different use cases
192
+ - Moondream can detect almost anything you can describe in natural language
app.py ADDED
@@ -0,0 +1,159 @@
1
+ #!/usr/bin/env python3
2
+ import gradio as gr
3
+ import os
4
+ from main import load_moondream, process_video
5
+ import tempfile
6
+ import shutil
7
+
8
+ # Get absolute path to workspace root
9
+ WORKSPACE_ROOT = os.path.dirname(os.path.abspath(__file__))
10
+
11
+ # Initialize model globally for reuse
12
+ print("Loading Moondream model...")
13
+ model, tokenizer = load_moondream()
14
+
15
+ def process_video_file(video_file, detect_keyword, box_style, ffmpeg_preset, rows, cols, test_mode):
16
+ """Process a video file through the Gradio interface."""
17
+ try:
18
+ if not video_file:
19
+ raise gr.Error("Please upload a video file")
20
+
21
+ # Ensure input/output directories exist using absolute paths
22
+ inputs_dir = os.path.join(WORKSPACE_ROOT, 'inputs')
23
+ outputs_dir = os.path.join(WORKSPACE_ROOT, 'outputs')
24
+ os.makedirs(inputs_dir, exist_ok=True)
25
+ os.makedirs(outputs_dir, exist_ok=True)
26
+
27
+ # Copy uploaded video to inputs directory
28
+ video_filename = f"input_{os.path.basename(video_file)}"
29
+ input_video_path = os.path.join(inputs_dir, video_filename)
30
+ shutil.copy2(video_file, input_video_path)
31
+
32
+ try:
33
+ # Process the video
34
+ output_path = process_video(
35
+ input_video_path,
36
+ detect_keyword,
37
+ test_mode=test_mode,
38
+ ffmpeg_preset=ffmpeg_preset,
39
+ rows=rows,
40
+ cols=cols,
41
+ box_style=box_style
42
+ )
43
+
44
+ # Verify output exists and is readable
45
+ if not output_path or not os.path.exists(output_path):
46
+ print(f"Warning: Output path {output_path} does not exist")
47
+ # Try to find the output based on expected naming convention
48
+ expected_output = os.path.join(outputs_dir, f'{box_style}_{detect_keyword}_{video_filename}')
49
+ if os.path.exists(expected_output):
50
+ output_path = expected_output
51
+ else:
52
+ # Try searching in outputs directory for any matching file
53
+ matching_files = [f for f in os.listdir(outputs_dir) if f.startswith(f'{box_style}_{detect_keyword}_')]
54
+ if matching_files:
55
+ output_path = os.path.join(outputs_dir, matching_files[0])
56
+ else:
57
+ raise gr.Error("Failed to locate output video")
58
+
59
+ # Convert output path to absolute path if it isn't already
60
+ if not os.path.isabs(output_path):
61
+ output_path = os.path.join(WORKSPACE_ROOT, output_path)
62
+
63
+ print(f"Returning output path: {output_path}")
64
+ return output_path
65
+
66
+ finally:
67
+ # Clean up input file
68
+ try:
69
+ if os.path.exists(input_video_path):
70
+ os.remove(input_video_path)
71
+ except:
72
+ pass
73
+
74
+ except Exception as e:
75
+ print(f"Error in process_video_file: {str(e)}")
76
+ raise gr.Error(f"Error processing video: {str(e)}")
77
+
78
+ # Create the Gradio interface
79
+ with gr.Blocks(title="Video Object Detection with Moondream") as app:
80
+ gr.Markdown("# Video Object Detection with Moondream")
81
+ gr.Markdown("""
82
+ This app uses [Moondream](https://github.com/vikhyat/moondream), a powerful yet lightweight vision-language model,
83
+ to detect and visualize objects in videos. Moondream can recognize a wide variety of objects, people, text, and more
84
+ with high accuracy while being much smaller than traditional models.
85
+
86
+ Upload a video and specify what you want to detect. The app will process each frame using Moondream and visualize
87
+ the detections using your chosen style.
88
+ """)
89
+
90
+ with gr.Row():
91
+ with gr.Column():
92
+ # Input components
93
+ video_input = gr.Video(label="Upload Video")
94
+ detect_input = gr.Textbox(
95
+ label="What to Detect",
96
+ placeholder="e.g. face, logo, text, person, car, dog, etc.",
97
+ value="face",
98
+ info="Moondream can detect almost anything you can describe in natural language"
99
+ )
100
+ box_style_input = gr.Radio(
101
+ choices=['censor', 'yolo', 'hitmarker'],
102
+ value='censor',
103
+ label="Visualization Style",
104
+ info="Choose how to display detections"
105
+ )
106
+ preset_input = gr.Dropdown(
107
+ choices=['ultrafast', 'superfast', 'veryfast', 'faster', 'fast', 'medium', 'slow', 'slower', 'veryslow'],
108
+ value='medium',
109
+ label="Processing Speed (faster = lower quality)"
110
+ )
111
+ with gr.Row():
112
+ rows_input = gr.Slider(minimum=1, maximum=4, value=1, step=1, label="Grid Rows")
113
+ cols_input = gr.Slider(minimum=1, maximum=4, value=1, step=1, label="Grid Columns")
114
+
115
+ test_mode_input = gr.Checkbox(
116
+ label="Test Mode (Process first 3 seconds only)",
117
+ value=True,
118
+ info="Enable to quickly test settings on a short clip before processing the full video (recommended)"
119
+ )
120
+
121
+ process_btn = gr.Button("Process Video", variant="primary")
122
+ gr.Markdown("""
123
+ Note: Test mode processes only the first 3 seconds of the video and is recommended for trying out settings.
124
+ """)
125
+
126
+ gr.Markdown("""
127
+ You can get a rough processing-time estimate by multiplying the video's framerate × duration in seconds × grid rows × grid columns, assuming about 0.12 seconds per detection.
128
+ For example, for a 3-second video at 30fps with a 2x2 grid, the estimate is 3 * 30 * 2 * 2 * 0.12 = 43.2 seconds (measured on a 4090 GPU).
129
+ """)
130
+
131
+ with gr.Column():
132
+ # Output components
133
+ video_output = gr.Video(label="Processed Video")
134
+
135
+ # About section under the video output
136
+ gr.Markdown("""
137
+ ### About Moondream
138
+ Moondream is a tiny yet powerful vision-language model that can analyze images and answer questions about them.
139
+ It's designed to be lightweight and efficient while maintaining high accuracy. Some key features:
140
+ - Only 2B parameters (compared to 80B+ in other models)
141
+ - Fast inference with minimal resource requirements
142
+ - Supports CPU and GPU execution
143
+ - Open source and free to use
144
+
145
+ Links:
146
+ - [GitHub Repository](https://github.com/vikhyat/moondream)
147
+ - [Hugging Face Space](https://huggingface.co/vikhyatk/moondream2)
148
+ - [Python Package](https://pypi.org/project/moondream/)
149
+ """)
150
+
151
+ # Event handlers
152
+ process_btn.click(
153
+ fn=process_video_file,
154
+ inputs=[video_input, detect_input, box_style_input, preset_input, rows_input, cols_input, test_mode_input],
155
+ outputs=video_output
156
+ )
157
+
158
+ if __name__ == "__main__":
159
+ app.launch(share=True)
main.py ADDED
@@ -0,0 +1,578 @@
1
+ #!/usr/bin/env python3
2
+ import cv2, os, subprocess, argparse
3
+ from PIL import Image
4
+ import torch
5
+ from transformers import AutoModelForCausalLM, AutoTokenizer
6
+ from tqdm import tqdm
7
+ import numpy as np
8
+ from datetime import datetime
9
+
10
+ # Constants
11
+ TEST_MODE_DURATION = 3 # Process only first 3 seconds in test mode
12
+ FFMPEG_PRESETS = ['ultrafast', 'superfast', 'veryfast', 'faster', 'fast', 'medium', 'slow', 'slower', 'veryslow']
13
+ FONT = cv2.FONT_HERSHEY_SIMPLEX # Font for YOLO-style labels
14
+
15
+ # Detection parameters
16
+ IOU_THRESHOLD = 0.5 # IoU threshold for considering boxes related
17
+
18
+ # Hitmarker parameters
19
+ HITMARKER_SIZE = 20 # Size of the hitmarker in pixels
20
+ HITMARKER_GAP = 3 # Size of the empty space in the middle (reduced from 8)
21
+ HITMARKER_THICKNESS = 2 # Thickness of hitmarker lines
22
+ HITMARKER_COLOR = (255, 255, 255) # White color for hitmarker
23
+ HITMARKER_SHADOW_COLOR = (80, 80, 80) # Lighter gray for shadow effect
24
+ HITMARKER_SHADOW_OFFSET = 1 # Smaller shadow offset
25
+
26
+ def load_moondream():
27
+ """Load Moondream model and tokenizer."""
28
+ model = AutoModelForCausalLM.from_pretrained(
29
+ "vikhyatk/moondream2",
30
+ trust_remote_code=True,
31
+ device_map={"": "cuda"}
32
+ )
33
+ tokenizer = AutoTokenizer.from_pretrained("vikhyatk/moondream2")
34
+ return model, tokenizer
35
+
36
+ def get_video_properties(video_path):
37
+ """Get basic video properties."""
38
+ video = cv2.VideoCapture(video_path)
39
+ fps = video.get(cv2.CAP_PROP_FPS)
40
+ frame_count = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
41
+ width = int(video.get(cv2.CAP_PROP_FRAME_WIDTH))
42
+ height = int(video.get(cv2.CAP_PROP_FRAME_HEIGHT))
43
+ video.release()
44
+ return {'fps': fps, 'frame_count': frame_count, 'width': width, 'height': height}
45
+
46
+ def is_valid_box(box):
47
+ """Check if box coordinates are reasonable."""
48
+ x1, y1, x2, y2 = box
49
+ width = x2 - x1
50
+ height = y2 - y1
51
+
52
+ # Reject boxes that are too large (over 90% of frame in both dimensions)
53
+ if width > 0.9 and height > 0.9:
54
+ return False
55
+
56
+ # Reject boxes that are too small (less than 1% of frame)
57
+ if width < 0.01 or height < 0.01:
58
+ return False
59
+
60
+ return True
61
+
62
+ def split_frame_into_tiles(frame, rows, cols):
63
+ """Split a frame into a grid of tiles."""
64
+ height, width = frame.shape[:2]
65
+ tile_height = height // rows
66
+ tile_width = width // cols
67
+ tiles = []
68
+ tile_positions = []
69
+
70
+ for i in range(rows):
71
+ for j in range(cols):
72
+ y1 = i * tile_height
73
+ y2 = (i + 1) * tile_height if i < rows - 1 else height
74
+ x1 = j * tile_width
75
+ x2 = (j + 1) * tile_width if j < cols - 1 else width
76
+
77
+ tile = frame[y1:y2, x1:x2]
78
+ tiles.append(tile)
79
+ tile_positions.append((x1, y1, x2, y2))
80
+
81
+ return tiles, tile_positions
82
+
83
+ def convert_tile_coords_to_frame(box, tile_pos, frame_shape):
84
+ """Convert coordinates from tile space to frame space."""
85
+ frame_height, frame_width = frame_shape[:2]
86
+ tile_x1, tile_y1, tile_x2, tile_y2 = tile_pos
87
+ tile_width = tile_x2 - tile_x1
88
+ tile_height = tile_y2 - tile_y1
89
+
90
+ x1_tile_abs = box[0] * tile_width
91
+ y1_tile_abs = box[1] * tile_height
92
+ x2_tile_abs = box[2] * tile_width
93
+ y2_tile_abs = box[3] * tile_height
94
+
95
+ x1_frame_abs = tile_x1 + x1_tile_abs
96
+ y1_frame_abs = tile_y1 + y1_tile_abs
97
+ x2_frame_abs = tile_x1 + x2_tile_abs
98
+ y2_frame_abs = tile_y1 + y2_tile_abs
99
+
100
+ x1_norm = x1_frame_abs / frame_width
101
+ y1_norm = y1_frame_abs / frame_height
102
+ x2_norm = x2_frame_abs / frame_width
103
+ y2_norm = y2_frame_abs / frame_height
104
+
105
+ x1_norm = max(0.0, min(1.0, x1_norm))
106
+ y1_norm = max(0.0, min(1.0, y1_norm))
107
+ x2_norm = max(0.0, min(1.0, x2_norm))
108
+ y2_norm = max(0.0, min(1.0, y2_norm))
109
+
110
+ return [x1_norm, y1_norm, x2_norm, y2_norm]
111
+
112
+ def merge_tile_detections(tile_detections, iou_threshold=0.5):
113
+ """Merge detections from different tiles using NMS-like approach."""
114
+ if not tile_detections:
115
+ return []
116
+
117
+ all_boxes = []
118
+ all_keywords = []
119
+
120
+ # Collect all boxes and their keywords
121
+ for detections in tile_detections:
122
+ for box, keyword in detections:
123
+ all_boxes.append(box)
124
+ all_keywords.append(keyword)
125
+
126
+ if not all_boxes:
127
+ return []
128
+
129
+ # Convert to numpy for easier processing
130
+ boxes = np.array(all_boxes)
131
+
132
+ # Calculate areas
133
+ x1 = boxes[:, 0]
134
+ y1 = boxes[:, 1]
135
+ x2 = boxes[:, 2]
136
+ y2 = boxes[:, 3]
137
+ areas = (x2 - x1) * (y2 - y1)
138
+
139
+ # Sort boxes by area
140
+ order = areas.argsort()[::-1]
141
+
142
+ keep = []
143
+ while order.size > 0:
144
+ i = order[0]
145
+ keep.append(i)
146
+
147
+ if order.size == 1:
148
+ break
149
+
150
+ # Calculate IoU with rest of boxes
151
+ xx1 = np.maximum(x1[i], x1[order[1:]])
152
+ yy1 = np.maximum(y1[i], y1[order[1:]])
153
+ xx2 = np.minimum(x2[i], x2[order[1:]])
154
+ yy2 = np.minimum(y2[i], y2[order[1:]])
155
+
156
+ w = np.maximum(0.0, xx2 - xx1)
157
+ h = np.maximum(0.0, yy2 - yy1)
158
+ inter = w * h
159
+
160
+ ovr = inter / (areas[i] + areas[order[1:]] - inter)
161
+
162
+ # Get indices of boxes with IoU less than threshold
163
+ inds = np.where(ovr <= iou_threshold)[0]
164
+ order = order[inds + 1]
165
+
166
+ return [(all_boxes[i], all_keywords[i]) for i in keep]
167
+
168
+ def detect_ads_in_frame(model, tokenizer, image, detect_keyword, rows=1, cols=1):
169
+ """Detect objects in a frame using grid-based detection."""
170
+ if rows == 1 and cols == 1:
171
+ return detect_ads_in_frame_single(model, tokenizer, image, detect_keyword)
172
+
173
+ # Convert the frame from BGR to RGB; each tile is wrapped in a PIL Image below
174
+ if not isinstance(image, Image.Image):
175
+ image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
176
+
177
+ # Split frame into tiles
178
+ tiles, tile_positions = split_frame_into_tiles(image, rows, cols)
179
+
180
+ # Process each tile
181
+ tile_detections = []
182
+ for tile, tile_pos in zip(tiles, tile_positions):
183
+ # Convert tile to PIL Image
184
+ tile_pil = Image.fromarray(tile)
185
+
186
+ # Detect objects in tile
187
+ response = model.detect(tile_pil, detect_keyword)
188
+
189
+ if response and "objects" in response and response["objects"]:
190
+ objects = response["objects"]
191
+ tile_objects = []
192
+
193
+ for obj in objects:
194
+ if all(k in obj for k in ['x_min', 'y_min', 'x_max', 'y_max']):
195
+ box = [
196
+ obj['x_min'],
197
+ obj['y_min'],
198
+ obj['x_max'],
199
+ obj['y_max']
200
+ ]
201
+
202
+ if is_valid_box(box):
203
+ # Convert tile coordinates to frame coordinates
204
+ frame_box = convert_tile_coords_to_frame(box, tile_pos, image.shape)
205
+ tile_objects.append((frame_box, detect_keyword))
206
+
207
+ if tile_objects: # Only append if we found valid objects
208
+ tile_detections.append(tile_objects)
209
+
210
+ # Merge detections from all tiles
211
+ merged_detections = merge_tile_detections(tile_detections)
212
+ return merged_detections
213
+
214
+ def detect_ads_in_frame_single(model, tokenizer, image, detect_keyword):
215
+ """Single-frame detection function."""
216
+ detected_objects = []
217
+
218
+ # Convert numpy array to PIL Image if needed
219
+ if not isinstance(image, Image.Image):
220
+ image = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
221
+
222
+ # Detect objects
223
+ response = model.detect(image, detect_keyword)
224
+
225
+ # Check if we have valid objects
226
+ if response and "objects" in response and response["objects"]:
227
+ objects = response["objects"]
228
+
229
+ for obj in objects:
230
+ if all(k in obj for k in ['x_min', 'y_min', 'x_max', 'y_max']):
231
+ box = [
232
+ obj['x_min'],
233
+ obj['y_min'],
234
+ obj['x_max'],
235
+ obj['y_max']
236
+ ]
237
+ # If box is valid (not full-frame), add it
238
+ if is_valid_box(box):
239
+ detected_objects.append((box, detect_keyword))
240
+
241
+ return detected_objects
242
+
243
+ def draw_hitmarker(frame, center_x, center_y, size=HITMARKER_SIZE, color=HITMARKER_COLOR, shadow=True):
244
+ """Draw a COD-style hitmarker cross with more space in the middle."""
245
+ half_size = size // 2
246
+
247
+ # Draw shadow first if enabled
248
+ if shadow:
249
+ # Top-left to center shadow
250
+ cv2.line(frame,
251
+ (center_x - half_size + HITMARKER_SHADOW_OFFSET, center_y - half_size + HITMARKER_SHADOW_OFFSET),
252
+ (center_x - HITMARKER_GAP + HITMARKER_SHADOW_OFFSET, center_y - HITMARKER_GAP + HITMARKER_SHADOW_OFFSET),
253
+ HITMARKER_SHADOW_COLOR, HITMARKER_THICKNESS)
254
+ # Top-right to center shadow
255
+ cv2.line(frame,
256
+ (center_x + half_size + HITMARKER_SHADOW_OFFSET, center_y - half_size + HITMARKER_SHADOW_OFFSET),
257
+ (center_x + HITMARKER_GAP + HITMARKER_SHADOW_OFFSET, center_y - HITMARKER_GAP + HITMARKER_SHADOW_OFFSET),
258
+ HITMARKER_SHADOW_COLOR, HITMARKER_THICKNESS)
259
+ # Bottom-left to center shadow
260
+ cv2.line(frame,
261
+ (center_x - half_size + HITMARKER_SHADOW_OFFSET, center_y + half_size + HITMARKER_SHADOW_OFFSET),
262
+ (center_x - HITMARKER_GAP + HITMARKER_SHADOW_OFFSET, center_y + HITMARKER_GAP + HITMARKER_SHADOW_OFFSET),
263
+ HITMARKER_SHADOW_COLOR, HITMARKER_THICKNESS)
264
+ # Bottom-right to center shadow
265
+ cv2.line(frame,
266
+ (center_x + half_size + HITMARKER_SHADOW_OFFSET, center_y + half_size + HITMARKER_SHADOW_OFFSET),
267
+ (center_x + HITMARKER_GAP + HITMARKER_SHADOW_OFFSET, center_y + HITMARKER_GAP + HITMARKER_SHADOW_OFFSET),
268
+ HITMARKER_SHADOW_COLOR, HITMARKER_THICKNESS)
269
+
270
+ # Draw main hitmarker
271
+ # Top-left to center
272
+ cv2.line(frame,
273
+ (center_x - half_size, center_y - half_size),
274
+ (center_x - HITMARKER_GAP, center_y - HITMARKER_GAP),
275
+ color, HITMARKER_THICKNESS)
276
+ # Top-right to center
277
+ cv2.line(frame,
278
+ (center_x + half_size, center_y - half_size),
279
+ (center_x + HITMARKER_GAP, center_y - HITMARKER_GAP),
280
+ color, HITMARKER_THICKNESS)
281
+ # Bottom-left to center
282
+ cv2.line(frame,
283
+ (center_x - half_size, center_y + half_size),
284
+ (center_x - HITMARKER_GAP, center_y + HITMARKER_GAP),
285
+ color, HITMARKER_THICKNESS)
286
+ # Bottom-right to center
287
+ cv2.line(frame,
288
+ (center_x + half_size, center_y + half_size),
289
+ (center_x + HITMARKER_GAP, center_y + HITMARKER_GAP),
290
+ color, HITMARKER_THICKNESS)
291
+
292
+ def draw_ad_boxes(frame, detected_objects, detect_keyword, box_style='censor'):
293
+ """Draw detection visualizations over detected objects.
294
+
295
+ Args:
296
+ frame: The video frame to draw on
297
+ detected_objects: List of (box, keyword) tuples
298
+ detect_keyword: The detection keyword
299
+ box_style: Visualization style ('censor', 'yolo', or 'hitmarker')
300
+ """
301
+ height, width = frame.shape[:2]
302
+
303
+ for (box, keyword) in detected_objects:
304
+ try:
305
+ # Convert normalized coordinates to pixel coordinates
306
+ x1 = int(box[0] * width)
307
+ y1 = int(box[1] * height)
308
+ x2 = int(box[2] * width)
309
+ y2 = int(box[3] * height)
310
+
311
+ # Ensure coordinates are within frame boundaries
312
+ x1 = max(0, min(x1, width-1))
313
+ y1 = max(0, min(y1, height-1))
314
+ x2 = max(0, min(x2, width-1))
315
+ y2 = max(0, min(y2, height-1))
316
+
317
+ # Only draw if box has reasonable size
318
+ if x2 > x1 and y2 > y1:
319
+ if box_style == 'censor':
320
+ # Draw solid black rectangle
321
+ cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 0, 0), -1)
322
+ elif box_style == 'yolo':
323
+ # Draw red rectangle with thicker line
324
+ cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 0, 255), 3)
325
+
326
+ # Add label with background
327
+ label = detect_keyword # Use exact capitalization
328
+ label_size = cv2.getTextSize(label, FONT, 0.7, 2)[0]
329
+ cv2.rectangle(frame, (x1, y1-25), (x1 + label_size[0], y1), (0, 0, 255), -1)
330
+ cv2.putText(frame, label, (x1, y1-6), FONT, 0.7, (255, 255, 255), 2, cv2.LINE_AA)
331
+ elif box_style == 'hitmarker':
332
+ # Calculate center of the box
333
+ center_x = (x1 + x2) // 2
334
+ center_y = (y1 + y2) // 2
335
+
336
+ # Draw hitmarker at the center
337
+ draw_hitmarker(frame, center_x, center_y)
338
+
339
+ # Optional: Add small label above hitmarker
340
+ label = detect_keyword # Use exact capitalization
341
+ label_size = cv2.getTextSize(label, FONT, 0.5, 1)[0]
342
+ cv2.putText(frame, label,
343
+ (center_x - label_size[0]//2, center_y - HITMARKER_SIZE - 5),
344
+ FONT, 0.5, HITMARKER_COLOR, 1, cv2.LINE_AA)
345
+ except Exception as e:
346
+ print(f"Error drawing {box_style} style box: {str(e)}")
347
+
348
+ return frame
349
+
350
+ def filter_temporal_outliers(detections_dict):
351
+ """Filter out extremely large detections that take up most of the frame.
352
+ Only keeps detections that are reasonable in size.
353
+
354
+ Args:
355
+ detections_dict: Dictionary of {frame_number: [(box, keyword), ...]}
356
+ """
357
+ filtered_detections = {}
358
+
359
+ for t, detections in detections_dict.items():
360
+ # Only keep detections that aren't too large
361
+ valid_detections = []
362
+ for box, keyword in detections:
363
+ # Calculate box size as percentage of frame
364
+ width = box[2] - box[0]
365
+ height = box[3] - box[1]
366
+ area = width * height
367
+
368
+ # If box is less than 90% of frame, keep it
369
+ if area < 0.9:
370
+ valid_detections.append((box, keyword))
371
+
372
+ if valid_detections:
373
+ filtered_detections[t] = valid_detections
374
+
375
+ return filtered_detections
376
+
377
+ def describe_frames(video_path, model, tokenizer, detect_keyword, test_mode=False, rows=1, cols=1):
378
+ """Extract and detect objects in frames."""
379
+ props = get_video_properties(video_path)
380
+ fps = props['fps']
381
+
382
+ # If in test mode, only process first 3 seconds
383
+ if test_mode:
384
+ frame_count = min(int(fps * TEST_MODE_DURATION), props['frame_count'])
385
+ else:
386
+ frame_count = props['frame_count']
387
+
388
+ ad_detections = {} # Store detection results by frame number
389
+
390
+ print("Extracting frames and detecting objects...")
391
+ video = cv2.VideoCapture(video_path)
392
+
393
+ # Process every frame
394
+ frame_count_processed = 0
395
+ with tqdm(total=frame_count) as pbar:
396
+ while frame_count_processed < frame_count:
397
+ ret, frame = video.read()
398
+ if not ret:
399
+ break
400
+
401
+ # Detect objects in the frame
402
+ detected_objects = detect_ads_in_frame(model, tokenizer, frame, detect_keyword, rows=rows, cols=cols)
403
+
404
+ # Store results for every frame, even if empty
405
+ ad_detections[frame_count_processed] = detected_objects
406
+
407
+ frame_count_processed += 1
408
+ pbar.update(1)
409
+
410
+ video.release()
411
+
412
+ if frame_count_processed == 0:
413
+ print("No frames could be read from video")
414
+ return {}
415
+
416
+ # Filter out only extremely large detections
417
+ ad_detections = filter_temporal_outliers(ad_detections)
418
+ return ad_detections
419
+
420
+ def create_detection_video(video_path, ad_detections, detect_keyword, output_path=None, ffmpeg_preset='medium', test_mode=False, box_style='censor'):
421
+ """Create video with detection boxes."""
422
+ if output_path is None:
423
+ # Create outputs directory if it doesn't exist
424
+ outputs_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'outputs')
425
+ os.makedirs(outputs_dir, exist_ok=True)
426
+
427
+ # Clean the detect_keyword for filename
428
+ safe_keyword = "".join(x for x in detect_keyword if x.isalnum() or x in (' ', '_', '-'))
429
+ safe_keyword = safe_keyword.replace(' ', '_')
430
+
431
+ # Create output filename
432
+ base_name = os.path.splitext(os.path.basename(video_path))[0]
433
+ output_path = os.path.join(outputs_dir, f'{box_style}_{safe_keyword}_{base_name}.mp4')
434
+
435
+ print(f"Will save output to: {output_path}")
436
+
437
+ props = get_video_properties(video_path)
438
+ fps, width, height = props['fps'], props['width'], props['height']
439
+
440
+ # If in test mode, only process first few seconds
441
+ if test_mode:
442
+ frame_count = min(int(fps * TEST_MODE_DURATION), props['frame_count'])
443
+ else:
444
+ frame_count = props['frame_count']
445
+
446
+ video = cv2.VideoCapture(video_path)
447
+
448
+ # Create temp output path by adding _temp before the extension
449
+ base, ext = os.path.splitext(output_path)
450
+ temp_output = f"{base}_temp{ext}"
451
+
452
+ out = cv2.VideoWriter(
453
+ temp_output,
454
+ cv2.VideoWriter_fourcc(*'mp4v'),
455
+ fps,
456
+ (width, height)
457
+ )
458
+
459
+ print("Creating detection video...")
460
+ frame_count_processed = 0
461
+
462
+ with tqdm(total=frame_count) as pbar:
463
+ while frame_count_processed < frame_count:
464
+ ret, frame = video.read()
465
+ if not ret:
466
+ break
467
+
468
+ # Get detections for this exact frame
469
+ if frame_count_processed in ad_detections:
470
+ current_detections = ad_detections[frame_count_processed]
471
+ if current_detections:
472
+ frame = draw_ad_boxes(frame, current_detections, detect_keyword, box_style=box_style)
473
+
474
+ out.write(frame)
475
+ frame_count_processed += 1
476
+ pbar.update(1)
477
+
478
+ video.release()
479
+ out.release()
480
+
481
+ # Convert to web-compatible format more efficiently
482
+ try:
483
+ subprocess.run([
484
+ 'ffmpeg', '-y',
485
+ '-i', temp_output,
486
+ '-c:v', 'libx264',
487
+ '-preset', ffmpeg_preset,
488
+ '-crf', '23',
489
+ '-movflags', '+faststart', # Better web playback
490
+ '-loglevel', 'error',
491
+ output_path
492
+ ], check=True)
493
+
494
+ os.remove(temp_output) # Remove the temporary file
495
+
496
+ if not os.path.exists(output_path):
497
+ print(f"Warning: FFmpeg completed but output file not found at {output_path}")
498
+ return None
499
+
500
+ return output_path
501
+
502
+ except subprocess.CalledProcessError as e:
503
+ print(f"Error running FFmpeg: {str(e)}")
504
+ if os.path.exists(temp_output):
505
+ os.remove(temp_output)
506
+ return None
507
+
508
+ def process_video(video_path, detect_keyword, test_mode=False, ffmpeg_preset='medium', rows=1, cols=1, box_style='censor'):
509
+ """Process a single video file."""
510
+ print(f"\nProcessing: {video_path}")
511
+ print(f"Looking for: {detect_keyword}")
512
+
513
+ # Load model
514
+ print("Loading Moondream model...")
515
+ model, tokenizer = load_moondream()
516
+
517
+ # Process video - detect objects
518
+ ad_detections = describe_frames(video_path, model, tokenizer, detect_keyword, test_mode, rows, cols)
519
+
520
+ # Create video with detection boxes
521
+ output_path = create_detection_video(video_path, ad_detections, detect_keyword,
522
+ ffmpeg_preset=ffmpeg_preset, test_mode=test_mode,
523
+ box_style=box_style)
524
+
525
+ if output_path is None:
526
+ print("\nError: Failed to create output video")
527
+ return None
528
+
529
+ print(f"\nOutput saved to: {output_path}")
530
+ return output_path
531
+
532
+ def main():
533
+ """Process all videos in the inputs directory."""
534
+ parser = argparse.ArgumentParser(description='Detect objects in videos using Moondream2')
535
+ parser.add_argument('--test', action='store_true', help='Process only first 3 seconds of each video')
536
+ parser.add_argument('--preset', choices=FFMPEG_PRESETS, default='medium',
537
+ help='FFmpeg encoding preset (default: medium). Faster presets = lower quality')
538
+ parser.add_argument('--detect', type=str, default='face',
539
+ help='Object to detect in the video (default: face, use --detect "thing to detect" to override)')
540
+ parser.add_argument('--rows', type=int, default=1,
541
+ help='Number of rows to split each frame into (default: 1)')
542
+ parser.add_argument('--cols', type=int, default=1,
543
+ help='Number of columns to split each frame into (default: 1)')
544
+ parser.add_argument('--box-style', choices=['censor', 'yolo', 'hitmarker'], default='censor',
545
+ help='Style of detection visualization (default: censor)')
546
+ args = parser.parse_args()
547
+
548
+ input_dir = 'inputs'
549
+ os.makedirs(input_dir, exist_ok=True)
550
+ os.makedirs('outputs', exist_ok=True)
551
+
552
+ video_files = [f for f in os.listdir(input_dir)
553
+ if f.lower().endswith(('.mp4', '.avi', '.mov', '.mkv', '.webm'))]
554
+
555
+ if not video_files:
556
+ print("No video files found in 'inputs' directory")
557
+ return
558
+
559
+ print(f"Found {len(video_files)} videos to process")
560
+ print(f"Will detect: {args.detect}")
561
+ if args.test:
562
+ print("Running in test mode - processing only first 3 seconds of each video")
563
+ print(f"Using FFmpeg preset: {args.preset}")
564
+ print(f"Grid size: {args.rows}x{args.cols}")
565
+ print(f"Box style: {args.box_style}")
566
+
567
+ success_count = 0
568
+ for video_file in video_files:
569
+ video_path = os.path.join(input_dir, video_file)
570
+ output_path = process_video(video_path, args.detect, test_mode=args.test, ffmpeg_preset=args.preset,
571
+ rows=args.rows, cols=args.cols, box_style=args.box_style)
572
+ if output_path:
573
+ success_count += 1
574
+
575
+ print(f"\nProcessing complete. Successfully processed {success_count} out of {len(video_files)} videos.")
576
+
577
+ if __name__ == "__main__":
578
+ main()
packages.txt ADDED
@@ -0,0 +1,2 @@
1
+ libvips
2
+ ffmpeg
requirements.txt ADDED
@@ -0,0 +1,11 @@
1
+ gradio>=4.0.0
2
+ torch
3
+ transformers
4
+ opencv-python
5
+ pillow
6
+ numpy
7
+ tqdm
8
+ ffmpeg-python
9
+ einops
10
+ pyvips
11
+ accelerate