Jimmi42 committed
Commit 9c37045 · 1 Parent(s): d98e00b

Add Qwen2.5-Omni multimodal demo with working text, image, and audio processing

Files changed (3)
  1. README.md +118 -5
  2. app.py +398 -0
  3. requirements.txt +23 -0
README.md CHANGED
@@ -1,12 +1,125 @@
 ---
-title: Qwen2 5 Omni Multimodal Demo
-emoji: 🐠
-colorFrom: pink
-colorTo: yellow
+title: Qwen2.5-Omni Multimodal Demo
+emoji: 🤖
+colorFrom: blue
+colorTo: purple
 sdk: gradio
 sdk_version: 5.33.0
 app_file: app.py
 pinned: false
+license: apache-2.0
 ---
 
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+# 🤖 Qwen2.5-Omni Complete Multimodal Demo
+
+A comprehensive Gradio-based web interface for the **Qwen2.5-Omni-3B** multimodal AI model, showcasing advanced text, image, and audio understanding capabilities.
+
+## 🌟 Features
+
+### Core Capabilities
+- **💬 Text Conversations**: Natural language processing with customizable system prompts
+- **🖼️ Image Analysis**: Visual understanding and detailed image descriptions
+- **🎵 Audio Processing**: Speech recognition and audio content understanding
+- **🌟 Multimodal Chat**: Combined text, image, and audio input processing
+- **🧠 Memory Management**: Optimized resource usage with automatic cleanup
+- **⚡ Hardware Acceleration**: Support for Apple Silicon (MPS) and CPU fallback
+
+### Technical Features
+- **bfloat16 Precision**: Memory-efficient model loading
+- **Streaming Responses**: Real-time text generation
+- **Image Resizing**: Automatic image optimization to prevent memory issues (sketched below)
+- **Resource Cleanup**: Automatic cleanup on interruption
+- **Cross-Platform**: Works on Apple Silicon (MPS) and CPU
+
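+The resizing step mirrors what app.py does before a picture reaches the processor: anything over roughly one megapixel is downscaled in place. A minimal sketch of that logic (the 1024×1024 bound is the one app.py uses; the file name is just an example):
+
+```python
+from PIL import Image
+
+MAX_PIXELS = 1024 * 1024  # ~1-megapixel budget used by the demo
+
+def shrink_for_model(image: Image.Image) -> Image.Image:
+    """Downscale an image in place if it exceeds the pixel budget."""
+    if image.width * image.height > MAX_PIXELS:
+        # thumbnail() preserves aspect ratio and never upscales
+        image.thumbnail((1024, 1024), Image.Resampling.LANCZOS)
+    return image
+
+# img = shrink_for_model(Image.open("photo.png"))  # hypothetical file
+```
+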
+## 🚀 Quick Start
+
+1. **Load the Model**: Click "🔄 Load Model" to initialize Qwen2.5-Omni-3B
+2. **Choose Your Tab**: Select the appropriate tab for your use case
+3. **Start Exploring**: Experiment with different combinations of inputs!
+
+## 💡 Usage Examples
+
+### 💬 Text Chat
+Perfect for general conversations, coding help, and creative writing:
+- Ask questions about any topic
+- Get coding assistance
+- Creative writing and brainstorming
+- Educational content
+
+### 🖼️ Image Analysis
+Upload images and ask questions about them:
+- "What do you see in this image?"
+- "Describe the colors and composition"
+- "What's the mood or atmosphere?"
+- "Read any text visible in the image"
+
+### 🎵 Audio Processing
+Upload audio files for transcription and understanding:
+- Speech-to-text transcription
+- Audio content analysis
+- Language detection
+- Sentiment analysis of spoken content
+
+### 🌟 Multimodal Chat
+Combine multiple input types for richer interactions (a sketch of the underlying message format follows this list):
+- Upload an image + audio and ask comparative questions
+- Describe what you see and hear simultaneously
+- Create educational content with multiple media types
+- Accessibility applications
+
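+Internally, app.py represents each turn as a list of typed content parts and feeds the whole conversation to `processor.apply_chat_template`, which tokenizes the text and extracts image/audio features in one call. A sketch of one combined turn, assuming `processor` is already loaded as in app.py (the media paths are placeholders):
+
+```python
+conversation = [
+    {"role": "system", "content": [{"type": "text", "text": "You are Qwen, capable of understanding images, audio, and text."}]},
+    {"role": "user", "content": [
+        {"type": "text", "text": "What do you see, and what do you hear?"},
+        {"type": "image", "image": "/tmp/demo/temp_image.png"},  # placeholder path
+        {"type": "audio", "audio": "/tmp/demo/temp_audio.wav"},  # placeholder path
+    ]},
+]
+
+inputs = processor.apply_chat_template(
+    conversation,
+    add_generation_prompt=True,  # append the assistant-turn marker
+    tokenize=True,
+    return_dict=True,
+    return_tensors="pt",
+    padding=True,
+)
+```
+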
+## ⚙️ Configuration Options
+
+### Model Settings
+- **Temperature**: Controls creativity (0.1 = focused, 1.5 = creative)
+- **Max New Tokens**: Response length limit (50-1000)
+- **System Prompt**: Customize AI behavior and personality
+
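+These settings map one-to-one onto the `model.generate` call in app.py. A sketch, assuming `model`, `processor`, and `inputs` from the steps above:
+
+```python
+import torch
+
+with torch.no_grad():
+    generated_ids = model.generate(
+        **inputs,                  # tokenized prompt from the processor
+        max_new_tokens=500,        # "Max New Tokens" slider
+        temperature=0.7,           # "Temperature" slider
+        do_sample=True,            # sampling must be on for temperature to apply
+        pad_token_id=processor.tokenizer.eos_token_id,
+    )
+```
+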
+### Performance Tips
+1. **Images**: Use clear, well-lit images under 2MB for best results
+2. **Audio**: Clean audio without background noise works best
+3. **Text**: Be specific in your questions for better responses
+4. **Multimodal**: Combine different input types for richer interactions
+
+## 🔧 Technical Details
+
+### Model Information
+- **Base Model**: Qwen2.5-Omni-3B (3 billion parameters)
+- **Precision**: bfloat16 for memory efficiency
+- **Acceleration**: Apple Silicon MPS or CPU fallback
+- **Memory Usage**: ~6-8GB for optimal performance
+
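+Loading follows the pattern in app.py: bfloat16 weights from safetensors, SDPA attention, and an explicit move to MPS when it is available. A condensed sketch:
+
+```python
+import torch
+from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
+
+model_id = "Qwen/Qwen2.5-Omni-3B"
+device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
+
+processor = Qwen2_5OmniProcessor.from_pretrained(model_id, trust_remote_code=True)
+model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
+    model_id,
+    torch_dtype=torch.bfloat16,   # halves memory versus float32
+    low_cpu_mem_usage=True,
+    use_safetensors=True,
+    attn_implementation="sdpa",   # PyTorch scaled-dot-product attention
+    trust_remote_code=True,
+).to(device)
+```
+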
+### Supported Formats
+- **Images**: PNG, JPEG, WebP, and most common formats
+- **Audio**: WAV, MP3, M4A, and other common audio formats (see the conversion sketch below)
+- **Text**: UTF-8 text input with emoji support
+
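+Gradio delivers an audio upload to the app as a `(sample_rate, numpy_array)` pair, and app.py writes it to a temporary WAV with soundfile before the processor sees it. A minimal sketch of that conversion:
+
+```python
+import numpy as np
+import soundfile as sf
+
+def save_gradio_audio(audio, path="temp_audio.wav"):
+    """Persist a Gradio (sample_rate, samples) tuple as a WAV file."""
+    sample_rate, audio_data = audio
+    sf.write(path, audio_data, sample_rate)
+    return path
+
+# Example: one second of silence at 16 kHz (placeholder data)
+save_gradio_audio((16000, np.zeros(16000, dtype=np.float32)))
+```
+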
+## 🛠️ Known Limitations
+
+- **Audio Output**: No speech synthesis (input processing only; see the note below)
+- **Model Size**: Limited to the 3B-parameter model for optimal performance
+- **Processing Time**: CPU inference will be slower than MPS acceleration
+
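+The missing speech output is deliberate: app.py calls `model.disable_talker()` right after loading, which drops the audio-generation ("talker") module and leaves only the text decoder active:
+
+```python
+# As in app.py, assuming `model` is a loaded Qwen2_5OmniForConditionalGeneration:
+model.disable_talker()  # frees the speech-synthesis head; text generation is unaffected
+```
+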
+## 🤝 About This Demo
+
+This demo showcases the multimodal capabilities of Alibaba's Qwen2.5-Omni model, demonstrating how modern AI can understand and reason across different types of media. The interface is optimized for:
+
+- **Ease of Use**: Simple, intuitive interface for all users
+- **Performance**: Efficient memory management and fast responses
+- **Accessibility**: Cross-platform compatibility with graceful fallbacks
+- **Education**: Perfect for learning about multimodal AI capabilities
+
+## 📝 Credits
+
+- **Model**: [Qwen2.5-Omni-3B](https://huggingface.co/Qwen/Qwen2.5-Omni-3B) by Alibaba's Qwen Team
+- **Interface**: Built with [Gradio](https://gradio.app/)
+- **Optimization**: Apple Silicon MPS acceleration with CPU fallback
+
+## 🔗 Related Links
+
+- [Qwen2.5-Omni Model Card](https://huggingface.co/Qwen/Qwen2.5-Omni-3B)
+- [Transformers Library](https://huggingface.co/docs/transformers)
+- [Gradio Documentation](https://gradio.app/docs/)
+
+---
+
+**Try the demo above to experience the power of multimodal AI! 🚀**
app.py ADDED
@@ -0,0 +1,398 @@
+#!/usr/bin/env python3
+"""
+Qwen2.5-Omni Complete Multimodal Demo
+A comprehensive Gradio interface for the Qwen2.5-Omni-3B multimodal AI model
+Optimized for Apple Silicon (MPS) with efficient memory management
+"""
+
+import os
+import gc
+import sys
+import time
+import signal
+import warnings
+from typing import List, Dict, Any, Optional, Tuple, Union
+import tempfile
+import soundfile as sf
+
+# Suppress warnings for cleaner output
+warnings.filterwarnings("ignore", category=FutureWarning)
+warnings.filterwarnings("ignore", category=UserWarning)
+
+import torch
+import numpy as np
+import gradio as gr
+from PIL import Image
+
+# Global variables for model and processor
+model = None
+processor = None
+device = None
+
+def cleanup_resources():
+    """Clean up model and free memory"""
+    global model, processor
+
+    try:
+        if model is not None:
+            del model
+            model = None
+        if processor is not None:
+            del processor
+            processor = None
+
+        # Force garbage collection
+        gc.collect()
+
+        # Clear CUDA/MPS cache if available
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
+        elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
+            torch.mps.empty_cache()
+
+        print("✅ Resources cleaned up successfully")
+
+    except Exception as e:
+        print(f"⚠️ Warning during cleanup: {e}")
+
+def signal_handler(signum, frame):
+    """Handle interrupt signals gracefully"""
+    print("\n🛑 Interrupt received, cleaning up...")
+    cleanup_resources()
+    sys.exit(0)
+
+# Register signal handlers
+signal.signal(signal.SIGINT, signal_handler)
+signal.signal(signal.SIGTERM, signal_handler)
+
+def load_model():
+    """Load the Qwen2.5-Omni model and processor"""
+    global model, processor, device
+
+    if model is not None:
+        return "✅ Model already loaded!"
+
+    try:
+        # Check device
+        if torch.backends.mps.is_available():
+            device = torch.device("mps")
+            device_info = "🚀 Using Apple Silicon MPS acceleration"
+        else:
+            device = torch.device("cpu")
+            device_info = "⚠️ Using CPU (MPS not available)"
+
+        # Import the specific Qwen2.5-Omni classes
+        from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
+
+        # Load processor with optimizations
+        processor = Qwen2_5OmniProcessor.from_pretrained(
+            "Qwen/Qwen2.5-Omni-3B",
+            trust_remote_code=True,
+            use_fast=True  # Use fast tokenizer if available
+        )
+
+        # Load model with memory-efficient settings - keep bfloat16 for all functionalities
+        model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
+            "Qwen/Qwen2.5-Omni-3B",
+            torch_dtype=torch.bfloat16,
+            trust_remote_code=True,
+            device_map="auto" if device.type != "mps" else None,
+            low_cpu_mem_usage=True,
+            use_safetensors=True,
+            attn_implementation="sdpa"
+        )
+
+        # Immediately disable the audio generation module to prevent any initialization overhead
+        model.disable_talker()
+        print("🎤 Talker module disabled immediately after loading to optimize performance")
+
+        # Explicitly move to device for MPS while keeping bfloat16
+        if device.type == "mps":
+            model = model.to(device=device, dtype=torch.bfloat16)
+
+        print("🔧 Model loaded with dtype: bfloat16 (memory efficient)")
+
+        # Clear any cached memory after loading
+        gc.collect()
+        gc.collect()  # Run twice for good measure
+        if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
+            torch.mps.empty_cache()
+
+        return f"✅ Model loaded successfully!\n{device_info}\nDevice: {device}"
+
+    except Exception as e:
+        return f"❌ Error loading model: {str(e)}"
+
+def text_chat(message, history, system_prompt, temperature, max_tokens):
+    """Handle text-only conversations correctly."""
+    if model is None or processor is None:
+        history.append((message, "❌ Error: Model is not loaded. Please load the model first."))
+        return history, ""
+
+    if not message or not message.strip():
+        return history, ""
+
+    try:
+        conversation = []
+        if system_prompt and system_prompt.strip():
+            conversation.append({"role": "system", "content": [{"type": "text", "text": system_prompt}]})
+
+        # Correctly process history for the model
+        for user_msg, assistant_msg in history:
+            if user_msg:
+                conversation.append({"role": "user", "content": [{"type": "text", "text": user_msg}]})
+            if assistant_msg:
+                # Avoid adding error messages to the model's context
+                if not assistant_msg.startswith("❌ Error:"):
+                    conversation.append({"role": "assistant", "content": [{"type": "text", "text": assistant_msg}]})
+
+        conversation.append({"role": "user", "content": [{"type": "text", "text": message}]})
+
+        text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
+        inputs = processor(text=text, return_tensors="pt", padding=True).to(device)
+
+        with torch.no_grad():
+            generated_ids = model.generate(
+                **inputs,
+                max_new_tokens=max_tokens,
+                temperature=temperature,
+                do_sample=True,
+                pad_token_id=processor.tokenizer.eos_token_id
+            )
+
+        # Slice off the prompt tokens so only the new response is decoded
+        input_token_len = inputs["input_ids"].shape[1]
+        response_ids = generated_ids[:, input_token_len:]
+        response = processor.batch_decode(response_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+
+        history.append((message, response))
+        return history, ""
+
+    except Exception as e:
+        import traceback
+        traceback.print_exc()
+        error_message = f"❌ Error in text chat: {str(e)}"
+        history.append((message, error_message))
+        return history, ""
+
+def multimodal_chat(message, image, audio, history, system_prompt, temperature, max_tokens):
+    """
+    Handle multimodal conversations (text, image, and audio) using the correct
+    processor.apply_chat_template method as per the official documentation.
+    """
+    global model, processor, device
+    if model is None or processor is None:
+        history.append((message, "❌ Error: Model is not loaded. Please load the model first."))
+        return history, ""
+
+    if not message.strip() and image is None and audio is None:
+        history.append(("", "Please provide an input (text, image, or audio)."))
+        return history, ""
+
+    # --- Create a temporary directory for media files ---
+    temp_dir = tempfile.mkdtemp()
+
+    try:
+        # --- Build the conversation history in the required format ---
+        conversation = []
+        if system_prompt and system_prompt.strip():
+            conversation.append({"role": "system", "content": [{"type": "text", "text": system_prompt}]})
+
+        # Process Gradio history into the conversation format
+        for user_turn, bot_turn in history:
+            # For simplicity, we only process the text part of the history.
+            # A more robust solution would parse the [Image] and [Audio] tags
+            # and reconstruct the full multimodal history.
+            if user_turn:
+                conversation.append({"role": "user", "content": [{"type": "text", "text": user_turn.replace("[Image]", "").replace("[Audio]", "").strip()}]})
+            if bot_turn and not bot_turn.startswith("❌ Error:"):
+                conversation.append({"role": "assistant", "content": [{"type": "text", "text": bot_turn}]})
+
+        # --- Prepare the current user's turn ---
+        current_content = []
+        user_message_for_history = ""
+
+        # Process text
+        if message and message.strip():
+            current_content.append({"type": "text", "text": message})
+            user_message_for_history += message
+
+        # Process image
+        if image is not None:
+            # --- FIX: Resize large images to prevent OOM errors ---
+            MAX_PIXELS = 1024 * 1024  # 1 megapixel
+            if image.width * image.height > MAX_PIXELS:
+                image.thumbnail((1024, 1024), Image.Resampling.LANCZOS)
+
+            temp_image_path = os.path.join(temp_dir, "temp_image.png")
+            image.save(temp_image_path)
+            current_content.append({"type": "image", "image": temp_image_path})
+            user_message_for_history += " [Image]"
+
+        # Process audio
+        if audio is not None:
+            sample_rate, audio_data = audio
+            temp_audio_path = os.path.join(temp_dir, "temp_audio.wav")
+            sf.write(temp_audio_path, audio_data, sample_rate)
+            current_content.append({"type": "audio", "audio": temp_audio_path})
+            user_message_for_history += " [Audio]"
+
+        if not current_content:
+            history.append(("", "Please provide some input."))
+            return history, ""
+
+        conversation.append({"role": "user", "content": current_content})
+
+        # --- Use `apply_chat_template` as per the documentation ---
+        # This is the single, correct way to process all modalities.
+        inputs = processor.apply_chat_template(
+            conversation,
+            add_generation_prompt=True,
+            tokenize=True,
+            return_dict=True,
+            return_tensors="pt",
+            padding=True,
+        ).to(device)
+
+        # --- Generation ---
+        with torch.no_grad():
+            # Note: the model's generate function does not return audio directly in this setup;
+            # we are focusing on getting the text response right first.
+            generated_ids = model.generate(
+                **inputs,
+                max_new_tokens=max_tokens,
+                temperature=temperature,
+                do_sample=True,
+                pad_token_id=processor.tokenizer.eos_token_id,
+                # return_audio=False  # may be needed if audio output is enabled by default
+            )
+
+        # The generate call for the full Omni model might return a tuple (text_ids, audio_wav);
+        # handle both cases to be safe.
+        if isinstance(generated_ids, tuple):
+            response_ids = generated_ids[0]
+        else:
+            response_ids = generated_ids
+
+        input_token_len = inputs["input_ids"].shape[1]
+        response_ids_decoded = response_ids[:, input_token_len:]
+        response = processor.batch_decode(response_ids_decoded, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
+
+        history.append((user_message_for_history.strip(), response))
+        return history, ""
+
+    except Exception as e:
+        import traceback
+        error_message = f"❌ Multimodal chat error: {traceback.format_exc()}"
+        print(error_message)  # Print full traceback to console for debugging
+        history.append((message, f"❌ Error: {e}"))
+        return history, ""
+    finally:
+        # --- Clean up temporary files ---
+        if os.path.exists(temp_dir):
+            import shutil
+            shutil.rmtree(temp_dir)
+
+def clear_history():
+    """Clear chat history"""
+    return []
+
+def clear_model_cache():
+    """Clear model cache and free memory"""
+    global model, processor
+    try:
+        cleanup_resources()
+
+        # Clear additional caches
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
+        elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
+            torch.mps.empty_cache()
+
+        return "✅ Cache cleared successfully! Click 'Load Model' to reload."
+    except Exception as e:
+        return f"❌ Error clearing cache: {str(e)}"
+
+def create_interface():
+    """Create the complete Gradio interface with the fix."""
+    with gr.Blocks(title="Qwen2.5-Omni Multimodal Demo", theme=gr.themes.Soft()) as demo:
+        gr.Markdown("""
+        # 🤖 Qwen2.5-Omni Complete Multimodal Demo
+        A comprehensive and corrected Gradio interface for the Qwen2.5-Omni-3B model.
+        """)
+
+        with gr.Row():
+            with gr.Column(scale=2):
+                load_btn = gr.Button("🔄 Load Model", variant="primary")
+            with gr.Column(scale=2):
+                cache_clear_btn = gr.Button("🧹 Clear Cache", variant="secondary")
+            with gr.Column(scale=3):
+                model_status = gr.Textbox(label="Model Status", value="Model not loaded", interactive=False)
+
+        load_btn.click(load_model, outputs=model_status)
+        cache_clear_btn.click(clear_model_cache, outputs=model_status)
+
+        with gr.Tabs():
+            with gr.Tab("💬 Text Chat"):
+                text_chatbot = gr.Chatbot(label="Conversation", height=450)
+                with gr.Row():
+                    text_msg = gr.Textbox(label="Your message", placeholder="Type your message...", scale=4, container=False)
+                    text_send = gr.Button("Send", variant="primary", scale=1)
+                with gr.Row():
+                    text_clear = gr.Button("Clear History")
+                with gr.Accordion("Settings", open=False):
+                    text_system = gr.Textbox(label="System Prompt", value="You are a helpful AI assistant.")
+                    text_temp = gr.Slider(0.1, 1.5, value=0.7, label="Temperature")
+                    text_max_tokens = gr.Slider(50, 1000, value=500, label="Max New Tokens", step=50)
+
+                text_send.click(text_chat, inputs=[text_msg, text_chatbot, text_system, text_temp, text_max_tokens], outputs=[text_chatbot, text_msg])
+                text_msg.submit(text_chat, inputs=[text_msg, text_chatbot, text_system, text_temp, text_max_tokens], outputs=[text_chatbot, text_msg])
+                text_clear.click(clear_history, outputs=text_chatbot)
+
+            with gr.Tab("🌟 Multimodal Chat"):
+                multi_chatbot = gr.Chatbot(label="Multimodal Conversation", height=450)
+                multi_text = gr.Textbox(label="Text Message (optional)", placeholder="Describe what you want to know...", scale=4, container=False)
+                with gr.Row():
+                    multi_image = gr.Image(label="Upload Image (optional)", type="pil")
+                    multi_audio = gr.Audio(label="Upload Audio (optional)", type="numpy")
+                with gr.Row():
+                    multi_send = gr.Button("Send Multimodal Input", variant="primary")
+                    multi_clear = gr.Button("Clear History")
+                with gr.Accordion("Settings", open=False):
+                    multi_system = gr.Textbox(label="System Prompt", value="You are Qwen, capable of understanding images, audio, and text.")
+                    multi_temp = gr.Slider(0.1, 1.5, value=0.7, label="Temperature")
+                    multi_max_tokens = gr.Slider(50, 1000, value=500, label="Max New Tokens", step=50)
+
+                multi_send.click(multimodal_chat, inputs=[multi_text, multi_image, multi_audio, multi_chatbot, multi_system, multi_temp, multi_max_tokens], outputs=[multi_chatbot, multi_text])
+                multi_clear.click(clear_history, outputs=multi_chatbot)
+
+            with gr.Tab("ℹ️ Model Info"):
+                # Placeholder for model info content
+                gr.Markdown("Model information will be displayed here.")
+
+    return demo
+
+if __name__ == "__main__":
+    try:
+        os.environ["TOKENIZERS_PARALLELISM"] = "false"
+        os.environ["OMP_NUM_THREADS"] = "1"
+
+        demo = create_interface()
+
+        print("🚀 Starting Qwen2.5-Omni Gradio Demo...")
+        print("📋 Memory management optimizations enabled")
+        print("🔗 Access the interface at: http://localhost:7860")
+
+        demo.launch(
+            server_name="0.0.0.0",
+            server_port=7860,
+            share=False,
+            show_error=True,
+            quiet=False
+        )
+    except KeyboardInterrupt:
+        print("\n🛑 Shutting down gracefully...")
+        cleanup_resources()
+    except Exception as e:
+        print(f"❌ Error starting demo: {e}")
+        cleanup_resources()
requirements.txt ADDED
@@ -0,0 +1,23 @@
+# Core ML/AI libraries
+torch>=2.7.0
+transformers>=4.52.4
+accelerate>=1.7.0
+bitsandbytes>=0.42.0
+
+# Audio processing
+librosa>=0.11.0
+soundfile>=0.13.1
+pydub>=0.25.1
+
+# Web interface
+gradio>=5.33.0
+
+# Utilities
+numpy>=1.24.0
+pillow>=11.2.1
+pandas>=2.3.0
+
+# Additional dependencies for model support
+sentencepiece>=0.2.0
+safetensors>=0.5.0
+huggingface-hub>=0.32.0