Jimmi42
commited on
Commit
Β·
9c37045
1
Parent(s):
d98e00b
Add Qwen2.5-Omni multimodal demo with working text, image, and audio processing
Browse files- README.md +118 -5
- app.py +398 -0
- requirements.txt +23 -0
README.md
CHANGED
@@ -1,12 +1,125 @@
|
|
1 |
---
|
2 |
-
title: Qwen2
|
3 |
-
emoji:
|
4 |
-
colorFrom:
|
5 |
-
colorTo:
|
6 |
sdk: gradio
|
7 |
sdk_version: 5.33.0
|
8 |
app_file: app.py
|
9 |
pinned: false
|
|
|
10 |
---
|
11 |
|
12 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
---
|
2 |
+
title: Qwen2.5-Omni Multimodal Demo
|
3 |
+
emoji: π€
|
4 |
+
colorFrom: blue
|
5 |
+
colorTo: purple
|
6 |
sdk: gradio
|
7 |
sdk_version: 5.33.0
|
8 |
app_file: app.py
|
9 |
pinned: false
|
10 |
+
license: apache-2.0
|
11 |
---
|
12 |
|
13 |
+
# π€ Qwen2.5-Omni Complete Multimodal Demo
|
14 |
+
|
15 |
+
A comprehensive Gradio-based web interface for the **Qwen2.5-Omni-3B** multimodal AI model, showcasing advanced text, image, and audio understanding capabilities.
|
16 |
+
|
17 |
+
## π Features
|
18 |
+
|
19 |
+
### Core Capabilities
|
20 |
+
- **π¬ Text Conversations**: Natural language processing with customizable system prompts
|
21 |
+
- **πΌοΈ Image Analysis**: Visual understanding and detailed image descriptions
|
22 |
+
- **π΅ Audio Processing**: Speech recognition and audio content understanding
|
23 |
+
- **π Multimodal Chat**: Combined text, image, and audio input processing
|
24 |
+
- **π§ Memory Management**: Optimized resource usage with automatic cleanup
|
25 |
+
- **β‘ Hardware Acceleration**: Support for Apple Silicon (MPS) and CPU fallback
|
26 |
+
|
27 |
+
### Technical Features
|
28 |
+
- **bfloat16 Precision**: Memory-efficient model loading
|
29 |
+
- **Streaming Responses**: Real-time text generation
|
30 |
+
- **Image Resizing**: Automatic image optimization to prevent memory issues
|
31 |
+
- **Resource Cleanup**: Automatic cleanup on interruption
|
32 |
+
- **Cross-Platform**: Works on Apple Silicon (MPS) and CPU
|
33 |
+
|
34 |
+
## π Quick Start
|
35 |
+
|
36 |
+
1. **Load the Model**: Click "π Load Model" to initialize Qwen2.5-Omni-3B
|
37 |
+
2. **Choose Your Tab**: Select the appropriate tab for your use case
|
38 |
+
3. **Start Exploring**: Experiment with different combinations of inputs!
|
39 |
+
|
40 |
+
## π‘ Usage Examples
|
41 |
+
|
42 |
+
### π¬ Text Chat
|
43 |
+
Perfect for general conversations, coding help, and creative writing:
|
44 |
+
- Ask questions about any topic
|
45 |
+
- Get coding assistance
|
46 |
+
- Creative writing and brainstorming
|
47 |
+
- Educational content
|
48 |
+
|
49 |
+
### πΌοΈ Image Analysis
|
50 |
+
Upload images and ask questions about them:
|
51 |
+
- "What do you see in this image?"
|
52 |
+
- "Describe the colors and composition"
|
53 |
+
- "What's the mood or atmosphere?"
|
54 |
+
- "Read any text visible in the image"
|
55 |
+
|
56 |
+
### π΅ Audio Processing
|
57 |
+
Upload audio files for transcription and understanding:
|
58 |
+
- Speech-to-text transcription
|
59 |
+
- Audio content analysis
|
60 |
+
- Language detection
|
61 |
+
- Sentiment analysis of spoken content
|
62 |
+
|
63 |
+
### π Multimodal Chat
|
64 |
+
Combine multiple input types for richer interactions:
|
65 |
+
- Upload an image + audio and ask comparative questions
|
66 |
+
- Describe what you see and hear simultaneously
|
67 |
+
- Create educational content with multiple media types
|
68 |
+
- Accessibility applications
|
69 |
+
|
70 |
+
## βοΈ Configuration Options
|
71 |
+
|
72 |
+
### Model Settings
|
73 |
+
- **Temperature**: Controls creativity (0.1 = focused, 2.0 = creative)
|
74 |
+
- **Max Tokens**: Response length limit (10-500)
|
75 |
+
- **System Prompt**: Customize AI behavior and personality
|
76 |
+
|
77 |
+
### Performance Tips
|
78 |
+
1. **Images**: Use clear, well-lit images under 2MB for best results
|
79 |
+
2. **Audio**: Clean audio without background noise works best
|
80 |
+
3. **Text**: Be specific in your questions for better responses
|
81 |
+
4. **Multimodal**: Combine different input types for richer interactions
|
82 |
+
|
83 |
+
## π§ Technical Details
|
84 |
+
|
85 |
+
### Model Information
|
86 |
+
- **Base Model**: Qwen2.5-Omni-3B (3 Billion parameters)
|
87 |
+
- **Precision**: bfloat16 for memory efficiency
|
88 |
+
- **Acceleration**: Apple Silicon MPS or CPU fallback
|
89 |
+
- **Memory Usage**: ~6-8GB for optimal performance
|
90 |
+
|
91 |
+
### Supported Formats
|
92 |
+
- **Images**: PNG, JPEG, WebP, and most common formats
|
93 |
+
- **Audio**: WAV, MP3, M4A, and other common audio formats
|
94 |
+
- **Text**: UTF-8 text input with emoji support
|
95 |
+
|
96 |
+
## π οΈ Known Limitations
|
97 |
+
|
98 |
+
- **Audio Output**: No speech synthesis (input processing only)
|
99 |
+
- **Model Size**: Limited to 3B parameter model for optimal performance
|
100 |
+
- **Processing Time**: CPU inference will be slower than MPS acceleration
|
101 |
+
|
102 |
+
## π€ About This Demo
|
103 |
+
|
104 |
+
This demo showcases the multimodal capabilities of Alibaba's Qwen2.5-Omni model, demonstrating how modern AI can understand and reason across different types of media. The interface is optimized for:
|
105 |
+
|
106 |
+
- **Ease of Use**: Simple, intuitive interface for all users
|
107 |
+
- **Performance**: Efficient memory management and fast responses
|
108 |
+
- **Accessibility**: Cross-platform compatibility with graceful fallbacks
|
109 |
+
- **Education**: Perfect for learning about multimodal AI capabilities
|
110 |
+
|
111 |
+
## π Credits
|
112 |
+
|
113 |
+
- **Model**: [Qwen2.5-Omni-3B](https://huggingface.co/Qwen/Qwen2.5-Omni-3B) by Alibaba's Qwen Team
|
114 |
+
- **Interface**: Built with [Gradio](https://gradio.app/)
|
115 |
+
- **Optimization**: Apple Silicon MPS acceleration with CPU fallback
|
116 |
+
|
117 |
+
## π Related Links
|
118 |
+
|
119 |
+
- [Qwen2.5-Omni Model Card](https://huggingface.co/Qwen/Qwen2.5-Omni-3B)
|
120 |
+
- [Transformers Library](https://huggingface.co/docs/transformers)
|
121 |
+
- [Gradio Documentation](https://gradio.app/docs/)
|
122 |
+
|
123 |
+
---
|
124 |
+
|
125 |
+
**Try the demo above to experience the power of multimodal AI! π**
|
app.py
ADDED
@@ -0,0 +1,398 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
#!/usr/bin/env python3
|
2 |
+
"""
|
3 |
+
Qwen2.5-Omni Complete Multimodal Demo
|
4 |
+
A comprehensive Gradio interface for the Qwen2.5-Omni-3B multimodal AI model
|
5 |
+
Optimized for Apple Silicon (MPS) with efficient memory management
|
6 |
+
"""
|
7 |
+
|
8 |
+
import os
|
9 |
+
import gc
|
10 |
+
import sys
|
11 |
+
import time
|
12 |
+
import signal
|
13 |
+
import warnings
|
14 |
+
from typing import List, Dict, Any, Optional, Tuple, Union
|
15 |
+
import tempfile
|
16 |
+
import soundfile as sf
|
17 |
+
|
18 |
+
# Suppress warnings for cleaner output
|
19 |
+
warnings.filterwarnings("ignore", category=FutureWarning)
|
20 |
+
warnings.filterwarnings("ignore", category=UserWarning)
|
21 |
+
|
22 |
+
import torch
|
23 |
+
import numpy as np
|
24 |
+
import gradio as gr
|
25 |
+
from PIL import Image
|
26 |
+
|
27 |
+
# Global variables for model and processor
|
28 |
+
model = None
|
29 |
+
processor = None
|
30 |
+
device = None
|
31 |
+
|
32 |
+
def cleanup_resources():
|
33 |
+
"""Clean up model and free memory"""
|
34 |
+
global model, processor
|
35 |
+
|
36 |
+
try:
|
37 |
+
if model is not None:
|
38 |
+
del model
|
39 |
+
model = None
|
40 |
+
if processor is not None:
|
41 |
+
del processor
|
42 |
+
processor = None
|
43 |
+
|
44 |
+
# Force garbage collection
|
45 |
+
gc.collect()
|
46 |
+
|
47 |
+
# Clear CUDA/MPS cache if available
|
48 |
+
if torch.cuda.is_available():
|
49 |
+
torch.cuda.empty_cache()
|
50 |
+
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
|
51 |
+
torch.mps.empty_cache()
|
52 |
+
|
53 |
+
print("β
Resources cleaned up successfully")
|
54 |
+
|
55 |
+
except Exception as e:
|
56 |
+
print(f"β οΈ Warning during cleanup: {e}")
|
57 |
+
|
58 |
+
def signal_handler(signum, frame):
|
59 |
+
"""Handle interrupt signals gracefully"""
|
60 |
+
print("\nπ Interrupt received, cleaning up...")
|
61 |
+
cleanup_resources()
|
62 |
+
sys.exit(0)
|
63 |
+
|
64 |
+
# Register signal handlers
|
65 |
+
signal.signal(signal.SIGINT, signal_handler)
|
66 |
+
signal.signal(signal.SIGTERM, signal_handler)
|
67 |
+
|
68 |
+
def load_model():
|
69 |
+
"""Load the Qwen2.5-Omni model and processor"""
|
70 |
+
global model, processor, device
|
71 |
+
|
72 |
+
if model is not None:
|
73 |
+
return "β
Model already loaded!"
|
74 |
+
|
75 |
+
try:
|
76 |
+
# Check device
|
77 |
+
if torch.backends.mps.is_available():
|
78 |
+
device = torch.device("mps")
|
79 |
+
device_info = "π Using Apple Silicon MPS acceleration"
|
80 |
+
else:
|
81 |
+
device = torch.device("cpu")
|
82 |
+
device_info = "β οΈ Using CPU (MPS not available)"
|
83 |
+
|
84 |
+
# Import the specific Qwen2.5-Omni classes
|
85 |
+
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
|
86 |
+
|
87 |
+
# Load processor with optimizations
|
88 |
+
processor = Qwen2_5OmniProcessor.from_pretrained(
|
89 |
+
"Qwen/Qwen2.5-Omni-3B",
|
90 |
+
trust_remote_code=True,
|
91 |
+
use_fast=True # Use fast tokenizer if available
|
92 |
+
)
|
93 |
+
|
94 |
+
# Load model with memory-efficient settings - keep bfloat16 for all functionalities
|
95 |
+
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
|
96 |
+
"Qwen/Qwen2.5-Omni-3B",
|
97 |
+
torch_dtype=torch.bfloat16,
|
98 |
+
trust_remote_code=True,
|
99 |
+
device_map="auto" if device.type != "mps" else None,
|
100 |
+
low_cpu_mem_usage=True,
|
101 |
+
use_safetensors=True,
|
102 |
+
attn_implementation="sdpa"
|
103 |
+
)
|
104 |
+
|
105 |
+
# Immediately disable the audio generation module to prevent any initialization overhead
|
106 |
+
model.disable_talker()
|
107 |
+
print("π€ Talker module disabled immediately after loading to optimize performance")
|
108 |
+
|
109 |
+
# Explicitly move to device for MPS while keeping bfloat16
|
110 |
+
if device.type == "mps":
|
111 |
+
model = model.to(device=device, dtype=torch.bfloat16)
|
112 |
+
|
113 |
+
print(f"π§ Model loaded with dtype: bfloat16 (memory efficient)")
|
114 |
+
|
115 |
+
# Clear any cached memory after loading
|
116 |
+
gc.collect()
|
117 |
+
gc.collect() # Run twice for good measure
|
118 |
+
if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
|
119 |
+
torch.mps.empty_cache()
|
120 |
+
|
121 |
+
return f"β
Model loaded successfully!\n{device_info}\nDevice: {device}"
|
122 |
+
|
123 |
+
except Exception as e:
|
124 |
+
return f"β Error loading model: {str(e)}"
|
125 |
+
|
126 |
+
def text_chat(message, history, system_prompt, temperature, max_tokens):
|
127 |
+
"""Handle text-only conversations correctly."""
|
128 |
+
if model is None or processor is None:
|
129 |
+
history.append((message, "β Error: Model is not loaded. Please load the model first."))
|
130 |
+
return history, ""
|
131 |
+
|
132 |
+
if not message or not message.strip():
|
133 |
+
return history, ""
|
134 |
+
|
135 |
+
try:
|
136 |
+
conversation = []
|
137 |
+
if system_prompt and system_prompt.strip():
|
138 |
+
conversation.append({"role": "system", "content": [{"type": "text", "text": system_prompt}]})
|
139 |
+
|
140 |
+
# Correctly process history for the model
|
141 |
+
for user_msg, assistant_msg in history:
|
142 |
+
if user_msg:
|
143 |
+
conversation.append({"role": "user", "content": [{"type": "text", "text": user_msg}]})
|
144 |
+
if assistant_msg:
|
145 |
+
# Avoid adding error messages to the model's context
|
146 |
+
if not assistant_msg.startswith("οΏ½οΏ½ Error:"):
|
147 |
+
conversation.append({"role": "assistant", "content": [{"type": "text", "text": assistant_msg}]})
|
148 |
+
|
149 |
+
conversation.append({"role": "user", "content": [{"type": "text", "text": message}]})
|
150 |
+
|
151 |
+
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
|
152 |
+
inputs = processor(text=text, return_tensors="pt", padding=True).to(device)
|
153 |
+
|
154 |
+
with torch.no_grad():
|
155 |
+
generated_ids = model.generate(
|
156 |
+
**inputs,
|
157 |
+
max_new_tokens=max_tokens,
|
158 |
+
temperature=temperature,
|
159 |
+
do_sample=True,
|
160 |
+
pad_token_id=processor.tokenizer.eos_token_id
|
161 |
+
)
|
162 |
+
|
163 |
+
input_token_len = inputs["input_ids"].shape[1]
|
164 |
+
response_ids = generated_ids[:, input_token_len:]
|
165 |
+
response = processor.batch_decode(response_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
|
166 |
+
|
167 |
+
history.append((message, response))
|
168 |
+
return history, ""
|
169 |
+
|
170 |
+
except Exception as e:
|
171 |
+
import traceback
|
172 |
+
traceback.print_exc()
|
173 |
+
error_message = f"β Error in text chat: {str(e)}"
|
174 |
+
history.append((message, error_message))
|
175 |
+
return history, ""
|
176 |
+
|
177 |
+
def multimodal_chat(message, image, audio, history, system_prompt, temperature, max_tokens):
|
178 |
+
"""
|
179 |
+
Handle multimodal conversations (text, image, and audio) using the correct
|
180 |
+
processor.apply_chat_template method as per the official documentation.
|
181 |
+
"""
|
182 |
+
global model, processor, device
|
183 |
+
if model is None or processor is None:
|
184 |
+
history.append((message, "β Error: Model is not loaded. Please load the model first."))
|
185 |
+
return history, ""
|
186 |
+
|
187 |
+
if not message.strip() and image is None and audio is None:
|
188 |
+
history.append(("", "Please provide an input (text, image, or audio)."))
|
189 |
+
return history, ""
|
190 |
+
|
191 |
+
# --- Create a temporary directory for media files ---
|
192 |
+
temp_dir = tempfile.mkdtemp()
|
193 |
+
|
194 |
+
try:
|
195 |
+
# --- Build the conversation history in the required format ---
|
196 |
+
conversation = []
|
197 |
+
if system_prompt and system_prompt.strip():
|
198 |
+
conversation.append({"role": "system", "content": [{"type": "text", "text": system_prompt}]})
|
199 |
+
|
200 |
+
# Process Gradio history into the conversation format
|
201 |
+
for user_turn, bot_turn in history:
|
202 |
+
# For simplicity, we only process the text part of the history.
|
203 |
+
# A more robust solution would parse the [Image] and [Audio] tags
|
204 |
+
# and reconstruct the full multimodal history.
|
205 |
+
if user_turn:
|
206 |
+
conversation.append({"role": "user", "content": [{"type": "text", "text": user_turn.replace("[Image]", "").replace("[Audio]", "").strip()}]})
|
207 |
+
if bot_turn and not bot_turn.startswith("β Error:"):
|
208 |
+
conversation.append({"role": "assistant", "content": [{"type": "text", "text": bot_turn}]})
|
209 |
+
|
210 |
+
|
211 |
+
# --- Prepare the current user's turn ---
|
212 |
+
current_content = []
|
213 |
+
user_message_for_history = ""
|
214 |
+
|
215 |
+
# Process text
|
216 |
+
if message and message.strip():
|
217 |
+
current_content.append({"type": "text", "text": message})
|
218 |
+
user_message_for_history += message
|
219 |
+
|
220 |
+
# Process image
|
221 |
+
if image is not None:
|
222 |
+
# --- FIX: Resize large images to prevent OOM errors ---
|
223 |
+
MAX_PIXELS = 1024 * 1024 # 1 megapixel
|
224 |
+
if image.width * image.height > MAX_PIXELS:
|
225 |
+
image.thumbnail((1024, 1024), Image.Resampling.LANCZOS)
|
226 |
+
|
227 |
+
temp_image_path = os.path.join(temp_dir, "temp_image.png")
|
228 |
+
image.save(temp_image_path)
|
229 |
+
current_content.append({"type": "image", "image": temp_image_path})
|
230 |
+
user_message_for_history += " [Image]"
|
231 |
+
|
232 |
+
# Process audio
|
233 |
+
if audio is not None:
|
234 |
+
sample_rate, audio_data = audio
|
235 |
+
temp_audio_path = os.path.join(temp_dir, "temp_audio.wav")
|
236 |
+
sf.write(temp_audio_path, audio_data, sample_rate)
|
237 |
+
current_content.append({"type": "audio", "audio": temp_audio_path})
|
238 |
+
user_message_for_history += " [Audio]"
|
239 |
+
|
240 |
+
if not current_content:
|
241 |
+
history.append(("", "Please provide some input."))
|
242 |
+
return history, ""
|
243 |
+
|
244 |
+
conversation.append({"role": "user", "content": current_content})
|
245 |
+
|
246 |
+
# --- Use `apply_chat_template` as per the documentation ---
|
247 |
+
# This is the single, correct way to process all modalities.
|
248 |
+
inputs = processor.apply_chat_template(
|
249 |
+
conversation,
|
250 |
+
add_generation_prompt=True,
|
251 |
+
tokenize=True,
|
252 |
+
return_dict=True,
|
253 |
+
return_tensors="pt",
|
254 |
+
padding=True,
|
255 |
+
).to(device)
|
256 |
+
|
257 |
+
# --- Generation ---
|
258 |
+
with torch.no_grad():
|
259 |
+
# Note: The model's generate function does not return audio directly in this setup
|
260 |
+
# We are focusing on getting the text response right first.
|
261 |
+
generated_ids = model.generate(
|
262 |
+
**inputs,
|
263 |
+
max_new_tokens=max_tokens,
|
264 |
+
temperature=temperature,
|
265 |
+
do_sample=True,
|
266 |
+
pad_token_id=processor.tokenizer.eos_token_id,
|
267 |
+
# return_audio=False # This might be needed if audio output is enabled by default
|
268 |
+
)
|
269 |
+
|
270 |
+
# The generate call for the full Omni model might return a tuple (text_ids, audio_wav)
|
271 |
+
# We handle both cases to be safe.
|
272 |
+
if isinstance(generated_ids, tuple):
|
273 |
+
response_ids = generated_ids[0]
|
274 |
+
else:
|
275 |
+
response_ids = generated_ids
|
276 |
+
|
277 |
+
input_token_len = inputs["input_ids"].shape[1]
|
278 |
+
response_ids_decoded = response_ids[:, input_token_len:]
|
279 |
+
response = processor.batch_decode(response_ids_decoded, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
|
280 |
+
|
281 |
+
history.append((user_message_for_history.strip(), response))
|
282 |
+
return history, ""
|
283 |
+
|
284 |
+
except Exception as e:
|
285 |
+
import traceback
|
286 |
+
error_message = f"β Multimodal chat error: {traceback.format_exc()}"
|
287 |
+
print(error_message) # Print full traceback to console for debugging
|
288 |
+
history.append((message, f"β Error: {e}"))
|
289 |
+
return history, ""
|
290 |
+
finally:
|
291 |
+
# --- Clean up temporary files ---
|
292 |
+
if os.path.exists(temp_dir):
|
293 |
+
import shutil
|
294 |
+
shutil.rmtree(temp_dir)
|
295 |
+
|
296 |
+
def clear_history():
|
297 |
+
"""Clear chat history"""
|
298 |
+
return []
|
299 |
+
|
300 |
+
def clear_model_cache():
|
301 |
+
"""Clear model cache and free memory"""
|
302 |
+
global model, processor
|
303 |
+
try:
|
304 |
+
cleanup_resources()
|
305 |
+
|
306 |
+
# Clear additional caches
|
307 |
+
if torch.cuda.is_available():
|
308 |
+
torch.cuda.empty_cache()
|
309 |
+
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
|
310 |
+
torch.mps.empty_cache()
|
311 |
+
|
312 |
+
return "β
Cache cleared successfully! Click 'Load Model' to reload."
|
313 |
+
except Exception as e:
|
314 |
+
return f"β Error clearing cache: {str(e)}"
|
315 |
+
|
316 |
+
def create_interface():
|
317 |
+
"""Create the complete Gradio interface with the fix."""
|
318 |
+
with gr.Blocks(title="Qwen2.5-Omni Multimodal Demo", theme=gr.themes.Soft()) as demo:
|
319 |
+
gr.Markdown("""
|
320 |
+
# π€ Qwen2.5-Omni Complete Multimodal Demo
|
321 |
+
A comprehensive and corrected Gradio interface for the Qwen2.5-Omni-3B model.
|
322 |
+
""")
|
323 |
+
|
324 |
+
with gr.Row():
|
325 |
+
with gr.Column(scale=2):
|
326 |
+
load_btn = gr.Button("π Load Model", variant="primary")
|
327 |
+
with gr.Column(scale=2):
|
328 |
+
cache_clear_btn = gr.Button("π§Ή Clear Cache", variant="secondary")
|
329 |
+
with gr.Column(scale=3):
|
330 |
+
model_status = gr.Textbox(label="Model Status", value="Model not loaded", interactive=False)
|
331 |
+
|
332 |
+
load_btn.click(load_model, outputs=model_status)
|
333 |
+
cache_clear_btn.click(clear_model_cache, outputs=model_status)
|
334 |
+
|
335 |
+
with gr.Tabs():
|
336 |
+
with gr.Tab("π¬ Text Chat"):
|
337 |
+
text_chatbot = gr.Chatbot(label="Conversation", height=450)
|
338 |
+
with gr.Row():
|
339 |
+
text_msg = gr.Textbox(label="Your message", placeholder="Type your message...", scale=4, container=False)
|
340 |
+
text_send = gr.Button("Send", variant="primary", scale=1)
|
341 |
+
with gr.Row():
|
342 |
+
text_clear = gr.Button("Clear History")
|
343 |
+
with gr.Accordion("Settings", open=False):
|
344 |
+
text_system = gr.Textbox(label="System Prompt", value="You are a helpful AI assistant.")
|
345 |
+
text_temp = gr.Slider(0.1, 1.5, value=0.7, label="Temperature")
|
346 |
+
text_max_tokens = gr.Slider(50, 1000, value=500, label="Max New Tokens", step=50)
|
347 |
+
|
348 |
+
text_send.click(text_chat, inputs=[text_msg, text_chatbot, text_system, text_temp, text_max_tokens], outputs=[text_chatbot, text_msg])
|
349 |
+
text_msg.submit(text_chat, inputs=[text_msg, text_chatbot, text_system, text_temp, text_max_tokens], outputs=[text_chatbot, text_msg])
|
350 |
+
text_clear.click(clear_history, outputs=text_chatbot)
|
351 |
+
|
352 |
+
with gr.Tab("π Multimodal Chat"):
|
353 |
+
multi_chatbot = gr.Chatbot(label="Multimodal Conversation", height=450)
|
354 |
+
multi_text = gr.Textbox(label="Text Message (optional)", placeholder="Describe what you want to know...", scale=4, container=False)
|
355 |
+
with gr.Row():
|
356 |
+
multi_image = gr.Image(label="Upload Image (optional)", type="pil")
|
357 |
+
multi_audio = gr.Audio(label="Upload Audio (optional)", type="numpy")
|
358 |
+
with gr.Row():
|
359 |
+
multi_send = gr.Button("Send Multimodal Input", variant="primary")
|
360 |
+
multi_clear = gr.Button("Clear History")
|
361 |
+
with gr.Accordion("Settings", open=False):
|
362 |
+
multi_system = gr.Textbox(label="System Prompt", value="You are Qwen, capable of understanding images, audio, and text.")
|
363 |
+
multi_temp = gr.Slider(0.1, 1.5, value=0.7, label="Temperature")
|
364 |
+
multi_max_tokens = gr.Slider(50, 1000, value=500, label="Max New Tokens", step=50)
|
365 |
+
|
366 |
+
multi_send.click(multimodal_chat, inputs=[multi_text, multi_image, multi_audio, multi_chatbot, multi_system, multi_temp, multi_max_tokens], outputs=[multi_chatbot, multi_text])
|
367 |
+
multi_clear.click(clear_history, outputs=multi_chatbot)
|
368 |
+
|
369 |
+
with gr.Tab("βΉοΈ Model Info"):
|
370 |
+
# Placeholder for model info content
|
371 |
+
gr.Markdown("Model information will be displayed here.")
|
372 |
+
|
373 |
+
return demo
|
374 |
+
|
375 |
+
if __name__ == "__main__":
|
376 |
+
try:
|
377 |
+
os.environ["TOKENIZERS_PARALLELISM"] = "false"
|
378 |
+
os.environ["OMP_NUM_THREADS"] = "1"
|
379 |
+
|
380 |
+
demo = create_interface()
|
381 |
+
|
382 |
+
print("π Starting Qwen2.5-Omni Gradio Demo...")
|
383 |
+
print("π Memory management optimizations enabled")
|
384 |
+
print("π Access the interface at: http://localhost:7860")
|
385 |
+
|
386 |
+
demo.launch(
|
387 |
+
server_name="0.0.0.0",
|
388 |
+
server_port=7860,
|
389 |
+
share=False,
|
390 |
+
show_error=True,
|
391 |
+
quiet=False
|
392 |
+
)
|
393 |
+
except KeyboardInterrupt:
|
394 |
+
print("\nπ Shutting down gracefully...")
|
395 |
+
cleanup_resources()
|
396 |
+
except Exception as e:
|
397 |
+
print(f"β Error starting demo: {e}")
|
398 |
+
cleanup_resources()
|
requirements.txt
ADDED
@@ -0,0 +1,23 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
# Core ML/AI libraries
|
2 |
+
torch>=2.7.0
|
3 |
+
transformers>=4.52.4
|
4 |
+
accelerate>=1.7.0
|
5 |
+
bitsandbytes>=0.42.0
|
6 |
+
|
7 |
+
# Audio processing
|
8 |
+
librosa>=0.11.0
|
9 |
+
soundfile>=0.13.1
|
10 |
+
pydub>=0.25.1
|
11 |
+
|
12 |
+
# Web interface
|
13 |
+
gradio>=5.33.0
|
14 |
+
|
15 |
+
# Utilities
|
16 |
+
numpy>=1.24.0
|
17 |
+
pillow>=11.2.1
|
18 |
+
pandas>=2.3.0
|
19 |
+
|
20 |
+
# Additional dependencies for model support
|
21 |
+
sentencepiece>=0.2.0
|
22 |
+
safetensors>=0.5.0
|
23 |
+
huggingface-hub>=0.32.0
|