**Open-source Omni-modal Foundation Model Supporting Text, Image, Video, and Audio Inputs as Well as Text and Audio Outputs**
Baichuan-Omni-1.5 🤗 | Baichuan-Omni-1.5-Base 🤗 | GitHub 📖 | Report 📖
OpenMM-Medical 🤗 | OpenAudioBench 🤗
## Baichuan-Omni-1.5
Baichuan-Omni-1.5 is the latest and best-performing model in the Baichuan-Omni series. It is trained and performs inference in an end-to-end manner. Compared with Baichuan-Omni, it delivers significant improvements in text/image/audio/video understanding and text/audio generation, and adds new features such as controllable real-time voice conversations and multimodal real-time interaction. The main features of Baichuan-Omni-1.5 include the following (a minimal usage sketch appears after the list):
- 🔥 **Possess Multimodal Understanding and Interaction Capabilities.**
Baichuan-Omni-1.5 accepts images, videos, text, and audio as input and generates high-quality text and speech output; it also **supports continuous video and audio streaming and real-time voice interaction with users**. On OmniBench, a comprehensive benchmark for omni-modal understanding, Baichuan-Omni-1.5 achieves top-tier results among open-source models and surpasses GPT-4o-mini.
- 💪 **Strong Visual Capability.**
Baichuan-Omni-1.5 scores an average of 73.3 on the OpenCompass leaderboard (an average over 10 mainstream multimodal evaluation benchmarks). **At a size of only 7B, it surpasses mainstream commercial closed-source multimodal models such as GPT-4o-mini, Gemini 1.5 Pro, and Claude 3.5 Sonnet in single-image understanding**. Its video understanding performance also exceeds that of GPT-4V, Claude 3.5 Sonnet, and open-source omni-modal models.
- 🚀 **Leading Medical Image Understanding Capabilities.**
Baichuan-Omni-1.5 achieves the best performance on GMAI-MMBench and OpenMM-Medical. Using only a 7B LLM, its average score exceeds that of Qwen2-VL-72B by about 3 points (83.8% vs. 80.7%).
- 🎙 **Excellent Voice Capabilities.**
Baichuan-Omni-1.5 **supports high-quality, controllable, real-time bilingual (Chinese and English) voice conversations**. It **outperforms GPT-4o-realtime** on speech understanding tasks (e.g., ASR and STT) and achieves **the highest speech generation performance among open-source models** in both semantic and acoustic evaluations of voice conversations.
- 🎬 **Powerful Real-world Understanding and Other Features.**
Baichuan-Omni-1.5 further optimizes many of the visual understanding capabilities of Baichuan-Omni. It can process images of any aspect ratio at up to 1.8 million pixels (e.g., 1344x1344); a hypothetical resizing helper illustrating this pixel budget is shown after this list. It scores 68.8 on RealWorldQA, **surpassing commercial closed-source models such as GPT-4o-mini** as well as recently open-sourced omni-modal models. It scores 85.6/83.6 on the English/Chinese evaluation subsets of MMBench, respectively, placing it in the first tier of models of the same size.
- 💫 **Provides [🤗 Base Model](https://huggingface.co/baichuan-inc/Baichuan-Omni-1d5-Base) and [🤗 Instruct Model](https://huggingface.co/baichuan-inc/Baichuan-Omni-1d5).**
Baichuan-Omni-1.5-Base is a high-performing omni-modal foundation model. Building on this base, Baichuan-Omni-1.5 is trained end-to-end on high-quality omni-modal alignment and multimodal instruction data.
## Model Architecture