---
license: apache-2.0
pipeline_tag: any-to-any
---

**Open-source Omni-modal Foundation Model Supporting Text, Image, Video, and Audio Inputs as Well as Text and Audio Outputs**

Baichuan-Omni-1.5 🤗 | Baichuan-Omni-1.5-Base 🤗 | GitHub 📖 | Report 📖

OpenMM-Medical 🤗 | OpenAudioBench 🤗

## Baichuan-Omni-1.5

Baichuan-Omni-1.5 is the latest and best-performing model in the Baichuan-omni series. It is trained and served end to end. Compared with Baichuan-omni, it significantly improves text, image, audio, and video understanding as well as text and audio generation, and it adds new features such as controllable real-time voice conversation and multimodal real-time interaction. The main features of Baichuan-Omni-1.5 include:

- 🔥 **Multimodal Understanding and Interaction Capabilities.** Baichuan-Omni-1.5 accepts images, videos, text, and audio as input and generates high-quality text and voice output. It also **supports continuous video and audio streaming and real-time voice interaction with users**. On OmniBench, a comprehensive benchmark for omni-modal understanding, Baichuan-Omni-1.5 ranks among the top open-source models and surpasses GPT-4o-mini.
- 💪 **Strong Visual Capability.** Baichuan-Omni-1.5 achieves an average score of 73.3 on the OpenCompass list, an aggregate of 10 mainstream multimodal evaluation benchmarks. **At a size of 7B, it surpasses mainstream commercial closed-source multimodal large models such as GPT-4o-mini, Gemini 1.5 Pro, and Claude 3.5 Sonnet in single-image understanding**. Its video understanding performance is also better than that of GPT-4V, Claude 3.5 Sonnet, and open-source omni-modal models.
- 🚀 **Leading Medical Image Understanding Capabilities.** Baichuan-Omni-1.5 achieves the best performance on GMAI-MMBench and OpenMM-Medical. Using only a 7B LLM, its average score exceeds that of Qwen2-VL-72B by about 3 points (83.8% vs. 80.7%).
- 🎙 **Excellent Voice Capabilities.** Baichuan-Omni-1.5 **supports high-quality, controllable, bilingual real-time voice conversations in Chinese and English**. It **outperforms GPT-4o-realtime** in speech understanding tasks such as ASR and STT, and demonstrates **the highest speech generation performance among open-source models** in semantic and acoustic evaluations of voice conversations.
- 🎬 **Powerful Real-world Understanding and Other Features.** Baichuan-Omni-1.5 further improves the visual understanding capabilities of Baichuan-omni. It can process images of any aspect ratio with up to 1.8 million pixels (for example, 1344x1344). It scores 68.8 on RealWorldQA, **surpassing commercial closed-source models such as GPT-4o-mini** and recently open-sourced omni-modal models. It scores 85.6 and 83.6 on the English and Chinese subsets of MMBench, respectively, which places it in the top tier of models of the same size.
- 💫 **Provides [🤗 Base Model](https://huggingface.co/baichuan-inc/Baichuan-Omni-1d5-Base) and [🤗 Instruct Model](https://huggingface.co/baichuan-inc/Baichuan-Omni-1d5).** Baichuan-Omni-1.5-Base is a high-performance omni-modal foundation model. Building on this base, Baichuan-Omni-1.5 is trained end to end on high-quality omni-modal alignment data for multimodal instruction following.

**Model Architecture**

- **End-to-end Omni-modal Architecture.** We carefully design a **multi-stage, end-to-end** progressive training scheme for the encoding and decoding modules of each modality. The goal is to make full use of the rich knowledge in each modality so that knowledge from different modalities complements each other. Notably, the model is trained fully end to end with a next-token prediction (NTP) loss throughout the pre-training stage.
- **High-quality Controllable Audio Solution.** Multimodal system prompts have been redesigned to include both traditional text system prompts and **speech system prompts** that specify the model's voice. This gives the flexibility to control voice style through text or speech samples at inference time, and supports advanced capabilities such as end-to-end voice cloning and timbre creation.

### Open-source Evaluation Datasets

**OpenMM-Medical**

To comprehensively evaluate the model's multimodal medical capabilities, we constructed OpenMM-Medical, which draws on 42 publicly available medical image datasets such as ACRIMA (retinal images), BioMediTech (microscope images), and CoronaHack (X-rays), totaling 88,996 images.

**OpenAudioBench**

To efficiently assess the model's "IQ", we developed OpenAudioBench. It comprises five end-to-end audio understanding sub-datasets: four public benchmarks (Llama Questions, Web Questions, TriviaQA, and AlpacaEval) and a speech logical reasoning dataset created internally by the Baichuan team, totaling 2,701 entries. Together, this suite reflects the model's overall "IQ" level.

### Evaluation

We suggest that readers refer to our [**Github**](https://github.com/baichuan-inc/Baichuan-Omni-1.5/) for more details.


#### Pure Text Understanding

**Comprehensive Tasks**

| Model | Size | MMLU (Acc.) | CMMLU (Acc.) | AGIEval (Acc.) | C-Eval (Acc.) | GAOKAO (Acc.) |
|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | |
| GPT-4o | - | 88.0♢ | 78.3♢ | 62.3♢ | 86.0♢ | - |
| GPT-4o-mini | - | 82.0 | 67.6 | 52.2 | 63.6 | 70.8 |
| **Open-source Models (Pure text)** | | | | | | |
| MAP-Neo | 7B | 58.2 | 55.1 | 33.9 | 57.5 | - |
| Qwen1.5-Chat | 7B | 61.5 | 68.0 | 39.3 | 68.8 | - |
| Llama3-Instruct | 8B | 67.1 | 51.7 | 38.4 | 50.7 | - |
| OLMo | 7B | 28.4 | 25.6 | 19.9 | 27.3 | - |
| **Open-source Models (Omni-modal)** | | | | | | |
| VITA | 8x7B | 71.0* | 46.6 | 46.2* | 56.7* | - |
| VITA-1.5 | 7B | 71.0 | 75.1 | 47.9 | 65.6 | 57.4 |
| Baichuan-Omni | 7B | 65.3 | 72.2 | 47.7 | 68.9 | - |
| MiniCPM-o 2.6 | 7B | 65.3 | 63.3 | 50.9 | 61.5 | 56.3 |
| Baichuan-Omni-1.5 | 7B | 72.2 | 75.5 | 54.4 | 73.1 | 73.5 |


#### Image understanding ability

**Multi-choice & Yes-or-No Question**

| Model | Size | MMBench-EN (Acc.) | MMBench-CN (Acc.) | SEED-IMG (Acc.) | MMMU-val (Acc.) | HallusionBench (Acc.) |
|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | |
| GPT-4o | - | 83.4♢ | 82.1♢ | - | 69.1♢ | 55.0♢ |
| GPT-4o-mini | - | 77.7 | 76.9 | 72.3 | 60.0♢ | 46.1♢ |
| **Open-source Models (Vision-language)** | | | | | | |
| Qwen2-VL-7B | 7B | 86.4 | 81.9 | 76.5 | 52.7 | 50.6∗ |
| MiniCPM-Llama3-V 2.5 | 8B | 76.7 | 73.3 | 72.4 | 45.8∗ | 42.5 |
| **Open-source Models (Omni-modal)** | | | | | | |
| VITA | 8x7B | 74.7 | 71.4 | 72.6 | 45.3 | 39.7∗ |
| VITA-1.5 | 7B | 80.8 | 80.2 | 74.2 | 53.1 | 44.1 |
| Baichuan-Omni | 7B | 76.2 | 74.9 | 74.1 | 47.3 | 47.8 |
| MiniCPM-o 2.6 | 7B | 83.6 | 81.8 | 75.4 | 51.1 | 50.1 |
| Baichuan-Omni-1.5 | 7B | 85.6 | 83.6 | 75.7 | 53.9 | 49.7 |

**Visual Question Answering**

| Model | Size | RealWorldQA (Acc.) | MathVista-mini (Acc.) | TextVQA-val (Acc.) | ChartQA (Acc.) | OCRBench (Acc.) |
|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | |
| GPT-4o | - | 75.4♢ | 63.8♢ | - | 85.7♢ | 73.6♢ |
| GPT-4o-mini | - | 66.3 | 53.4 | 66.8 | - | 77.4 |
| **Open-source Models (Vision-language)** | | | | | | |
| Qwen2-VL-7B | 7B | 69.7 | 58.2∗ | 84.3∗ | 83.0∗ | 84.5∗ |
| MiniCPM-Llama3-V 2.5 | 8B | 63.5 | 54.3∗ | 76.6 | 72.0 | 72.5 |
| **Open-source Models (Omni-modal)** | | | | | | |
| VITA | 8x7B | 59.0 | 44.9∗ | 71.8 | 76.6 | 68.5∗ |
| VITA-1.5 | 7B | 66.8 | 66.5 | 74.9 | 79.6 | 73.3 |
| Baichuan-Omni | 7B | 62.6 | 51.9 | 74.3 | 79.6 | 70.0 |
| MiniCPM-o 2.6 | 7B | 67.7 | 64.6 | 80.1 | 87.6 | 89.7∗ |
| Baichuan-Omni-1.5 | 7B | 68.8 | 63.6 | 83.2 | 84.9 | 84.0 |
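
The feature list states that the model handles images of any aspect ratio with up to about 1.8 million pixels (for example, 1344x1344). As a rough illustration of what such a pixel budget means for input preparation, the Python sketch below scales an oversized image's dimensions down while preserving aspect ratio. The function name `fit_to_pixel_budget`, the use of 1344 x 1344 as the exact budget, and the rounding policy are all assumptions made for illustration; this is not the model's actual preprocessing code.

```python
import math

# Assumed budget: the 1344x1344 example from the feature list (1,806,336 pixels).
MAX_PIXELS = 1344 * 1344


def fit_to_pixel_budget(width: int, height: int, max_pixels: int = MAX_PIXELS) -> tuple[int, int]:
    """Return (width, height) scaled down, if needed, to fit within max_pixels.

    Hypothetical helper: the aspect ratio is preserved up to integer rounding,
    and images already within the budget are returned unchanged.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if width * height <= max_pixels:
        return width, height
    # Scale both sides by the same factor so the area shrinks to the budget.
    scale = math.sqrt(max_pixels / (width * height))
    new_width = max(1, math.floor(width * scale))
    new_height = max(1, math.floor(height * scale))
    return new_width, new_height
```

For instance, `fit_to_pixel_budget(1344, 1344)` leaves the size unchanged, while a 4000x3000 photo would be reduced to a size whose area is at most `MAX_PIXELS` with roughly the same 4:3 ratio.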


#### Video understanding ability

**General VQA**

| Model | Size | # Frames | MVBench (Acc.) | EgoSchema (Acc.) | VideoMME (Acc.) | Perception-Test (Acc.) |
|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | |
| Gemini 1.5 Pro | - | - | 81.3♢ | 63.2* | 75.0♢ | - |
| GPT-4o-mini | - | - | 55.2 | 58.5 | 63.6 | 48.2 |
| GPT-4o | - | - | - | 77.2* | 71.9♢ | - |
| GPT-4V | - | - | 43.7♢ | 55.6* | 59.9♢ | - |
| **Open-source Models (Vision-language)** | | | | | | |
| Qwen2-VL-7B | 7B | 2 fps (max 768) | 67.0* \| 64.4 | 66.7* \| 66.6 | 63.3* \| 59.0 | 62.3* \| 60.3 |
| AnyGPT | 8B | 48 | 33.2 | 32.1 | 29.8 | 29.1 |
| VideoLLaMA 2 | 7B | 16 | 54.6* | 51.7* | 46.6* | 51.4* |
| VideoChat2 | 7B | 16 | 51.1* | 42.1♢ | 33.7♢ | 47.3♢ |
| LLaVA-NeXT-Video | 7B | 32 | 46.5♢ | 43.9♢ | 33.7♢ | 48.8♢ |
| Video-LLaVA | 7B | 8 | 41.0♢ | 38.4♢ | 39.9♢ | 44.3♢ |
| **Open-source Models (Omni-modal)** | | | | | | |
| VITA | 8x7B | 1 fps (max 32) | 53.4 | 53.9 | 56.1 | 56.2 |
| VITA-1.5 | 7B | 1 fps (max 32) | 55.5 | 54.7 | 57.3 | 57.6 |
| Baichuan-Omni | 7B | 1 fps (max 48) | 60.9 | 58.8 | 58.2 | 56.8 |
| MiniCPM-o 2.6 | 7B | 1 fps (max 64) | 58.6 | 50.7 | 63.4 | 66.6 |
| Baichuan-Omni-1.5 | 7B | 1 fps (max 32) | 63.7 | 62.4 | 60.1 | 68.9 |
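
The "# Frames" column above records each model's frame sampling policy, for example "1 fps (max 32)". One plausible reading of such a policy is: sample frames at the target rate, and when the video is long enough that this would exceed the cap, fall back to that many uniformly spaced frames. The Python sketch below illustrates that reading only. The function name `sample_frame_indices` and the uniform fallback are assumptions, not the sampling code used in these evaluations.

```python
def sample_frame_indices(
    num_frames: int,
    native_fps: float,
    target_fps: float = 1.0,
    max_frames: int = 32,
) -> list[int]:
    """Pick frame indices at roughly target_fps, capped at max_frames.

    Hypothetical helper: indices are spaced uniformly over the whole video,
    so a long video is covered end to end rather than truncated.
    """
    if num_frames <= 0 or native_fps <= 0 or target_fps <= 0 or max_frames <= 0:
        return []
    duration_s = num_frames / native_fps
    # Number of frames the target rate would yield, at least one.
    wanted = max(1, int(duration_s * target_fps))
    # Enforce the cap from the "max N" part of the policy.
    count = min(wanted, max_frames)
    # Uniformly spaced indices; the largest is always below num_frames.
    return [int(i * num_frames / count) for i in range(count)]
```

Under these assumptions, a 10-second clip at 30 fps yields one frame per second (10 indices), while a 2-minute clip at the same frame rate is capped at 32 indices spread across the full duration.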

**Open-ended VQA**

| Model | Size | # Frames | ActivityNet-QA (Acc.) | ActivityNet-QA (Score) | MSVD-QA (Acc.) | MSVD-QA (Score) |
|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | |
| Gemini 1.5 Pro | - | - | 56.7* | - | - | - |
| GPT-4o-mini | - | 1 fps (max 32) | 62.1 | 3.1 | 67.5 | 3.3 |
| GPT-4o | - | - | 61.9* | - | - | - |
| GPT-4V | - | - | 59.5* | - | - | - |
| **Open-source Models (Vision-language)** | | | | | | |
| Qwen2-VL | 7B | 2 fps (max 768) | 17.4 | 1.9 | 61.1 | 3.5 |
| VideoLLaMA 2 | 7B | 16 | 50.2* | 3.3* | 70.9* | 3.8* |
| VideoChat2 | 7B | 16 | 49.1* | 3.3* | 70.0* | 3.9* |
| LLaVA-NeXT-Video | 7B | 32 | 53.5* | 3.2* | 67.4 | 3.4 |
| Video-LLaVA | 7B | 8 | 45.3* | 3.3* | 70.7* | 3.9* |
| **Open-source Models (Omni-modal)** | | | | | | |
| VITA | 8x7B | 1 fps (max 32) | 55.0 | 3.5 | 63.9 | 3.7 |
| VITA-1.5 | 7B | 1 fps (max 32) | 59.6 | 3.0 | 67.6 | 3.3 |
| Baichuan-Omni | 7B | 1 fps (max 48) | 58.6 | 3.7 | 72.2 | 4.0 |
| MiniCPM-o 2.6 | 7B | 1 fps (max 64) | 63.0 | 3.1 | 73.7 | 3.6 |
| Baichuan-Omni-1.5 | 7B | 1 fps (max 48) | 62.0 | 3.1 | 74.2 | 3.6 |


#### Audio understanding and generation ability

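**Audio Comprehensive Capacity**

| Model | Size | Reasoning QA (s→t) | Reasoning QA (s→s) | Llama Questions (s→t) | Llama Questions (s→s) | Web Questions (s→t) | Web Questions (s→s) | TriviaQA (s→t) | TriviaQA (s→s) | AlpacaEval (s→t) | AlpacaEval (s→s) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| **Proprietary Models** | | | | | | | | | | | |
| GPT-4o-Audio | - | 55.6 | - | 88.4 | - | 8.10 | - | 9.06 | - | 8.01 | - |
| **Open-source Models (Pure Audio)** | | | | | | | | | | | |
| GLM-4-Voice | 9B | - | 26.5 | - | 71.0 | - | 5.15 | - | 4.66 | - | 4.89 |
| **Open-source Models (Omni-modal)** | | | | | | | | | | | |
| VITA-1.5 | 7B | 41.0 | - | 74.2 | - | 5.73 | - | 4.68 | - | 6.82 | - |
| MiniCPM-o 2.6 | 7B | 38.6 | - | 77.8 | - | 6.86 | - | 6.19 | - | 5.18 | - |
| Baichuan-Omni-1.5 | 7B | 50.0 | 40.9 | 78.5 | 75.3 | 5.91 | 5.52 | 5.72 | 5.31 | 7.79 | 6.94 |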

### Examples