---
license: apache-2.0
pipeline_tag: any-to-any
---
Baichuan-Omni-1.5 🤗 | Baichuan-Omni-1.5-Base 🤗 | GitHub 📖 | Report 📖
Comprehensive Tasks | ||||||
---|---|---|---|---|---|---|
Model | Size | MMLU (Acc.) | CMMLU (Acc.) | AGIEval (Acc.) | C-Eval (Acc.) | GAOKAO (Acc.) |
Proprietary Models | ||||||
GPT 4o | - | 88.0♢ | 78.3♢ | 62.3♢ | 86.0♢ | - |
GPT 4o mini | - | 82.0 | 67.6 | 52.2 | 63.6 | 70.8 |
Open-source Models (Pure text) | ||||||
MAP-Neo | 7B | 58.2 | 55.1 | 33.9 | 57.5 | - |
Qwen1.5-Chat | 7B | 61.5 | 68.0 | 39.3 | 68.8 | - |
Llama3-Instruct | 8B | 67.1 | 51.7 | 38.4 | 50.7 | - |
OLMo | 7B | 28.4 | 25.6 | 19.9 | 27.3 | - |
Open-source Models (Omni-modal) | ||||||
VITA | 8x7B | 71.0* | 46.6 | 46.2* | 56.7* | - |
VITA-1.5 | 7B | 71.0 | 75.1 | 47.9 | 65.6 | 57.4 |
Baichuan-Omni | 7B | 65.3 | 72.2 | 47.7 | 68.9 | - |
MiniCPM-o 2.6 | 7B | 65.3 | 63.3 | 50.9 | 61.5 | 56.3 |
Baichuan-Omni-1.5 | 7B | 72.2 | 75.5 | 54.4 | 73.1 | 73.5 |
Multi-choice & Yes-or-No Question | ||||||||
---|---|---|---|---|---|---|---|---|
Model | Size | MMBench-EN (Acc.) | MMBench-CN (Acc.) | SEED-IMG (Acc.) | MMMU-val (Acc.) | HallusionBench (Acc.) | ||
Proprietary Models | ||||||||
GPT-4o | - | 83.4♢ | 82.1♢ | - | 69.1♢ | 55.0♢ | ||
GPT-4o-mini | - | 77.7 | 76.9 | 72.3 | 60.0♢ | 46.1♢ | ||
Open Source Models (Vision-Language) | ||||||||
Qwen2-VL-7B | 7B | 86.4 | 81.9 | 76.5 | 52.7 | 50.6∗ | ||
MiniCPM-Llama3-V 2.5 | 8B | 76.7 | 73.3 | 72.4 | 45.8∗ | 42.5 | ||
Open Source Models (Omni-modal) | ||||||||
VITA | 8x7B | 74.7 | 71.4 | 72.6 | 45.3 | 39.7∗ | ||
VITA-1.5 | 7B | 80.8 | 80.2 | 74.2 | 53.1 | 44.1 | ||
Baichuan-Omni | 7B | 76.2 | 74.9 | 74.1 | 47.3 | 47.8 | ||
MiniCPM-o 2.6 | 7B | 83.6 | 81.8 | 75.4 | 51.1 | 50.1 | ||
Baichuan-Omni-1.5 | 7B | 85.6 | 83.6 | 75.7 | 53.9 | 49.7 |
Visual Question Answering | ||||||||
---|---|---|---|---|---|---|---|---|
Model | Size | RealWorldQA (Acc.) | MathVista-mini (Acc.) | TextVQA-val (Acc.) | ChartQA (Acc.) | OCRBench (Acc.) | ||
Proprietary Models | ||||||||
GPT-4o | - | 75.4♢ | 63.8♢ | - | 85.7♢ | 73.6♢ | ||
GPT-4o-mini | - | 66.3 | 53.4 | 66.8 | - | 77.4 | ||
Open Source Models (Vision-Language) | ||||||||
Qwen2-VL-7B | 7B | 69.7 | 58.2∗ | 84.3∗ | 83.0∗ | 84.5∗ | ||
MiniCPM-Llama3-V 2.5 | 8B | 63.5 | 54.3∗ | 76.6 | 72.0 | 72.5 | ||
Open Source Models (Omni-modal) | ||||||||
VITA | 8x7B | 59.0 | 44.9∗ | 71.8 | 76.6 | 68.5∗ | ||
VITA-1.5 | 7B | 66.8 | 66.5 | 74.9 | 79.6 | 73.3 | ||
Baichuan-Omni | 7B | 62.6 | 51.9 | 74.3 | 79.6 | 70.0 | ||
MiniCPM-o 2.6 | 7B | 67.7 | 64.6 | 80.1 | 87.6 | 89.7∗ | ||
Baichuan-Omni-1.5 | 7B | 68.8 | 63.6 | 83.2 | 84.9 | 84.0 |
General VQA | ||||||
---|---|---|---|---|---|---|
Model | Size | # Frames | MVBench (Acc.) | EgoSchema (Acc.) | VideoMME (Acc.) | Perception-Test (Acc.) |
Proprietary Models | ||||||
Gemini 1.5 Pro | - | - | 81.3♢ | 63.2* | 75.0♢ | - |
GPT 4o mini | - | - | 55.2 | 58.5 | 63.6 | 48.2 |
GPT 4o | - | - | - | 77.2* | 71.9♢ | - |
GPT 4V | - | - | 43.7♢ | 55.6* | 59.9♢ | - |
Open-source Models (Vision-language) | ||||||
Qwen2-VL-7B | 7B | 2 fps (max 768) | 67.0* / 64.4 | 66.7* / 66.6 | 63.3* / 59.0 | 62.3* / 60.3 |
AnyGPT | 8B | 48 | 33.2 | 32.1 | 29.8 | 29.1 |
VideoLLaMA 2 | 7B | 16 | 54.6* | 51.7* | 46.6* | 51.4* |
VideoChat2 | 7B | 16 | 51.1* | 42.1♢ | 33.7♢ | 47.3♢ |
LLaVA-NeXT-Video | 7B | 32 | 46.5♢ | 43.9♢ | 33.7♢ | 48.8♢ |
Video-LLaVA | 7B | 8 | 41.0♢ | 38.4♢ | 39.9♢ | 44.3♢ |
Open-source Models (Omni-modal) | ||||||
VITA | 8x7B | 1 fps (max 32) | 53.4 | 53.9 | 56.1 | 56.2 |
VITA-1.5 | 7B | 1 fps (max 32) | 55.5 | 54.7 | 57.3 | 57.6 |
Baichuan-Omni | 7B | 1 fps (max 48) | 60.9 | 58.8 | 58.2 | 56.8 |
MiniCPM-o 2.6 | 7B | 1 fps (max 64) | 58.6 | 50.7 | 63.4 | 66.6 |
Baichuan-Omni-1.5 | 7B | 1 fps (max 32) | 63.7 | 62.4 | 60.1 | 68.9 |
Open-ended VQA | ||||||
---|---|---|---|---|---|---|
Model | Size | # Frames | ActivityNet-QA (Acc.) | ActivityNet-QA (Score) | MSVD-QA (Acc.) | MSVD-QA (Score) |
Proprietary Models | ||||||
Gemini 1.5 Pro | - | - | 56.7* | - | - | - |
GPT 4o mini | - | 1 fps (max 32) | 62.1 | 3.1 | 67.5 | 3.3 |
GPT 4o | - | - | 61.9* | - | - | - |
GPT 4V | - | - | 59.5* | - | - | - |
Open-source Models (Vision-language) | ||||||
Qwen2 VL | 7B | 2 fps (max 768) | 17.4 | 1.9 | 61.1 | 3.5 |
VideoLLaMA 2 | 7B | 16 | 50.2* | 3.3* | 70.9* | 3.8* |
VideoChat2 | 7B | 16 | 49.1* | 3.3* | 70.0* | 3.9* |
LLaVA-NeXT-Video | 7B | 32 | 53.5* | 3.2* | 67.4 | 3.4 |
Video-LLaVA | 7B | 8 | 45.3* | 3.3* | 70.7* | 3.9* |
Open-source Models (Omni-modal) | ||||||
VITA | 8x7B | 1 fps (max 32) | 55.0 | 3.5 | 63.9 | 3.7 |
VITA-1.5 | 7B | 1 fps (max 32) | 59.6 | 3.0 | 67.6 | 3.3 |
Baichuan-Omni | 7B | 1 fps (max 48) | 58.6 | 3.7 | 72.2 | 4.0 |
MiniCPM-o 2.6 | 7B | 1 fps (max 64) | 63.0 | 3.1 | 73.7 | 3.6 |
Baichuan-Omni-1.5 | 7B | 1 fps (max 48) | 62.0 | 3.1 | 74.2 | 3.6 |
Audio Comprehensive Capacity | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Model | Size | Reasoning QA (s→t) | Reasoning QA (s→s) | Llama Questions (s→t) | Llama Questions (s→s) | Web Questions (s→t) | Web Questions (s→s) | TriviaQA (s→t) | TriviaQA (s→s) | AlpacaEval (s→t) | AlpacaEval (s→s) |
Proprietary Models | |||||||||||
GPT-4o-Audio | - | 55.6 | - | 88.4 | - | 8.10 | - | 9.06 | - | 8.01 | - |
Open-source Models (Pure Audio) | |||||||||||
GLM-4-Voice | 9B | - | 26.5 | - | 71.0 | - | 5.15 | - | 4.66 | - | 4.89 |
Open-source Models (Omni-modal) | |||||||||||
VITA-1.5 | 7B | 41.0 | - | 74.2 | - | 5.73 | - | 4.68 | - | 6.82 | - |
MiniCPM-o 2.6 | 7B | 38.6 | - | 77.8 | - | 6.86 | - | 6.19 | - | 5.18 | - |
Baichuan-Omni-1.5 | 7B | 50.0 | 40.9 | 78.5 | 75.3 | 5.91 | 5.52 | 5.72 | 5.31 | 7.79 | 6.94 |