Collections
Discover the best community collections!
Collections including paper arxiv:2402.03766

- The Evolution of Multimodal Model Architectures
  Paper • 2405.17927 • Published • 1
- What matters when building vision-language models?
  Paper • 2405.02246 • Published • 102
- Efficient Architectures for High Resolution Vision-Language Models
  Paper • 2501.02584 • Published
- Building and better understanding vision-language models: insights and future directions
  Paper • 2408.12637 • Published • 125

- Exploring the Potential of Encoder-free Architectures in 3D LMMs
  Paper • 2502.09620 • Published • 25
- The Evolution of Multimodal Model Architectures
  Paper • 2405.17927 • Published • 1
- What matters when building vision-language models?
  Paper • 2405.02246 • Published • 102
- Efficient Architectures for High Resolution Vision-Language Models
  Paper • 2501.02584 • Published

- MM-LLMs: Recent Advances in MultiModal Large Language Models
  Paper • 2401.13601 • Published • 47
- Orion-14B: Open-source Multilingual Large Language Models
  Paper • 2401.12246 • Published • 13
- Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
  Paper • 2405.09215 • Published • 22
- AlignGPT: Multi-modal Large Language Models with Adaptive Alignment Capability
  Paper • 2405.14129 • Published • 13

- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
  Paper • 2403.09611 • Published • 126
- Evolutionary Optimization of Model Merging Recipes
  Paper • 2403.13187 • Published • 52
- MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
  Paper • 2402.03766 • Published • 14
- LLM Agent Operating System
  Paper • 2403.16971 • Published • 65

- VisionLLaMA: A Unified LLaMA Interface for Vision Tasks
  Paper • 2403.00522 • Published • 45
- MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
  Paper • 2402.03766 • Published • 14
- MobileVLM : A Fast, Reproducible and Strong Vision Language Assistant for Mobile Devices
  Paper • 2312.16886 • Published • 20
- Lenna: Language Enhanced Reasoning Detection Assistant
  Paper • 2312.02433 • Published • 2

- Textbooks Are All You Need
  Paper • 2306.11644 • Published • 142
- LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model
  Paper • 2401.02330 • Published • 16
- Textbooks Are All You Need II: phi-1.5 technical report
  Paper • 2309.05463 • Published • 87
- Visual Instruction Tuning
  Paper • 2304.08485 • Published • 13

- Self-Rewarding Language Models
  Paper • 2401.10020 • Published • 146
- ReFT: Reasoning with Reinforced Fine-Tuning
  Paper • 2401.08967 • Published • 30
- Tuning Language Models by Proxy
  Paper • 2401.08565 • Published • 23
- TrustLLM: Trustworthiness in Large Language Models
  Paper • 2401.05561 • Published • 69

- Extending Context Window of Large Language Models via Semantic Compression
  Paper • 2312.09571 • Published • 15
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 50
- LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models
  Paper • 2312.02949 • Published • 14
- TinyLLaVA: A Framework of Small-scale Large Multimodal Models
  Paper • 2402.14289 • Published • 19

- A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions
  Paper • 2312.08578 • Published • 20
- ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks
  Paper • 2312.08583 • Published • 12
- Vision-Language Models as a Source of Rewards
  Paper • 2312.09187 • Published • 14
- StemGen: A music generation model that listens
  Paper • 2312.08723 • Published • 48