Multimodal
Qwen2.5-Omni Technical Report
Paper • 2503.20215 • Published • 170
Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO
Paper • 2505.22453 • Published • 46
UniRL: Self-Improving Unified Multimodal Models via Supervised and Reinforcement Learning
Paper • 2505.23380 • Published • 22
More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models
Paper • 2505.21523 • Published • 13
Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces
Paper • 2506.00123 • Published • 35
Aligning Latent Spaces with Flow Priors
Paper • 2506.05240 • Published • 27
Discrete Diffusion in Large Language and Multimodal Models: A Survey
Paper • 2506.13759 • Published • 43
MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation
Paper • 2506.14028 • Published • 93
Show-o2: Improved Native Unified Multimodal Models
Paper • 2506.15564 • Published • 29
OmniGen2: Exploration to Advanced Multimodal Generation
Paper • 2506.18871 • Published • 78
Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens
Paper • 2506.17218 • Published • 29
UniFork: Exploring Modality Alignment for Unified Multimodal Understanding and Generation
Paper • 2506.17202 • Published • 10
HumanOmniV2: From Understanding to Omni-Modal Reasoning with Context
Paper • 2506.21277 • Published • 14
GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning
Paper • 2507.01006 • Published • 251
VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents
Paper • 2507.04590 • Published • 17
Robust Multimodal Large Language Models Against Modality Conflict
Paper • 2507.07151 • Published • 6
Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models
Paper • 2507.07104 • Published • 46
Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers
Paper • 2507.10787 • Published • 13
VL-Cogito: Progressive Curriculum Reinforcement Learning for Advanced Multimodal Reasoning
Paper • 2507.22607 • Published • 47
Skywork UniPic: Unified Autoregressive Modeling for Visual Understanding and Generation
Paper • 2508.03320 • Published • 63
Visual-CoG: Stage-Aware Reinforcement Learning with Chain of Guidance for Text-to-Image Generation
Paper • 2508.18032 • Published • 41
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Paper • 2508.18265 • Published • 214
Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning
Paper • 2508.20751 • Published • 89
OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning
Paper • 2509.01644 • Published • 34
Visual Representation Alignment for Multimodal Large Language Models
Paper • 2509.07979 • Published • 84
Reconstruction Alignment Improves Unified Multimodal Models
Paper • 2509.07295 • Published • 40
MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer
Paper • 2509.16197 • Published • 58
Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation
Paper • 2509.18824 • Published • 23
Learning to See Before Seeing: Demystifying LLM Visual Priors from Language Pre-training
Paper • 2509.26625 • Published • 43
More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models
Paper • 2509.25848 • Published • 80
IMG: Calibrating Diffusion Models via Implicit Multimodal Guidance
Paper • 2509.26231 • Published • 18
Self-Improvement in Multimodal Large Language Models: A Survey
Paper • 2510.02665 • Published • 21
MathSE: Improving Multimodal Mathematical Reasoning via Self-Evolving Iterative Reflection and Reward-Guided Fine-Tuning
Paper • 2511.06805 • Published • 13
Mixture of States: Routing Token-Level Dynamics for Multimodal Generation
Paper • 2511.12207 • Published • 10
M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark
Paper • 2511.17729 • Published • 17
Does Understanding Inform Generation in Unified Multimodal Models? From Analysis to Path Forward
Paper • 2511.20561 • Published • 32
Architecture Decoupling Is Not All You Need For Unified Multimodal Model
Paper • 2511.22663 • Published • 29
MMGR: Multi-Modal Generative Reasoning
Paper • 2512.14691 • Published • 119
HyperVL: An Efficient and Dynamic Multimodal Large Language Model for Edge Devices
Paper • 2512.14052 • Published • 42
NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation
Paper • 2601.02204 • Published • 62
LaViT: Aligning Latent Visual Thoughts for Multi-modal Reasoning
Paper • 2601.10129 • Published • 12
VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents
Paper • 2601.16973 • Published • 40
MMFineReason: Closing the Multimodal Reasoning Gap via Open Data-Centric Methods
Paper • 2601.21821 • Published • 60
Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models
Paper • 2601.22060 • Published • 154
OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models
Paper • 2602.04804 • Published • 46
Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models
Paper • 2602.07026 • Published • 138
Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device
Paper • 2602.20161 • Published • 23
Beyond Language Modeling: An Exploration of Multimodal Pretraining
Paper • 2603.03276 • Published • 76