OmniAgent: Audio-Guided Active Perception Agent for Omnimodal Audio-Video Understanding
Abstract
Omnimodal large language models have made significant strides in unifying audio and visual modalities; however, they often lack fine-grained cross-modal understanding and struggle with multimodal alignment. To address these limitations, we introduce OmniAgent, a fully audio-guided active perception agent that dynamically orchestrates specialized tools to achieve more fine-grained audio-visual reasoning. Unlike previous works that rely on rigid, static workflows and dense frame captioning, this paper demonstrates a paradigm shift from passive response generation to active multimodal inquiry. OmniAgent employs dynamic planning to autonomously orchestrate tool invocation on demand, strategically concentrating perceptual attention on task-relevant cues. Central to our approach is a novel coarse-to-fine audio-guided perception paradigm, which leverages audio cues to localize temporal events and guide subsequent reasoning. Extensive empirical evaluations on three audio-video understanding benchmarks demonstrate that OmniAgent achieves state-of-the-art performance, surpassing leading open-source and proprietary models by substantial margins of 10%–20% in accuracy.
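To make the coarse-to-fine audio-guided perception paradigm described above concrete, here is a minimal Python sketch of the control flow: listen first to localize candidate temporal events, then invoke visual tools only on the selected segments. All tool names, signatures, and the keyword-matching planner are hypothetical placeholders introduced for illustration, not the paper's actual components; the real agent would back each stub with specialized models and a learned planner.

```python
"""Minimal sketch of an audio-guided coarse-to-fine perception loop.
All tool implementations below are hypothetical stubs, not the paper's API."""

from dataclasses import dataclass
from typing import List


@dataclass
class AudioEvent:
    start: float   # event start time in seconds
    end: float     # event end time in seconds
    label: str     # e.g. "speech", "dog barking"


# --- Hypothetical tool stubs (placeholders for specialized models) ----------

def detect_audio_events(video_path: str) -> List[AudioEvent]:
    """Coarse pass: scan the audio track for salient events."""
    raise NotImplementedError("back with an audio-tagging model")


def transcribe_audio(video_path: str) -> str:
    """Coarse pass: speech-to-text over the full audio track."""
    raise NotImplementedError("back with an ASR model")


def caption_segment(video_path: str, start: float, end: float, question: str) -> str:
    """Fine pass: describe only the frames inside one localized segment."""
    raise NotImplementedError("back with a vision-language model")


def answer(question: str, transcript: str, observations: List[str]) -> str:
    """Final reasoning over the gathered audio and visual evidence."""
    raise NotImplementedError("back with an LLM")


def omni_agent_answer(video_path: str, question: str) -> str:
    # 1. Coarse: listen first; audio is cheap to scan and localizes the
    #    temporal regions most likely to matter for the question.
    events = detect_audio_events(video_path)
    transcript = transcribe_audio(video_path)

    # 2. Plan: keep only events whose labels appear related to the question
    #    (a crude stand-in for the agent's dynamic planner).
    relevant = [ev for ev in events if ev.label.lower() in question.lower()]

    # 3. Fine: invoke visual tools only on the selected segments instead of
    #    densely captioning every frame.
    observations = [
        caption_segment(video_path, ev.start, ev.end, question)
        for ev in relevant
    ]

    # 4. Answer with the aggregated audio and visual context.
    return answer(question, transcript, observations)
```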
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding (2025)
- Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything (2025)
- LongVideoAgent: Multi-Agent Reasoning with Long Videos (2025)
- JointAVBench: A Benchmark for Joint Audio-Visual Reasoning Evaluation (2025)
- EchoingPixels: Cross-Modal Adaptive Token Reduction for Efficient Audio-Visual LLMs (2025)
- OmniZip: Audio-Guided Dynamic Token Compression for Fast Omnimodal Large Language Models (2025)
- ChronusOmni: Improving Time Awareness of Omni Large Language Models (2025)