arXiv:2303.08342

Autonomous Soundscape Augmentation with Multimodal Fusion of Visual and Participant-linked Inputs

Published on Mar 15, 2023
Abstract

Autonomous soundscape augmentation systems typically use trained models to pick optimal maskers to effect a desired perceptual change. While acoustic information is paramount to such systems, contextual information, including participant demographics and the visual environment, also influences acoustic perception. Hence, we propose modular modifications to an existing attention-based deep neural network to allow early, mid-level, and late feature fusion of participant-linked, visual, and acoustic features. Ablation studies on module configurations and corresponding fusion methods using the ARAUS dataset show that contextual features improve model performance on the normalized ISO Pleasantness in a statistically significant manner, reaching a mean squared error of 0.1194 ± 0.0012 for the best-performing all-modality model, against 0.1217 ± 0.0009 for the audio-only model. Soundscape augmentation systems can thereby leverage multimodal inputs for improved performance. We also investigate the impact of individual participant-linked factors using trained models to illustrate improvements in model explainability.
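
A minimal sketch of the kind of multimodal fusion described in the abstract, assuming a PyTorch-style implementation: the module names, feature dimensions, and the particular early/mid-level/late fusion layout below are hypothetical illustrations only, not the authors' actual architecture.

import torch
import torch.nn as nn

class MultimodalFusionModel(nn.Module):
    """Hypothetical regressor fusing acoustic, visual, and participant-linked features."""

    def __init__(self, audio_dim=128, visual_dim=64, participant_dim=16, hidden_dim=128):
        super().__init__()
        # Early fusion: project each modality into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        self.participant_proj = nn.Linear(participant_dim, hidden_dim)
        # Mid-level fusion: attention across the stacked modality embeddings.
        self.attention = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)
        # Late fusion: merge the attended summary with an audio-only branch.
        self.audio_head = nn.Linear(hidden_dim, hidden_dim)
        self.regressor = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # predicted normalized ISO Pleasantness
        )

    def forward(self, audio, visual, participant):
        # Early fusion: per-modality projections stacked as a 3-token sequence.
        tokens = torch.stack(
            [self.audio_proj(audio),
             self.visual_proj(visual),
             self.participant_proj(participant)],
            dim=1,
        )  # (batch, 3, hidden_dim)
        # Mid-level fusion: self-attention lets modalities exchange information.
        attended, _ = self.attention(tokens, tokens, tokens)
        fused = attended.mean(dim=1)  # (batch, hidden_dim)
        # Late fusion: concatenate with the audio branch before regression.
        audio_branch = self.audio_head(self.audio_proj(audio))
        return self.regressor(torch.cat([fused, audio_branch], dim=-1))

# Illustrative usage with random stand-in features (shapes are arbitrary).
model = MultimodalFusionModel()
audio, visual, participant = torch.randn(8, 128), torch.randn(8, 64), torch.randn(8, 16)
pleasantness = model(audio, visual, participant)  # (8, 1)
loss = nn.MSELoss()(pleasantness, torch.rand(8, 1))  # MSE, the metric reported above

In such a setup, the audio-only baseline mentioned in the abstract would correspond to dropping the visual and participant branches, and an ablation over module configurations amounts to enabling or disabling the projection, attention, and concatenation stages.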
