arxiv:2508.19493

Mind the Third Eye! Benchmarking Privacy Awareness in MLLM-powered Smartphone Agents

Published on Aug 27
· Submitted by Jungang on Aug 28
Abstract

A large-scale benchmark evaluates the privacy awareness of smartphone agents powered by Multimodal Large Language Models, revealing significant gaps in their ability to protect sensitive user information.

AI-generated summary

Smartphones bring significant convenience to users but also enable devices to extensively record various types of personal information. Existing smartphone agents powered by Multimodal Large Language Models (MLLMs) have achieved remarkable performance in automating tasks. However, as a cost, these agents are granted substantial access to users' sensitive personal information during operation. To gain a thorough understanding of the privacy awareness of these agents, we present, to the best of our knowledge, the first large-scale benchmark of its kind, encompassing 7,138 scenarios. In addition, for the privacy context in each scenario, we annotate its type (e.g., Account Credentials), sensitivity level, and location. We then carefully benchmark seven available mainstream smartphone agents. Our results demonstrate that almost all benchmarked agents show unsatisfactory privacy awareness (RA), with performance remaining below 60% even with explicit hints. Overall, closed-source agents show better privacy ability than open-source ones, and Gemini 2.0 Flash performs best, achieving an RA of 67%. We also find that the agents' privacy detection capability is highly related to scenario sensitivity level, i.e., scenarios with higher sensitivity levels are typically more identifiable. We hope these findings prompt the research community to rethink the unbalanced utility-privacy tradeoff of smartphone agents. Our code and benchmark are available at https://zhixin-l.github.io/SAPA-Bench.
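
As a rough illustration (not from the paper, all names hypothetical), the headline metric can be read as a recognition rate over annotated scenarios: each scenario carries a privacy type, sensitivity level, and on-screen location, and RA is the fraction of scenarios where the agent flags the sensitive content, optionally broken down by sensitivity level. A minimal sketch:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Scenario:
    privacy_type: str    # e.g., "Account Credentials"
    sensitivity: int     # annotated sensitivity level, e.g., 1 (low) to 3 (high)
    location: str        # where the sensitive content appears on screen
    agent_flagged: bool  # did the agent recognize the privacy risk?

def privacy_awareness(scenarios):
    """Overall recognition rate plus a per-sensitivity-level breakdown."""
    by_level = defaultdict(list)
    for s in scenarios:
        by_level[s.sensitivity].append(s.agent_flagged)
    overall = sum(s.agent_flagged for s in scenarios) / len(scenarios)
    per_level = {lvl: sum(flags) / len(flags) for lvl, flags in by_level.items()}
    return overall, per_level

# Toy usage, consistent with the paper's finding that higher-sensitivity
# scenarios tend to be more identifiable:
runs = [
    Scenario("Account Credentials", 3, "login form", True),
    Scenario("Chat History", 2, "message list", False),
    Scenario("Browsing History", 1, "URL bar", False),
]
overall, per_level = privacy_awareness(runs)
print(f"overall RA: {overall:.0%}, per level: {per_level}")
```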

Community

Paper author Paper submitter

SAPA-Bench

Really interesting benchmark, especially the finding that even with explicit hints, most agents still fall below 60% in privacy awareness. Do you think this weakness stems more from limitations in multimodal reasoning itself, or from the lack of explicit privacy-oriented training data? Also, how might future benchmarks better capture the tradeoff between task utility and privacy sensitivity in real-world smartphone use?

Thank you for raising this interesting point. I think both factors play a significant role. On one hand, there are intrinsic limitations in multimodal reasoning: even with explicit privacy prompts, existing MLLM agents still struggle with sensor perception, multimodal integration, and contextual understanding, especially when it comes to recognizing privacy-sensitive content. On the other hand, the scarcity of privacy-oriented training data further impairs performance; without datasets that highlight sensitive cues, models struggle to detect subtle privacy risks or to develop safe behavior patterns.

Looking ahead, I believe future benchmarks should explicitly incorporate task utility–privacy trade-off scenarios. For example, a benchmark could require agents to first identify and alert about potential privacy risks before proceeding with a task, forcing them to balance functionality with privacy considerations responsibly.
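
To make that "flag before acting" idea concrete, here is one possible harness shape (a hypothetical scoring rule, not from the paper): an agent's action trace earns privacy credit only if a warning precedes any action that touches sensitive content, so utility and privacy are scored jointly rather than letting silent task completion win by default.

```python
def score_trace(trace, task_completed):
    """Score an agent's action trace under a flag-before-act rule.

    `trace` is an ordered list of events, each a dict such as
    {"kind": "warn"} or {"kind": "act", "touches_sensitive": bool}.
    Returns (utility, privacy_safe).
    """
    warned = False
    for event in trace:
        if event["kind"] == "warn":
            warned = True
        elif event["kind"] == "act" and event.get("touches_sensitive"):
            if not warned:
                # Acted on sensitive content without alerting the user first.
                return task_completed, False
    return task_completed, True

# A trace that warns before filling a password field is privacy-safe;
# the same actions without the warning are not.
safe = [{"kind": "warn"}, {"kind": "act", "touches_sensitive": True}]
unsafe = [{"kind": "act", "touches_sensitive": True}]
print(score_trace(safe, task_completed=True))    # (True, True)
print(score_trace(unsafe, task_completed=True))  # (True, False)
```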


Thanks for the clear response, @Zhixin-L. I really like the idea of benchmarks that make agents flag risks before acting; it feels closer to how users actually expect them to behave.

