VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation

Why it matters: The work is expected to affect businesses and society, and it may be picked up by general media.

Read the source (export.arxiv.org)

arXiv:2507.06899v2 Announce Type: replace-cross

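Summary: Graphical user interface (GUI) agents powered by large vision-language models (LVLMs) have emerged as a revolutionary approach to automating human-machine interaction. They can autonomously operate personal devices (e.g., mobile phones) or the applications on them to carry out complex real-world tasks in a human-like way. Their tight integration with personal devices, however, raises serious security concerns, and many threats, backdoor attacks among them, remain largely unexplored. This work shows that visual grounding, the step in which a GUI agent maps a textual plan to GUI elements, can introduce vulnerabilities that enable a new type of backdoor attack. A backdoor that targets visual grounding can compromise the agent's behavior even when the agent is given a correct task-solving plan. To validate this vulnerability, the paper proposes VisualTrap, a method that hijacks grounding by misleading the agent into locating the textual plan at the trigger's position instead of at the intended target. VisualTrap relies on the common technique of injecting poisoned data, and does so during visual grounding pre-training to keep the attack practically feasible. Experiments show that VisualTrap can effectively hijack visual grounding with as little as 5% poisoned data and highly stealthy visual triggers (invisible to the human eye), and that the attack generalizes to downstream tasks even after clean fine-tuning. The injected trigger also remains effective across GUI environments, for example when trained on mobile/web data and applied to desktop environments. These findings underscore the urgent need for further research on backdoor risks in GUI agents.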
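To make the notion of a hijacked grounding step concrete, the sketch below shows one way an evaluator might measure it: for each screenshot, check whether the agent's predicted click lands in the intended target element or in the trigger region. This is an illustrative assumption, not the paper's evaluation code; the names `Box`, `GroundingSample`, and `grounding_rates`, and the choice of a point-in-box hit test, are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Box:
    """Axis-aligned screen region in pixel coordinates (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, point: Point) -> bool:
        x, y = point
        return self.left <= x <= self.right and self.top <= y <= self.bottom


@dataclass(frozen=True)
class GroundingSample:
    """One grounding prediction on a screenshot that contains a trigger."""
    predicted_click: Point  # where the agent decided to act
    target_box: Box         # element the textual plan actually refers to
    trigger_box: Box        # region occupied by the visual trigger


def grounding_rates(samples: List[GroundingSample]) -> Tuple[float, float]:
    """Return (target_hit_rate, trigger_hit_rate) over the given samples.

    A high trigger_hit_rate on triggered inputs would indicate that the
    grounding step has been redirected away from the intended element.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    target_hits = sum(s.target_box.contains(s.predicted_click) for s in samples)
    trigger_hits = sum(s.trigger_box.contains(s.predicted_click) for s in samples)
    return target_hits / len(samples), trigger_hits / len(samples)


if __name__ == "__main__":
    target = Box(100, 100, 200, 140)   # assumed position of a button
    trigger = Box(300, 400, 320, 420)  # assumed position of the trigger patch
    demo = [
        GroundingSample((150, 120), target, trigger),  # lands on the target
        GroundingSample((310, 410), target, trigger),  # lands on the trigger
        GroundingSample((305, 405), target, trigger),  # lands on the trigger
        GroundingSample((318, 419), target, trigger),  # lands on the trigger
    ]
    print(grounding_rates(demo))
```

Running the same measurement on clean screenshots (no trigger present) alongside triggered ones would separate ordinary grounding errors from trigger-induced redirection.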

Original (English)

Title (EN): VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation

arXiv:2507.06899v2 Announce Type: replace-cross
Abstract: Graphical User Interface (GUI) agents powered by Large Vision-Language Models (LVLMs) have emerged as a revolutionary approach to automating human-machine interactions, capable of autonomously operating personal devices (e.g., mobile phones) or applications within the device to perform complex real-world tasks in a human-like manner. However, their close integration with personal devices raises significant security concerns, with many threats, including backdoor attacks, remaining largely unexplored. This work reveals that the visual grounding of GUI agent (mapping textual plans to GUI elements) can introduce vulnerabilities, enabling new types of backdoor attacks. With backdoor attack targeting visual grounding, the agent’s behavior can be compromised even when given correct task-solving plans. To validate this vulnerability, we propose VisualTrap, a method that can hijack the grounding by misleading the agent to locate textual plans to trigger locations instead of the intended targets. VisualTrap uses the common method of injecting poisoned data for attacks, and does so during the pre-training of visual grounding to ensure practical feasibility of attacking. Empirical results show that VisualTrap can effectively hijack visual grounding with as little as 5% poisoned data and highly stealthy visual triggers (invisible to the human eye); and the attack can be generalized to downstream tasks, even after clean fine-tuning. Moreover, the injected trigger can remain effective across different GUI environments, e.g., being trained on mobile/web and generalizing to desktop environments. These findings underscore the urgent need for further research on backdoor attack risks in GUI agents.

Published: 2025-09-24 19:00 UTC

