GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering

Why it matters: This work is expected to affect business and society, and it may also draw coverage in mainstream media.

Source: export.arxiv.org

arXiv:2412.14480v2 Announce Type: replace-cross

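Summary: In Embodied Question Answering (EQA), an agent must explore an unseen environment and build a semantic understanding of it in order to answer situated questions with confidence. The problem remains challenging in robotics because it is difficult to obtain useful semantic representations, to update those representations online, and to leverage prior knowledge for efficient planning and exploration. To address these limitations, we propose GraphEQA, a novel approach that uses real-time 3D metric-semantic scene graphs (3DSGs) and task-relevant images as multimodal memory for grounding a Vision-Language Model (VLM). We adopt a hierarchical planning approach that exploits the hierarchical nature of 3DSGs for structured planning and semantics-guided exploration. We evaluate GraphEQA in simulation on two benchmark datasets, HM-EQA and OpenEQA, and show that it outperforms key baselines, completing EQA tasks with higher success rates and fewer planning steps. We further demonstrate GraphEQA in multiple real-world home and office environments.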
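To make the idea of a hierarchical scene graph used as text memory more concrete, here is a minimal sketch. It is not the GraphEQA implementation: the class `Node`, the functions `serialize`, `rooms_containing`, and `build_example`, and the three layer names ("building", "room", "object") are all hypothetical and chosen only for illustration. It shows how a layered graph can be flattened into indented text (for example, to place in a VLM prompt) and how a lookup can go room by room instead of scanning every object.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One node of a toy hierarchical scene graph (hypothetical structure)."""
    node_id: str
    layer: str                      # assumed layer names: "building", "room", "object"
    label: str                      # semantic label, e.g. "kitchen" or "mug"
    position: Tuple[float, float, float] = (0.0, 0.0, 0.0)  # metric position in metres
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        """Attach a child one layer below and return it for chaining."""
        self.children.append(child)
        return child


def serialize(node: Node, depth: int = 0) -> List[str]:
    """Flatten the hierarchy into indented lines, one node per line."""
    lines = ["  " * depth + f"{node.layer}:{node.label} ({node.node_id})"]
    for child in node.children:
        lines.extend(serialize(child, depth + 1))
    return lines


def rooms_containing(root: Node, object_label: str) -> List[str]:
    """Return labels of rooms that hold an object with the given label."""
    return [
        room.label
        for room in root.children
        if room.layer == "room"
        and any(obj.label == object_label for obj in room.children)
    ]


def build_example() -> Node:
    """Build a tiny two-room graph with made-up contents."""
    home = Node("b0", "building", "home")
    kitchen = home.add(Node("r0", "room", "kitchen", (1.0, 2.0, 0.0)))
    living = home.add(Node("r1", "room", "living room", (5.0, 2.0, 0.0)))
    kitchen.add(Node("o0", "object", "mug", (1.2, 2.1, 0.9)))
    kitchen.add(Node("o1", "object", "sink", (0.8, 2.5, 0.9)))
    living.add(Node("o2", "object", "sofa", (5.5, 1.5, 0.4)))
    return home


if __name__ == "__main__":
    graph = build_example()
    print("\n".join(serialize(graph)))
    print(rooms_containing(graph, "mug"))
```

In a real system the graph would be updated online from sensor data and paired with images; this sketch covers only the static data structure and a single hierarchical query.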

Original text (English)

Title (EN): GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering

arXiv:2412.14480v2 Announce Type: replace-cross
Abstract: In Embodied Question Answering (EQA), agents must explore and develop a semantic understanding of an unseen environment to answer a situated question with confidence. This problem remains challenging in robotics, due to the difficulties in obtaining useful semantic representations, updating these representations online, and leveraging prior world knowledge for efficient planning and exploration. To address these limitations, we propose GraphEQA, a novel approach that utilizes real-time 3D metric-semantic scene graphs (3DSGs) and task relevant images as multi-modal memory for grounding Vision-Language Models (VLMs) to perform EQA tasks in unseen environments. We employ a hierarchical planning approach that exploits the hierarchical nature of 3DSGs for structured planning and semantics-guided exploration. We evaluate GraphEQA in simulation on two benchmark datasets, HM-EQA and OpenEQA, and demonstrate that it outperforms key baselines by completing EQA tasks with higher success rates and fewer planning steps. We further demonstrate GraphEQA in multiple real-world home and office environments.

Published: 2025-09-24 19:00 UTC

