VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model

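Why it matters: Turning human demonstration videos into robot task plans could affect industry and society, and the work may draw coverage in general media.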

Read the source (export.arxiv.org)

arXiv:2410.08792v2 Announce Type: replace-cross

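Summary: Vision Language Models (VLMs) have recently been adopted in robotics for their common-sense reasoning and generalization abilities. Prior work has used VLMs to generate task and motion plans from natural language instructions and to simulate training data for robot learning. This study explores using a VLM to interpret human demonstration videos and generate robot task plans. The method integrates keyframe selection, visual perception, and VLM reasoning into a single pipeline. The authors call it SeeDo because the VLM can "see" a human demonstration and have the robot "do" the corresponding plan. To validate the approach, they collected a set of long-horizon human videos showing pick-and-place tasks in three diverse categories, and designed metrics to benchmark SeeDo comprehensively against several baselines, including state-of-the-art video-input VLMs. The experiments show that SeeDo performs better than these baselines. The generated task plans were also deployed both in a simulation environment and on a real robot arm.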
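The three stages named in the abstract (keyframe selection, visual perception, VLM reasoning) can be pictured as a simple chain of functions. The sketch below is illustrative only: the abstract does not say how each stage is implemented, so every name here (`select_keyframes`, `run_pipeline`, `toy_perceive`, `toy_reason`, `PlanStep`), the motion-threshold heuristic, and the hard-coded scene data are assumptions, not SeeDo's actual code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Observation = Dict[str, str]


@dataclass(frozen=True)
class PlanStep:
    obj: str     # object to pick up
    target: str  # location to place it


def select_keyframes(motion: Sequence[float], threshold: float) -> List[int]:
    """Assumed heuristic: keep the first frame of each low-motion stretch."""
    keyframes = []
    for i, score in enumerate(motion):
        was_moving = i == 0 or motion[i - 1] >= threshold
        if score < threshold and was_moving:
            keyframes.append(i)
    return keyframes


def run_pipeline(
    motion: Sequence[float],
    threshold: float,
    perceive: Callable[[int], Observation],
    reason: Callable[[List[Observation]], List[PlanStep]],
) -> List[PlanStep]:
    """Chain the three stages: keyframes -> per-frame perception -> plan."""
    keyframes = select_keyframes(motion, threshold)
    observations = [perceive(i) for i in keyframes]
    return reason(observations)


# Hypothetical stand-ins for a perception module and a VLM call.
SCENES: Dict[int, Observation] = {
    2: {"object": "red block", "location": "table"},
    5: {"object": "red block", "location": "green bowl"},
}


def toy_perceive(frame_index: int) -> Observation:
    return SCENES[frame_index]


def toy_reason(observations: List[Observation]) -> List[PlanStep]:
    # Treat keyframes as alternating (grasp, release) pairs.
    pairs = zip(observations[0::2], observations[1::2])
    return [PlanStep(obj=g["object"], target=r["location"]) for g, r in pairs]


plan = run_pipeline(
    [0.9, 0.8, 0.1, 0.7, 0.9, 0.05, 0.6], 0.2, toy_perceive, toy_reason
)
print(plan)
```

Passing the perception and reasoning stages in as callables keeps the skeleton independent of any particular detector or VLM, so real components could replace the toy stubs without changing `run_pipeline`.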


Title (EN): VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model

arXiv:2410.08792v2 Announce Type: replace-cross
Abstract: Vision Language Models (VLMs) have recently been adopted in robotics for their capability in common sense reasoning and generalizability. Existing work has applied VLMs to generate task and motion planning from natural language instructions and simulate training data for robot learning. In this work, we explore using VLM to interpret human demonstration videos and generate robot task planning. Our method integrates keyframe selection, visual perception, and VLM reasoning into a pipeline. We named it SeeDo because it enables the VLM to "see" human demonstrations and explain the corresponding plans to the robot for it to "do". To validate our approach, we collected a set of long-horizon human videos demonstrating pick-and-place tasks in three diverse categories and designed a set of metrics to comprehensively benchmark SeeDo against several baselines, including state-of-the-art video-input VLMs. The experiments demonstrate SeeDo's superior performance. We further deployed the generated task plans in both a simulation environment and on a real robot arm.

Published: 2025-09-24 19:00 UTC
