OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation

Why it matters: The work is expected to affect industry and society, and it may be picked up by mainstream media.

Source: export.arxiv.org

arXiv:2509.19480v1 Announce Type: cross

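Summary: Humans can flexibly interpret and combine different goal specifications, such as language instructions, spatial coordinates, and visual references, when navigating to a destination. Most existing robot navigation policies, by contrast, are trained on a single modality, which limits their adaptability to real-world scenarios where different forms of goal specification are natural and complementary. The authors propose a training framework for robotic foundation models that enables omni-modal goal conditioning for vision-based navigation. The approach uses a high-capacity vision-language-action (VLA) backbone and trains on three primary goal modalities (2D poses, egocentric images, and natural language) and their combinations through a randomized modality fusion strategy. This design not only expands the pool of usable datasets but also encourages the policy to develop richer geometric, semantic, and visual representations. The resulting model, OmniVLA, achieves strong generalization to unseen environments, robustness to scarce modalities, and the ability to follow novel natural language instructions. The authors show that OmniVLA outperforms specialist baselines across modalities and offers a flexible foundation for fine-tuning to new modalities and tasks. They believe OmniVLA provides a step toward broadly generalizable and flexible navigation policies, and a scalable path for building omni-modal robotic foundation models. Videos showcasing OmniVLA's performance are available, and the checkpoints and training code will be released on the project page.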

Original (English)

Title (EN): OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation

arXiv:2509.19480v1 Announce Type: cross
Abstract: Humans can flexibly interpret and compose different goal specifications, such as language instructions, spatial coordinates, or visual references, when navigating to a destination. In contrast, most existing robotic navigation policies are trained on a single modality, limiting their adaptability to real-world scenarios where different forms of goal specification are natural and complementary. In this work, we present a training framework for robotic foundation models that enables omni-modal goal conditioning for vision-based navigation. Our approach leverages a high-capacity vision-language-action (VLA) backbone and trains with three primary goal modalities: 2D poses, egocentric images, and natural language, as well as their combinations, through a randomized modality fusion strategy. This design not only expands the pool of usable datasets but also encourages the policy to develop richer geometric, semantic, and visual representations. The resulting model, OmniVLA, achieves strong generalization to unseen environments, robustness to scarce modalities, and the ability to follow novel natural language instructions. We demonstrate that OmniVLA outperforms specialist baselines across modalities and offers a flexible foundation for fine-tuning to new modalities and tasks. We believe OmniVLA provides a step toward broadly generalizable and flexible navigation policies, and a scalable path for building omni-modal robotic foundation models. We present videos showcasing OmniVLA performance and will release its checkpoints and training code on our project page.
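
The abstract names a randomized modality fusion strategy but does not describe how it works. The following is a minimal sketch of one way such a scheme could look, not the authors' implementation. It assumes each training sample carries some subset of the three goal modalities, and that a random non-empty subset of those is kept per training step while the rest are masked out. The names `MODALITIES`, `sample_goal_modalities`, and `build_goal_condition` are hypothetical.

```python
import random
from itertools import combinations

# Hypothetical keys for the three goal modalities named in the abstract.
MODALITIES = ("pose_2d", "egocentric_image", "language")


def sample_goal_modalities(sample, rng):
    """Pick a random non-empty subset of the goal modalities present in `sample`.

    Assumption: every non-empty combination is equally likely. The paper may
    weight combinations differently.
    """
    present = [m for m in MODALITIES if sample.get(m) is not None]
    if not present:
        raise ValueError("sample has no goal modality")
    subsets = [
        combo
        for size in range(1, len(present) + 1)
        for combo in combinations(present, size)
    ]
    return rng.choice(subsets)


def build_goal_condition(sample, rng):
    """Return (goal, mask): goal keeps only the chosen modalities, mask flags them."""
    chosen = sample_goal_modalities(sample, rng)
    goal = {m: (sample[m] if m in chosen else None) for m in MODALITIES}
    mask = {m: (m in chosen) for m in MODALITIES}
    return goal, mask


# A sample from a dataset that has poses and language but no goal image.
rng = random.Random(0)
sample = {"pose_2d": (1.5, -0.3), "language": "go to the blue door"}
goal, mask = build_goal_condition(sample, rng)
```

Because only the modalities actually present are sampled, a dataset that lacks one modality can still contribute training samples under this scheme, which is consistent with the abstract's claim that the design expands the pool of usable datasets.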

Published: 2025-09-24 19:00 UTC
