Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving

Why it matters: The work is expected to affect industry and society, and may also draw attention from general media.

Source: export.arxiv.org

arXiv:2509.20109v1 Announce Type: cross

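Summary: End-to-End (E2E) solutions have become a mainstream approach for autonomous driving systems, and Vision-Language-Action (VLA) models represent a new paradigm that uses pre-trained multimodal knowledge from Vision-Language Models (VLMs) to interpret and interact with complex real-world environments. These methods are nevertheless constrained by the limits of imitation learning, which struggles to encode physical rules inherently during training. Existing approaches often rely on complex rule-based post-refinement, use reinforcement learning that remains largely confined to simulation, or use diffusion guidance that requires computationally expensive gradient calculations. To address these challenges, the paper introduces ReflectDrive, a new learning-based framework that integrates a reflection mechanism for safe trajectory generation via discrete diffusion. The two-dimensional driving space is first discretized to build an action codebook, which allows a pre-trained Diffusion Language Model to be fine-tuned for planning tasks. At the core of the method is a safety-aware reflection mechanism that performs iterative self-correction without gradient computation. The method starts with goal-conditioned trajectory generation to model multi-modal driving behaviors. On top of this, local search is used to identify unsafe tokens and determine feasible solutions, which then serve as safe anchors for inpainting-based regeneration. Evaluated on the NAVSIM benchmark, ReflectDrive shows significant advantages in safety-critical trajectory generation and offers a scalable and reliable solution for autonomous driving systems.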
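The discretization step can be illustrated with a minimal sketch. The abstract does not say how the action codebook is laid out, so everything below is an assumption made for illustration: a uniform grid over the driving plane, one token per cell, and the hypothetical names `GridCodebook`, `encode`, and `decode`. The grid bounds and cell size are arbitrary placeholders, not values from the paper.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class GridCodebook:
    """Uniform grid over a 2-D driving plane (hypothetical layout, assumed bounds)."""
    x_min: float = 0.0
    x_max: float = 64.0
    y_min: float = -32.0
    y_max: float = 32.0
    cell: float = 0.5  # cell edge length in metres (assumed)

    @property
    def cols(self) -> int:
        return int(round((self.x_max - self.x_min) / self.cell))

    @property
    def rows(self) -> int:
        return int(round((self.y_max - self.y_min) / self.cell))

    @property
    def size(self) -> int:
        # Number of distinct action tokens in the codebook.
        return self.rows * self.cols

    def encode(self, x: float, y: float) -> int:
        """Map a continuous waypoint to a token id, clamping to the grid."""
        col = min(max(int((x - self.x_min) // self.cell), 0), self.cols - 1)
        row = min(max(int((y - self.y_min) // self.cell), 0), self.rows - 1)
        return row * self.cols + col

    def decode(self, token: int) -> Tuple[float, float]:
        """Map a token id back to the centre of its grid cell."""
        row, col = divmod(token, self.cols)
        return (self.x_min + (col + 0.5) * self.cell,
                self.y_min + (row + 0.5) * self.cell)


codebook = GridCodebook()
token = codebook.encode(0.2, 0.1)
waypoint = codebook.decode(token)
```

With a layout like this, a trajectory becomes a short sequence of token ids, which is the kind of input a token-based diffusion language model can be fine-tuned on. The round trip through `encode` and `decode` loses at most half a cell per axis for points inside the grid.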

Original text (English)

Title (EN): Discrete Diffusion for Reflective Vision-Language-Action Models in Autonomous Driving

arXiv:2509.20109v1 Announce Type: cross
Abstract: End-to-End (E2E) solutions have emerged as a mainstream approach for autonomous driving systems, with Vision-Language-Action (VLA) models representing a new paradigm that leverages pre-trained multimodal knowledge from Vision-Language Models (VLMs) to interpret and interact with complex real-world environments. However, these methods remain constrained by the limitations of imitation learning, which struggles to inherently encode physical rules during training. Existing approaches often rely on complex rule-based post-refinement, employ reinforcement learning that remains largely limited to simulation, or utilize diffusion guidance that requires computationally expensive gradient calculations. To address these challenges, we introduce ReflectDrive, a novel learning-based framework that integrates a reflection mechanism for safe trajectory generation via discrete diffusion. We first discretize the two-dimensional driving space to construct an action codebook, enabling the use of pre-trained Diffusion Language Models for planning tasks through fine-tuning. Central to our approach is a safety-aware reflection mechanism that performs iterative self-correction without gradient computation. Our method begins with goal-conditioned trajectory generation to model multi-modal driving behaviors. Based on this, we apply local search methods to identify unsafe tokens and determine feasible solutions, which then serve as safe anchors for inpainting-based regeneration. Evaluated on the NAVSIM benchmark, ReflectDrive demonstrates significant advantages in safety-critical trajectory generation, offering a scalable and reliable solution for autonomous driving systems.
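The reflection mechanism described in the abstract (find unsafe tokens, local search for a feasible replacement, regenerate around safe anchors, no gradients) can be sketched as a toy loop. This is not the authors' implementation: the function names are hypothetical, the trajectory is a list of integer grid cells, the safety check is a simple obstacle lookup, and `interpolate_inpaint` is a stand-in that interpolates between anchors where the real system would call a fine-tuned discrete diffusion model.

```python
def local_search(cell, is_safe, radius=3):
    """Return the nearest safe cell within `radius`, or None (gradient-free)."""
    x, y = cell
    offsets = [(dx, dy)
               for dx in range(-radius, radius + 1)
               for dy in range(-radius, radius + 1)
               if (dx, dy) != (0, 0)]
    # Closest first; ties prefer lateral (y) shifts over longitudinal ones.
    offsets.sort(key=lambda d: (d[0] ** 2 + d[1] ** 2, abs(d[0]), d[1]))
    for dx, dy in offsets:
        candidate = (x + dx, y + dy)
        if is_safe(candidate):
            return candidate
    return None


def interpolate_inpaint(traj, anchors):
    """Stand-in for inpainting: keep anchors fixed, refill every other step.

    A real system would ask the diffusion model to regenerate the masked
    tokens; linear interpolation is used here only to keep the toy runnable.
    """
    keys = sorted(anchors)
    out = list(traj)
    for i in range(len(traj)):
        if i in anchors:
            out[i] = anchors[i]
            continue
        lo = max(k for k in keys if k < i)
        hi = min(k for k in keys if k > i)
        t = (i - lo) / (hi - lo)
        (x0, y0), (x1, y1) = anchors[lo], anchors[hi]
        out[i] = (round(x0 + (x1 - x0) * t), round(y0 + (y1 - y0) * t))
    return out


def reflect(traj, is_safe, regenerate, max_iters=5, radius=3):
    """Iteratively correct a trajectory until every step passes `is_safe`."""
    # Start and goal are always anchored (goal-conditioned generation).
    anchors = {0: traj[0], len(traj) - 1: traj[-1]}
    for _ in range(max_iters):
        unsafe = [i for i, cell in enumerate(traj) if not is_safe(cell)]
        if not unsafe:
            break
        for i in unsafe:
            fix = local_search(traj[i], is_safe, radius)
            if fix is not None:
                anchors[i] = fix  # safe anchor for the next regeneration
        traj = regenerate(traj, anchors)
    return traj


# Toy scene: a straight path that runs through one blocked cell.
obstacles = {(3, 0)}
is_safe = lambda cell: cell not in obstacles
initial = [(x, 0) for x in range(7)]
corrected = reflect(initial, is_safe, interpolate_inpaint)
```

In the toy scene, the step that lands on the blocked cell is moved sideways by the local search, and the neighbouring steps are refilled so the path bends around the obstacle while the start and goal stay fixed. The loop only evaluates the safety check and never differentiates through it, which is the property the abstract contrasts with gradient-based diffusion guidance.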

Published: 2025-09-24 19:00 UTC

