SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model
Why it matters: Impact on industry and society is anticipated, with potential coverage in mainstream media.
arXiv:2506.17873v2 focuses on improving surgical scene understanding, which is essential for surgical training and robotic decision-making in robot-assisted surgery. The work proposes SurgVidLM, a video language model trained on SVU-31K, a large-scale dataset containing over 31K video-instruction pairs that enables both comprehensive understanding and detailed analysis of surgical procedures. SurgVidLM adopts a two-stage StageFocus mechanism together with Multi-frequency Fusion Attention, combining a global grasp of the overall procedure with fine-grained local analysis guided by temporal cues, so that the detailed execution of specific tasks within a surgery is captured. Experimental results show that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs of comparable parameter scale in both full and fine-grained video understanding. The code and dataset are to be released soon.
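For intuition, here is a minimal Python sketch of the two-stage StageFocus flow described above. Every object and method name in it (encoder.encode, llm.localize, the sampling rates) is hypothetical, since the paper's code is not yet released; only the control flow — a sparse global pass, a temporal cue, then a dense local pass — comes from the abstract.

    from dataclasses import dataclass

    @dataclass
    class TimeWindow:
        start_s: float
        end_s: float

    class StageFocusPipeline:
        """Hypothetical two-stage flow: stage 1 reads sparse (low-frequency)
        frames for global procedural context and a temporal cue; stage 2
        re-samples the cued segment densely (high-frequency) for detail."""

        def __init__(self, encoder, llm):
            self.encoder = encoder  # visual encoder: video -> token sequence
            self.llm = llm          # language model over tokens and text

        def answer(self, video_path: str, question: str) -> str:
            # Stage 1: sparse sampling captures the overall procedure.
            low_freq = self.encoder.encode(video_path, fps=0.5)
            # The model grounds the question to a temporal cue (a window).
            cue: TimeWindow = self.llm.localize(question, low_freq)
            # Stage 2: dense sampling restricted to the cued segment.
            high_freq = self.encoder.encode(
                video_path, fps=4.0, window=(cue.start_s, cue.end_s))
            # Fuse both streams (see the attention sketch further below).
            fused = self.encoder.fuse(high_freq, low_freq)
            return self.llm.generate(question, fused)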
Original (English):
Title (EN): SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model
arXiv:2506.17873v2 Announce Type: replace-cross
Abstract: Surgical scene understanding is critical for surgical training and robotic decision-making in robot-assisted surgery. Recent advances in Multimodal Large Language Models (MLLMs) have demonstrated great potential for advancing scene perception in the medical domain, helping surgeons understand surgical scenes and procedures. However, these methods are primarily oriented towards image-based analysis or global video understanding, overlooking the fine-grained video reasoning that is crucial for analyzing specific processes and capturing detailed task execution within a surgical procedure. To bridge this gap, we propose SurgVidLM, the first video language model designed to address both full and fine-grained surgical video comprehension. To train SurgVidLM, we construct SVU-31K, a large-scale dataset with over 31K video-instruction pairs, enabling both holistic understanding and detailed analysis of surgical procedures. Building on this resource, SurgVidLM incorporates a two-stage StageFocus mechanism: the first stage extracts global procedural context, while the second stage performs high-frequency local analysis guided by temporal cues. We also develop Multi-frequency Fusion Attention to effectively integrate low- and high-frequency visual tokens, ensuring the preservation of critical task-specific details. Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs of comparable parameter scale in both full and fine-grained video understanding tasks, showcasing its superior capability in capturing the context of complex robot-assisted surgeries. Our code and dataset will be publicly accessible soon.
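The Multi-frequency Fusion Attention named in the abstract integrates low- and high-frequency visual tokens. As a rough, non-authoritative sketch of how such a fusion could be wired — assuming cross-attention from the high-frequency stream to the low-frequency one, a learned gate, and a residual connection; the class name, shapes, and gating rule are all assumptions, not the authors' implementation:

    import torch
    import torch.nn as nn

    class MultiFrequencyFusionAttention(nn.Module):
        """Sketch only: high-frequency (densely sampled, fine-grained)
        tokens query low-frequency (sparse, global-context) tokens,
        and the two streams are merged with a learned gate."""

        def __init__(self, dim: int, num_heads: int = 8):
            super().__init__()
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.gate = nn.Linear(2 * dim, dim)
            self.norm = nn.LayerNorm(dim)

        def forward(self, high_freq: torch.Tensor, low_freq: torch.Tensor) -> torch.Tensor:
            # high_freq: (B, T_hi, dim) tokens from the dense, local pass
            # low_freq:  (B, T_lo, dim) tokens from the sparse, global pass
            context, _ = self.cross_attn(high_freq, low_freq, low_freq)
            fused = self.gate(torch.cat([high_freq, context], dim=-1))
            # The residual keeps task-specific local detail intact.
            return self.norm(high_freq + fused)

    # Toy usage: 64 dense tokens absorb context from 16 sparse tokens.
    mffa = MultiFrequencyFusionAttention(dim=256)
    out = mffa(torch.randn(2, 64, 256), torch.randn(2, 16, 256))
    print(out.shape)  # torch.Size([2, 64, 256])

Gated residual fusion is a common way to inject global context without washing out local tokens; the paper may well use a different scheme.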
Published: 2025-09-24 19:00 UTC