Towards Visual Text Grounding of Multimodal Large Language Models

Why it matters: The work is expected to affect businesses and society, and may spread to mainstream media coverage.

Read the source (export.arxiv.org)

arXiv:2504.04974v2 Announce type: replace

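Summary: Despite the progress of Multimodal Large Language Models (MLLMs), non-negligible limitations remain in visual text grounding, especially on text-rich document images. Document images such as scanned forms and infographics expose major challenges because of their complex layouts and textual content. Current benchmarks do not adequately address these challenges, because they focus mainly on visual grounding in natural images rather than in text-rich document images. To close this gap, we propose TRIG, a new task with a newly designed instruction dataset for benchmarking and improving the text-rich image grounding ability of MLLMs in document question answering. Specifically, we propose an OCR-LLM-human interaction pipeline that produces 800 human-annotated question-answer pairs as a benchmark and a large-scale training set of 90,000 synthetic examples based on four diverse datasets. A comprehensive evaluation of various MLLMs on the proposed benchmark reveals substantial limitations in their grounding ability on text-rich images. We also propose two simple and effective TRIG methods, one based on general instruction tuning and one based on plug-and-play efficient embedding. Fine-tuning MLLMs on the synthetic dataset yields promising improvements in spatial reasoning and grounding ability.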


Title (EN): Towards Visual Text Grounding of Multimodal Large Language Model

arXiv:2504.04974v2 Announce Type: replace-cross
Abstract: Despite the existing evolution of Multimodal Large Language Models (MLLMs), a non-neglectable limitation remains in their struggle with visual text grounding, especially in text-rich images of documents. Document images, such as scanned forms and infographics, highlight critical challenges due to their complex layouts and textual content. However, current benchmarks do not fully address these challenges, as they mostly focus on visual grounding on natural images, rather than text-rich document images. Thus, to bridge this gap, we introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking and improving the Text-Rich Image Grounding capabilities of MLLMs in document question-answering. Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark and a large-scale training set of 90k synthetic data based on four diverse datasets. A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images. In addition, we propose two simple and effective TRIG methods based on general instruction tuning and plug-and-play efficient embedding, respectively. By finetuning MLLMs on our synthetic dataset, they promisingly improve spatial reasoning and grounding capabilities.

Published: 2025-09-24 19:00 UTC

