Z-Scores: A Linguistic Evaluation Metric for Disfluency Removal

Why it matters: The work may affect businesses and society, and it could draw coverage in mainstream media.

Read the source (export.arxiv.org)

arXiv:2509.20319v1 Announce Type: cross

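Summary: Evaluating disfluency removal in speech takes more than aggregate token-level scores. Conventional word-based metrics such as precision, recall, and F1 (E-Scores) capture overall performance but cannot explain why a model succeeds or fails. The paper proposes Z-Scores, a span-level, linguistically grounded evaluation metric that classifies system behavior across distinct disfluency types (EDITED, INTJ, PRN). A deterministic alignment module provides a robust mapping between the generated text and the disfluent transcripts, which lets Z-Scores surface systematic weaknesses that word-level metrics obscure. Because Z-Scores give category-specific diagnostics, researchers can pinpoint a model's failure modes and design targeted interventions, such as tailored prompts or data augmentation, that yield measurable performance gains. A case study with large language models shows that Z-Scores expose difficulties with INTJ and PRN disfluencies that aggregate F1 hides, directly informing model refinement strategies.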
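
For reference, here is a minimal sketch of word-level E-Scores (precision, recall, and F1 over removed tokens). It assumes the gold annotation marks each transcript token as disfluent or not. To align the system output with the disfluent transcript, it uses Python's `difflib.SequenceMatcher`, a generic deterministic aligner swapped in for illustration. It is not the paper's alignment module. The function names `removed_indices` and `e_scores` and the example sentence are hypothetical.

```python
from difflib import SequenceMatcher


def removed_indices(transcript, output):
    """Indices of transcript tokens that do not survive in the output.

    difflib's longest-matching-block alignment is deterministic; it is a
    generic stand-in here, not the paper's alignment module.
    """
    matcher = SequenceMatcher(a=transcript, b=output, autojunk=False)
    kept = set()
    for block in matcher.get_matching_blocks():
        kept.update(range(block.a, block.a + block.size))
    return set(range(len(transcript))) - kept


def e_scores(transcript, gold_disfluent, output):
    """Word-level precision, recall, and F1 over deleted tokens."""
    removed = removed_indices(transcript, output)
    tp = len(removed & gold_disfluent)
    precision = tp / len(removed) if removed else 0.0
    recall = tp / len(gold_disfluent) if gold_disfluent else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical example: gold marks "i want" (reparandum) and "uh" as disfluent.
transcript = "i want uh i need a flight to boston".split()
gold = {0, 1, 2}
output = "i want i need a flight to boston".split()  # the model removed only "uh"

p, r, f = e_scores(transcript, gold, output)
print(round(p, 3), round(r, 3), round(f, 3))  # 1.0 0.333 0.5
```

In this example the model drops only the interjection "uh". Precision is therefore 1.0, recall is about 0.33, and F1 is 0.5. The score alone does not reveal which kind of disfluency was missed.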


Title (EN): Z-Scores: A Metric for Linguistically Assessing Disfluency Removal

arXiv:2509.20319v1 Announce Type: cross
Abstract: Evaluating disfluency removal in speech requires more than aggregate token-level scores. Traditional word-based metrics such as precision, recall, and F1 (E-Scores) capture overall performance but cannot reveal why models succeed or fail. We introduce Z-Scores, a span-level linguistically-grounded evaluation metric that categorizes system behavior across distinct disfluency types (EDITED, INTJ, PRN). Our deterministic alignment module enables robust mapping between generated text and disfluent transcripts, allowing Z-Scores to expose systematic weaknesses that word-level metrics obscure. By providing category-specific diagnostics, Z-Scores enable researchers to identify model failure modes and design targeted interventions — such as tailored prompts or data augmentation — yielding measurable performance improvements. A case study with LLMs shows that Z-Scores uncover challenges with INTJ and PRN disfluencies hidden in aggregate F1, directly informing model refinement strategies.
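
The following hypothetical sketch illustrates the kind of category-level view described above. For each disfluency type, it counts how many gold spans a system removed fully, removed partially, or left untouched. The span format `(category, start, end)`, the outcome names, and the example labels are assumptions made for illustration. This is not the paper's Z-Score definition. The set of removed transcript indices is taken as given, for instance from an alignment step.

```python
from collections import Counter, defaultdict


def span_outcome(start, end, removed):
    """Classify one gold span [start, end) by how many of its tokens were deleted."""
    hits = sum(1 for i in range(start, end) if i in removed)
    if hits == end - start:
        return "removed"
    if hits == 0:
        return "missed"
    return "partial"


def category_breakdown(spans, removed):
    """Count span outcomes per disfluency category (hypothetical scheme)."""
    table = defaultdict(Counter)
    for category, start, end in spans:
        table[category][span_outcome(start, end, removed)] += 1
    return {category: dict(counts) for category, counts in table.items()}


# Hypothetical transcript: "uh the flight i mean the train leaves at noon you know"
gold_spans = [
    ("INTJ", 0, 1),    # "uh"
    ("EDITED", 1, 3),  # "the flight" (reparandum)
    ("PRN", 3, 5),     # "i mean"
    ("PRN", 10, 12),   # "you know"
]
removed = {0, 1, 2, 3, 4}  # the system kept "you know"

print(category_breakdown(gold_spans, removed))
# {'INTJ': {'removed': 1}, 'EDITED': {'removed': 1}, 'PRN': {'removed': 1, 'missed': 1}}
```

Here word-level recall would be 5/7 (about 0.71) with perfect precision. The breakdown, by contrast, shows that the one missed span is a PRN.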

Published: 2025-09-24 19:00 UTC
