Integrated Framework for LLM Evaluation with Answer Generation

Why it matters: The work is expected to affect businesses and society, and may be picked up by mainstream media.

Source: export.arxiv.org

arXiv:2509.20097v1 Announce Type: cross

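Summary: Reliable evaluation of large language models is essential for ensuring that they can be applied in practical scenarios. Traditional benchmark-based evaluation methods often rely on fixed reference answers, which limits their ability to capture important qualitative aspects of generated responses. To overcome these shortcomings, the authors propose SPEED (self-refining descriptive evaluation with expert-driven diagnostics), an integrated evaluation framework that uses specialized functional experts to produce comprehensive, descriptive analyses of model outputs. Unlike conventional approaches, SPEED actively incorporates expert feedback across multiple dimensions, including hallucination detection, toxicity assessment, and lexical-contextual appropriateness. The experimental results indicate that SPEED achieves robust and consistent evaluation performance across a range of domains and datasets. In addition, because it employs relatively compact expert models, SPEED is more resource-efficient than larger-scale evaluators. According to the authors, these findings show that SPEED significantly improves fairness and interpretability in LLM evaluation and is a promising alternative to existing evaluation methods.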


Title (EN): Integrated Framework for LLM Evaluation with Answer Generation

arXiv:2509.20097v1 Announce Type: cross
Abstract: Reliable evaluation of large language models is essential to ensure their applicability in practical scenarios. Traditional benchmark-based evaluation methods often rely on fixed reference answers, limiting their ability to capture important qualitative aspects of generated responses. To address these shortcomings, we propose an integrated evaluation framework called self-refining descriptive evaluation with expert-driven diagnostics (SPEED), which utilizes specialized functional experts to perform comprehensive, descriptive analyses of model outputs. Unlike conventional approaches, SPEED actively incorporates expert feedback across multiple dimensions, including hallucination detection, toxicity assessment, and lexical-contextual appropriateness. Experimental results demonstrate that SPEED achieves robust and consistent evaluation performance across diverse domains and datasets. Additionally, by employing relatively compact expert models, SPEED demonstrates superior resource efficiency compared to larger-scale evaluators. These findings illustrate that SPEED significantly enhances fairness and interpretability in LLM evaluations, offering a promising alternative to existing evaluation methodologies.
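
The abstract describes several dimension-specific experts whose feedback is combined into a descriptive assessment. The snippet below is only a rough sketch of that general idea, not the authors' implementation: the names `Diagnosis`, `toxicity_expert`, `relevance_expert`, `evaluate`, and `render_report` are hypothetical, the 0.0 to 1.0 score scale is an assumption, and the keyword heuristics are toy stand-ins for the compact expert models the paper mentions. Hallucination detection and the self-refining step are left out because the abstract does not say how they work.

```python
from dataclasses import dataclass
from typing import Callable, List, Set


@dataclass
class Diagnosis:
    dimension: str  # which aspect this expert covers
    score: float    # assumed scale: 0.0 (worst) to 1.0 (best)
    comment: str    # descriptive feedback, not just a number


# Hypothetical interface: an expert is any callable (question, answer) -> Diagnosis.
Expert = Callable[[str, str], Diagnosis]

BLOCKLIST = {"idiot", "stupid"}  # toy word list, for illustration only


def _tokens(text: str) -> Set[str]:
    """Lowercase the text and strip basic punctuation from each word."""
    return {w.strip(".,!?").lower() for w in text.split()}


def toxicity_expert(question: str, answer: str) -> Diagnosis:
    # Toy stand-in for a toxicity model: flag answers containing blocklisted words.
    hits = sorted(_tokens(answer) & BLOCKLIST)
    if hits:
        return Diagnosis("toxicity", 0.0, f"flagged terms: {', '.join(hits)}")
    return Diagnosis("toxicity", 1.0, "no flagged terms")


def relevance_expert(question: str, answer: str) -> Diagnosis:
    # Crude lexical overlap as a stand-in for lexical-contextual appropriateness.
    q, a = _tokens(question), _tokens(answer)
    overlap = len(q & a) / len(q) if q else 0.0
    return Diagnosis("lexical-contextual", overlap,
                     f"{overlap:.0%} of question terms appear in the answer")


def evaluate(question: str, answer: str, experts: List[Expert]) -> List[Diagnosis]:
    """Run every expert on the same (question, answer) pair."""
    return [expert(question, answer) for expert in experts]


def render_report(diagnoses: List[Diagnosis]) -> str:
    """Combine per-dimension feedback into one readable report."""
    return "\n".join(f"[{d.dimension}] {d.score:.2f}: {d.comment}" for d in diagnoses)


if __name__ == "__main__":
    results = evaluate(
        "What is the capital of France?",
        "The capital of France is Paris.",
        [toxicity_expert, relevance_expert],
    )
    print(render_report(results))
```

Keeping each expert behind the same small interface means a dimension can be added or swapped without touching the aggregation code, which is one plausible way to read the "specialized functional experts" in the abstract.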

Published: 2025-09-24 19:00 UTC
