Why Do Some Inputs Break Low-Bit LLM Quantization?

Why it matters: The findings are expected to affect industry and society, and may be picked up by general media.

Read the source (export.arxiv.org)

arXiv:2506.12044v2 Announcement type: replacement, cross-listed

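Summary: Low-bit weight-only quantization significantly reduces the memory footprint of large language models (LLMs), but it affects certain examples disproportionately. Analyzing diverse 3-4 bit methods on LLMs ranging from 7B to 70B parameters, the study finds that the quantization errors of 50 pairs of methods are strongly correlated (average 0.82) on FineWeb examples. In addition, the residual stream magnitudes of the full-precision models are indicative of future quantization errors. The study puts forward a hypothesis that relates residual stream magnitudes to error amplification and accumulation across layers. Using LLM localization techniques, early exiting, and activation patching, it shows that examples with large errors rely on precise residual activations in the late layers, and that the outputs of MLP gates play a crucial role in maintaining perplexity. The work explains why certain examples produce large quantization errors and identifies which model components matter most for preserving performance.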

Original (English)

Title (EN): Why Do Some Inputs Break Low-Bit LLM Quantization?

arXiv:2506.12044v2 Announce Type: replace-cross
Abstract: Low-bit weight-only quantization significantly reduces the memory footprint of large language models (LLMs), but disproportionately affects certain examples. We analyze diverse 3-4 bit methods on LLMs ranging from 7B-70B in size and find that the quantization errors of 50 pairs of methods are strongly correlated (avg. 0.82) on FineWeb examples. Moreover, the residual stream magnitudes of full-precision models are indicative of future quantization errors. We further establish a hypothesis that relates the residual stream magnitudes to error amplification and accumulation over layers. Using LLM localization techniques, early exiting, and activation patching, we show that examples with large errors rely on precise residual activations in the late layers, and that the outputs of MLP gates play a crucial role in maintaining the perplexity. Our work reveals why certain examples result in large quantization errors and which model components are most critical for performance preservation.
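The correlation result in the abstract can be illustrated with a small sketch. The abstract does not say how per-example quantization error is computed, so this example assumes it is the increase in negative log-likelihood (NLL) of the quantized model over the full-precision model on the same example, and it assumes a Pearson correlation between two methods. The function names and all numbers below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math


def quantization_errors(nll_full, nll_quant):
    """Per-example error, assumed here to be the NLL increase caused by quantization."""
    return [q - f for f, q in zip(nll_full, nll_quant)]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-example NLLs (made-up values, not results from the paper).
nll_full = [2.10, 2.45, 1.90, 3.00, 2.30]      # full-precision model
nll_method_a = [2.15, 2.80, 1.93, 3.60, 2.36]  # quantization method A
nll_method_b = [2.18, 2.70, 1.95, 3.75, 2.35]  # quantization method B

err_a = quantization_errors(nll_full, nll_method_a)
err_b = quantization_errors(nll_full, nll_method_b)

# A high value means both methods struggle on the same examples.
corr = pearson(err_a, err_b)
print(f"correlation between methods A and B: {corr:.2f}")
```

Under these assumptions, repeating the computation for every pair of methods and averaging the coefficients would give a figure comparable in kind to the reported average of 0.82.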

Published: 2025-09-24 19:00 UTC

