Future Policy Aware Preference Learning for Mathematical Reasoning

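Why it matters: Preference learning methods such as DPO are standard for LLM post-training but often fail on mathematical reasoning. FPA is a low-overhead change to the regularization term that the authors report improves results on MATH and GSM8K.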

Read the source (export.arxiv.org)

arXiv:2509.19893v1 Announce Type: new

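Summary: Preference learning methods such as Direct Preference Optimization (DPO) have become standard for post-training large language models (LLMs), but they are often ineffective for mathematical reasoning. The main difficulty is that preferred and dispreferred trajectories share many tokens. Lowering the probability of a dispreferred trajectory also lowers the probability of the useful tokens it shares with the preferred one, which leads to over-penalization and an overall collapse in performance. To mitigate this, existing algorithms include the probability of the trajectory under the current policy as a regularization term, which reduces the effect of the gradient when that probability is low. By the time this takes effect, however, the model has already begun to degrade and useful tokens may already have been over-penalized. The paper proposes Future Policy Aware (FPA) preference learning, which replaces the current policy in the regularization term with a future policy. The future policy is estimated by a lightweight logit-space extrapolation from the reference model toward the current model. By regularizing potentially problematic gradients preemptively, FPA makes training safer. The authors apply FPA to DPO, RPO, and SimPER and evaluate on the MATH and GSM8K benchmarks. FPA yields consistent gains, the largest with SimPER at up to 5.75%. They report that FPA provides proactive regularization while preserving the probability of shared, useful mathematical tokens, and enables longer training without degradation at negligible computational overhead. The authors state that the code will be released upon publication.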
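As a rough illustration of the logit-space extrapolation mentioned in the summary, the Python sketch below pushes next-token logits from a reference model through the current model to guess where the policy is heading. The update rule `current + lam * (current - reference)`, the function names, and the step size `lam` are assumptions made for this example; the abstract does not give the exact formula, so this should not be read as the authors' implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def extrapolate_logits(ref_logits, cur_logits, lam=1.0):
    """Hypothetical future-policy logits: continue the reference -> current
    direction by a factor lam (an assumed rule, not taken from the paper)."""
    return [c + lam * (c - r) for r, c in zip(ref_logits, cur_logits)]


# Toy next-token logits over a 3-token vocabulary.
ref = [2.0, 1.0, 0.0]  # reference model
cur = [1.5, 1.0, 0.5]  # current model: token 0 is being pushed down
fut = extrapolate_logits(ref, cur, lam=1.0)  # [1.0, 1.0, 1.0]

# Probability of token 0 under each policy.
p_ref = math.exp(log_softmax(ref)[0])
p_cur = math.exp(log_softmax(cur)[0])
p_fut = math.exp(log_softmax(fut)[0])
print(p_ref, p_cur, p_fut)
```

In this toy case the probability of token 0 falls from the reference to the current policy, and the extrapolated policy predicts it will fall further. A regularizer that scales the gradient by the extrapolated probability instead of the current one would therefore start damping the update earlier, which is the preemptive behavior the summary describes.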


Title (EN): Future Policy Aware Preference Learning for Mathematical Reasoning

Abstract: Preference learning methods such as Direct Preference Optimization (DPO) have become standard for Large Language Model (LLM) post-training, yet they are often ineffective for mathematical reasoning. A key challenge is the large token overlap between preferred and dispreferred trajectories; lowering the probability of dispreferred trajectories also reduces the probability of shared useful tokens, leading to over-penalization and overall performance collapse. As a mitigation, existing algorithms include the probability of a trajectory under the current policy as a regularization term, which decreases the effect of the gradient when the probability is low. However, by the time this effect takes hold, useful tokens may have already been over-penalized as the model has begun to degrade. To address this, we propose Future Policy Aware (FPA) preference learning, which replaces the current policy with a future policy in the regularization term. This future policy is estimated via lightweight, logit-space extrapolation from a reference model toward the current model. FPA enables safer training by preemptively regularizing potentially problematic gradients. We apply FPA to DPO, RPO, and SimPER and evaluate them on the MATH and GSM8K benchmarks. FPA yields consistent performance gains, with the largest improvements observed with SimPER, achieving gains of up to 5.75%. We demonstrate that FPA provides proactive regularization while preserving the probability of shared, useful mathematical tokens, and enables longer, degradation-free training with negligible computational overhead. We will release our code publicly upon publication.
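The token-overlap problem in the abstract can be made concrete with a toy pair of solutions. The sketch below counts how many leading tokens a preferred and a dispreferred solution share. The example strings, the whitespace tokenization, and the helper name `shared_prefix_len` are all made up for illustration and do not come from the paper.

```python
def shared_prefix_len(a, b):
    """Number of leading tokens that two token lists have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Hypothetical solutions that differ only in the final answer token.
preferred = "2 + 3 = 5 , 5 * 4 = 20".split()
dispreferred = "2 + 3 = 5 , 5 * 4 = 25".split()

shared = shared_prefix_len(preferred, dispreferred)
overlap = shared / len(dispreferred)
print(f"{shared} of {len(dispreferred)} tokens shared ({overlap:.0%})")
```

Because a trajectory's log-probability is the sum of its token log-probabilities, an update that lowers the whole dispreferred trajectory also pushes down these shared tokens, even though they are equally part of the preferred solution.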

Published: 2025-09-24 19:00 UTC
