Diffusion Classifier-Driven Reward for Offline Preference-based Reinforcement Learning

Why it matters: This work is expected to affect industry and society, and it may also be picked up by general-interest media.

Source: export.arxiv.org

arXiv:2503.01143v3 Announce Type: replace

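Summary: Offline preference-based reinforcement learning (PbRL) reduces the need to hand-design rewards: it aligns agents with human preferences through preference-driven reward feedback, without interacting with the environment. However, trajectory-wise preference labels do not readily support precise learning of step-wise rewards, which hurts the performance of downstream algorithms. To mitigate the weak step-wise reward signal that trajectory-wise preferences provide, the authors propose a new preference-based reward acquisition method, Diffusion Preference-based Reward (DPR). DPR treats step-wise preference-based reward acquisition directly as binary classification and exploits the robustness of diffusion classifiers to infer step-wise rewards discriminatively. To make fuller use of trajectory-wise preference information, they also propose Conditional Diffusion Preference-based Reward (C-DPR), which conditions on trajectory-wise preference labels to strengthen reward inference. Both methods are applied to existing offline RL algorithms, and a series of experiments shows that diffusion classifier-driven rewards outperform the previous reward acquisition approach based on the Bradley-Terry model.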
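To make the idea of inferring a step-wise reward with a diffusion classifier more concrete, here is a minimal sketch of the generic diffusion-classifier recipe: compare class-conditional denoising errors and turn them into a class posterior. It is not the paper's implementation. The names `CLASS_MEANS`, `alpha_bar`, `toy_eps_model`, and `diffusion_classifier_reward` are hypothetical, the noise schedule is an arbitrary choice, and the toy noise predictor stands in for a trained class-conditional diffusion model.

```python
import math
import random

NUM_STEPS = 100
# Hypothetical class centers for a toy 2-D (state, action) feature.
# Class 1 = "preferred", class 0 = "not preferred". Illustrative only.
CLASS_MEANS = {0: (-1.0, -1.0), 1: (1.0, 1.0)}


def alpha_bar(t):
    """Toy cumulative signal level, decreasing in t (an assumed schedule)."""
    return 1.0 - t / (NUM_STEPS + 1)


def toy_eps_model(x_t, t, label):
    """Stand-in for a trained class-conditional noise predictor.

    If class `label` were concentrated at CLASS_MEANS[label], the exact noise
    estimate would be (x_t - sqrt(ab) * mu) / sqrt(1 - ab).
    """
    ab = alpha_bar(t)
    mu = CLASS_MEANS[label]
    return tuple((x - math.sqrt(ab) * m) / math.sqrt(1.0 - ab)
                 for x, m in zip(x_t, mu))


def denoising_error(x0, label, eps_model, num_samples=64, seed=0):
    """Monte Carlo estimate of the mean squared noise-prediction error."""
    rng = random.Random(seed)  # same seed: every class sees the same (t, eps)
    total = 0.0
    for _ in range(num_samples):
        t = rng.randint(1, NUM_STEPS)
        ab = alpha_bar(t)
        eps = tuple(rng.gauss(0.0, 1.0) for _ in x0)
        # Forward-noise the clean input to timestep t.
        x_t = tuple(math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e
                    for x, e in zip(x0, eps))
        pred = eps_model(x_t, t, label)
        total += sum((p - e) ** 2 for p, e in zip(pred, eps))
    return total / num_samples


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def diffusion_classifier_reward(x0, eps_model):
    """Posterior of the 'preferred' class, used here as a step-wise reward.

    With two classes and a uniform prior, a softmax over negative errors
    reduces to sigmoid(err_not_preferred - err_preferred).
    """
    err0 = denoising_error(x0, 0, eps_model)
    err1 = denoising_error(x0, 1, eps_model)
    return _sigmoid(err0 - err1)


if __name__ == "__main__":
    for point in [(0.9, 1.1), (-1.0, -0.8)]:
        print(point, diffusion_classifier_reward(point, toy_eps_model))
```

In this toy setup, a point near the hypothetical "preferred" center is denoised better under label 1 and therefore receives a reward above 0.5, and a point near the other center receives a reward below 0.5. Reusing the same timestep and noise draws for both classes is a common variance-reduction choice for this kind of comparison. Whether DPR uses it is not stated in the abstract.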

Original (English)

Title (EN): Diffusion Classifier-Driven Reward for Offline Preference-based Reinforcement Learning

Abstract: Offline preference-based reinforcement learning (PbRL) mitigates the need for reward definition, aligning with human preferences via preference-driven reward feedback without interacting with the environment. However, trajectory-wise preference labels are difficult to meet the precise learning of step-wise reward, thereby affecting the performance of downstream algorithms. To alleviate the insufficient step-wise reward caused by trajectory-wise preferences, we propose a novel preference-based reward acquisition method: Diffusion Preference-based Reward (DPR). DPR directly treats step-wise preference-based reward acquisition as a binary classification and utilizes the robustness of diffusion classifiers to infer step-wise rewards discriminatively. In addition, to further utilize trajectory-wise preference information, we propose Conditional Diffusion Preference-based Reward (C-DPR), which conditions on trajectory-wise preference labels to enhance reward inference. We apply the above methods to existing offline RL algorithms, and a series of experimental results demonstrate that the diffusion classifier-driven reward outperforms the previous reward acquisition method with the Bradley-Terry model.
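For context on the baseline named in the abstract, the following is a minimal sketch of the standard Bradley-Terry preference model as it is commonly used in PbRL: the probability that one trajectory segment is preferred is a logistic function of the difference in summed step-wise rewards. The function names `bt_preference_prob` and `bt_loss` are hypothetical, and the reward values are made-up numbers for illustration, not results from the paper.

```python
import math


def _sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def bt_preference_prob(rewards_a, rewards_b):
    """P(segment a is preferred over segment b) under Bradley-Terry.

    Equals exp(R_a) / (exp(R_a) + exp(R_b)), where R is the sum of the
    step-wise rewards along each segment.
    """
    return _sigmoid(sum(rewards_a) - sum(rewards_b))


def bt_loss(rewards_a, rewards_b, label):
    """Cross-entropy loss; label is 1.0 if a is preferred, 0.0 if b is."""
    p = bt_preference_prob(rewards_a, rewards_b)
    tiny = 1e-12  # guards against log(0)
    return -(label * math.log(p + tiny)
             + (1.0 - label) * math.log(1.0 - p + tiny))


if __name__ == "__main__":
    # Two different step-wise reward assignments with the same total.
    early = [3.0, 0.0]
    late = [0.0, 3.0]
    other = [0.0, 0.0]
    print(bt_preference_prob(early, other))
    print(bt_preference_prob(late, other))
```

Both calls in the usage example print the same probability, because the Bradley-Terry likelihood depends only on the summed reward of each segment. This is one way to see why a trajectory-wise label alone leaves the step-wise rewards underdetermined, which is the gap the abstract says DPR targets.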

Published: 2025-09-24 19:00 UTC

