Quoting Boaz Barak, Gabriel Wu, Jeremy Chen and Manas Joglekar
OpenAI researchers propose 'confessions' as a method to improve AI honesty by training models to self-report misbehavior in reinforcement learning.
OpenAI researchers propose 'confessions' as a method to improve AI honesty by training models to self-report misbehavior in reinforcement learning.
OpenAI researchers propose 'confessions' as a method to improve AI honesty by training models to self-report misbehavior in reinforcement learning.
Explores reward hacking in reinforcement learning, where AI agents exploit reward function flaws, and its critical impact on RLHF and language model alignment.