Quoting Boaz Barak, Gabriel Wu, Jeremy Chen and Manas Joglekar
OpenAI researchers propose 'confessions' as a method to improve AI honesty by training models during reinforcement learning to self-report their own misbehavior.
A guide to building product evaluations for LLMs using three steps: labeling data, aligning evaluators, and running experiments.
Explores the shift from RLHF to RLVR for training LLMs, focusing on how objective, verifiable rewards can improve reasoning and accuracy.