What is RLHF (Reinforcement Learning from Human Feedback)?
A training technique that aligns LLM outputs with human preferences using reward models and reinforcement learning.
RLHF collects human judgments of model outputs (usually rankings or pairwise comparisons between responses), trains a reward model to predict those preferences, and then uses reinforcement learning, typically PPO, to fine-tune the LLM to maximize the reward model's score. It is one of the main methods used to make models such as ChatGPT and Claude helpful and safe.
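To make the two learning signals concrete, here is a minimal sketch in plain Python. It assumes a pairwise (Bradley-Terry style) loss for the reward model and a KL-style penalty against a frozen reference model during the RL step; both are common choices but are not specified in this entry. The names `preference_loss`, `shaped_reward`, and `beta` are hypothetical and chosen for illustration only.

```python
import math


def preference_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise reward-model loss: -log(sigmoid(r_chosen - r_rejected)).

    Assumption: preferences are collected as (chosen, rejected) pairs.
    The loss is small when the reward model scores the chosen response higher.
    """
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def shaped_reward(reward: float, logp_policy: float, logp_ref: float,
                  beta: float = 0.1) -> float:
    """Reward used in the RL step: reward-model score minus a KL-style penalty.

    Assumption: the penalty is beta * (log-prob under the policy minus
    log-prob under a frozen reference model), which discourages the
    fine-tuned model from drifting far from its starting point.
    """
    return reward - beta * (logp_policy - logp_ref)


if __name__ == "__main__":
    # Reward model agrees with the human label: low loss.
    print(preference_loss(2.0, 0.0))
    # Reward model disagrees with the human label: high loss.
    print(preference_loss(0.0, 2.0))
    # Policy assigns the response a higher log-prob than the reference: penalized.
    print(shaped_reward(1.0, logp_policy=-0.5, logp_ref=-1.0))
```

In a real pipeline the scalar rewards would come from a neural reward model and the log-probabilities from the LLM itself, with an RL algorithm such as PPO updating the policy on the shaped reward.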