ShipSquad

What is RLHF (Reinforcement Learning from Human Feedback)?

AI Engineering

Last updated:

A training technique that aligns LLM outputs with human preferences using reward models and reinforcement learning.

RLHF collects human ratings of model outputs, trains a reward model on those preferences, then uses RL (typically PPO) to fine-tune the LLM to maximize the reward signal. It is the primary method used to make models like ChatGPT and Claude helpful and safe.

Related Terms

Further Reading

Ready to assemble your AI squad?

10 specialized AI agents. One mission. $99/mo + your Claude subscription.

Start Your Mission