Written by Max Zeshut
Founder at Agentmelt
A training technique where an AI model is improved by learning from human preferences rather than just predicting the next token. Human raters rank model outputs from best to worst, and a reward model is trained on these rankings to score responses. The language model is then fine-tuned with reinforcement learning (typically PPO, with a KL penalty that keeps it close to the original model) to maximize the reward score. RLHF is how frontier models like Claude and GPT-4 learn to be helpful, harmless, and honest, and it directly affects how well agents follow instructions, stay on topic, and avoid harmful outputs.
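To make the reward-modeling step concrete, here is a minimal sketch in PyTorch, assuming the usual pairwise (Bradley-Terry) formulation: the reward model sees a preferred and a rejected response from a human comparison and is trained so the preferred one scores higher. The `RewardModel` class, the random embeddings standing in for encoded responses, and all hyperparameters are illustrative assumptions, not any particular lab's implementation.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a pooled response representation to a scalar reward (toy stand-in
    for a full language-model backbone with a scalar head)."""
    def __init__(self, hidden_dim: int = 64):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, pooled_response: torch.Tensor) -> torch.Tensor:
        return self.score(pooled_response).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: maximize the probability that the
    human-preferred response outscores the rejected one."""
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy training loop: random vectors stand in for embeddings of the two
# responses a human rater compared.
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    chosen = torch.randn(8, 64)    # responses the rater preferred
    rejected = torch.randn(8, 64)  # responses the rater rejected
    loss = preference_loss(model(chosen), model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Once trained, a reward model like this scores candidate outputs, and that score becomes the reward signal the language model is fine-tuned to maximize in the RL stage described above.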