Glossary

Reinforcement Learning from Human Feedback (RLHF)

A training technique where humans rank a model's outputs and the model is fine-tuned to prefer the higher-ranked responses. How Claude, GPT and others were taught to be helpful instead of just plausible.

Reinforcement Learning from Human Feedback (RLHF) is one of the key techniques that turned base language models (which just predict plausible next tokens) into useful assistants (which try to be helpful, harmless and honest).

The shape:

  1. Pre-train a base language model on a huge text corpus
  2. Have humans rate or rank pairs of model outputs by quality, helpfulness, safety
  3. Train a “reward model” to predict those ratings
  4. Use reinforcement learning to fine-tune the base model toward responses that score higher on the reward model
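Step 3 is the part that is easiest to show in code. The sketch below is a toy illustration, not any lab's actual pipeline: it assumes each response has already been reduced to a short list of numeric features (the two features here, a "helpfulness signal" and a "verbosity" score, are hypothetical), and it fits a linear reward model with a standard pairwise preference loss, so that the response humans preferred ends up scoring higher than the one they rejected. Real reward models are large neural networks, but the training signal has the same shape.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(score_chosen - score_rejected)).

    Small when the human-preferred response scores well above the rejected one.
    """
    margin = score_chosen - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def reward(weights, features) -> float:
    """Toy linear reward model: a weighted sum of (hypothetical) features."""
    return sum(w * f for w, f in zip(weights, features))


def train_reward_model(pairs, n_features, lr=0.1, epochs=200):
    """Fit weights by gradient descent on (chosen, rejected) feature pairs."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            diff = [c - r for c, r in zip(chosen, rejected)]
            margin = reward(weights, diff)
            # d/dw of -log(sigmoid(margin)) is -(1 - sigmoid(margin)) * diff
            grad_scale = -(1.0 - sigmoid(margin))
            weights = [w - lr * grad_scale * d for w, d in zip(weights, diff)]
    return weights


# Hypothetical human rankings: each pair is (preferred, rejected),
# with features [helpfulness_signal, verbosity].
pairs = [
    ([1.0, 0.2], [0.0, 0.9]),
    ([0.8, 0.1], [0.3, 0.7]),
]
weights = train_reward_model(pairs, n_features=2)
for chosen, rejected in pairs:
    print(reward(weights, chosen) > reward(weights, rejected))
```

Step 4 then uses a reinforcement learning algorithm to nudge the language model toward responses this kind of reward model scores highly; that stage is omitted here because it needs a real model to fine-tune.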

Anthropic’s Claude family uses RLHF plus an additional technique called Constitutional AI, in which the model is trained against a written constitution of principles, with AI-generated feedback guided by those principles supplementing human ratings (rather than relying purely on human ratings).

Why this matters in practice

The base language model is plausible but unaligned: it’ll happily make up facts with confidence, give harmful instructions, or be unhelpfully verbose. RLHF is the bridge that takes “GPT-3 in 2020” and turns it into “ChatGPT in 2022”.

By 2026, the techniques have evolved (RLAIF, DPO, GRPO, constitutional methods, mechanistic interpretability-based training), but RLHF is still the canonical reference for “how AI assistants got trained to be assistants.”

What this means for Australian SMB use

You don’t interact with RLHF directly. But it’s why:

  • Modern Claude won’t help you commit fraud or scam customers
  • ChatGPT softens its tone in ways a base GPT model wouldn’t
  • Models hedge (“I can help with that, though you should verify…”) on regulated topics like tax and law
  • Frontier models in 2026 follow instructions much more reliably than 2022 models did

If you’re noticing that one model feels “too cautious” or “too eager” compared to another, you’re usually noticing the different RLHF + constitutional training approaches each lab uses.
