Why RLHF Prevents AI from Mimicking Human Behavior

Attempts to make AI helpful and safe have had an unexpected side effect: modern chatbots are losing their grip on how humans actually think. While businesses eye LLMs as digital twins for marketing tests or HR training, a new study by the Helmholtz Munich consortium finds that the "smarter" and more compliant an assistant becomes, the less human its behavior remains.

The Alignment Failure

The analysis is based on the Psych-201 dataset, a collection of 26 million responses from 208,000 real participants across hundreds of behavioral experiments. Researchers pitted base models from the Qwen3, Llama3, and OLMo families against their "refined" counterparts that had undergone supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). The results were discouragingly consistent: "raw" models, trained simply to predict the next word, were markedly better than their optimized descendants at predicting real human reactions.
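To make the comparison concrete, here is a minimal sketch of how one might score two models against recorded human choices. It assumes each model assigns a probability to every option in a trial and uses mean negative log-likelihood as the score (lower means the model predicts people better). The article does not name the study's actual metric, so the metric, the `mean_nll` helper, and the toy numbers are all illustrative assumptions, not the researchers' pipeline.

```python
import math

def mean_nll(predicted_probs, human_choices, eps=1e-12):
    """Average negative log-likelihood of the observed human choices.

    predicted_probs: one list of option probabilities per trial.
    human_choices:   index of the option the participant actually picked.
    Lower values mean the model's predictions match people more closely.
    """
    total = 0.0
    for probs, choice in zip(predicted_probs, human_choices):
        # Clamp to avoid log(0) when a model rules an option out entirely.
        total += -math.log(max(probs[choice], eps))
    return total / len(human_choices)

# Hypothetical toy data: three two-option trials (not from Psych-201).
human_choices = [0, 1, 0]

# Assumed outputs of a base model: probabilities close to what people chose.
base_probs = [[0.6, 0.4], [0.3, 0.7], [0.55, 0.45]]

# Assumed outputs of an aligned model: confident in the "normative" answer.
aligned_probs = [[0.2, 0.8], [0.9, 0.1], [0.1, 0.9]]

base_score = mean_nll(base_probs, human_choices)
aligned_score = mean_nll(aligned_probs, human_choices)

print(f"base model NLL:    {base_score:.3f}")
print(f"aligned model NLL: {aligned_score:.3f}")
```

In this invented example the base model gets the lower score because it spreads probability the way the participants actually behaved, while the aligned model confidently backs the answers people did not give.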

The very training steps that transform language models into useful assistants strip them of the ability to accurately model human behavior.

This degradation is most pronounced in models optimized for reasoning and strict instruction-following. Base models naturally pick up on the heuristics and cognitive biases that drive human choices. Post-training with SFT and RLHF, by contrast, pushes the model toward "normative," logically sound, and polite responses. In trying to become the perfect logical assistant, the AI loses the human quirks, irrationality, and mental shortcuts that are essential for a credible social simulation.

Generational Decay and the Persona Myth

The study highlights a troubling trend: as base models grow more powerful from one version to the next, they diverge further from their "friendly" variants. In the transition from Qwen2 to Qwen3, raw models improved their grasp of human speech patterns, yet their aligned versions drifted even further away. This points to a systemic conflict: current alignment and safety practices appear to be at odds with psychological realism.

For executives and UX researchers, this imposes a hard limit on using top-tier chatbots like GPT-4 or Claude as proxies for focus groups. A model trained to be helpful and logically consistent cannot adequately simulate a living person, who is often neither. Using "sterilized" assistants for marketing tests or corporate policy simulations creates an illusion of predictability while ignoring real human volatility. If businesses want high-fidelity social simulations, they will have to return to raw models and learn to work around the layers that post-training adds.

Source: The Decoder
