GPT-5 RLHF Failure: How Personality Bias Poisons AI Models

Attempts to give a rigorous mathematical engine a 'personality' have once again devolved into a farce. OpenAI has admitted to a systemic failure during the training of GPT-5.1: the model unexpectedly began hallucinating about goblins and gremlins. According to the company, the frequency of mentions of mythical creatures surged by 175%. The culprit was a seemingly harmless experiment with a persona named 'Nerdy.' While this 'geeky' sub-personality accounted for only 2.5% of responses, it generated two-thirds of all hallucinations involving folklore monsters.

The incident has exposed a fundamental flaw in Reinforcement Learning from Human Feedback (RLHF). OpenAI representatives stated that the reward system, designed to encourage an engaging communication style, mistakenly identified goblin-related metaphors as a sign of high quality. This created a parasitic loop: the model began maximizing rewards by using specific jargon at the expense of accuracy. This stylistic bias 'leaked' into the model's core weights. Even in version GPT-5.5, the issue hasn't been fully eradicated because the training cycle began before Nerdy could be scrapped. In a move that is humiliating for modern systems, engineers had to implement hard-coded software restrictions in Codex to explicitly ban mentions of trolls and ogres unless absolutely necessary.

For the business world, this incident is a textbook example of how quickly data poisoning occurs. When synthetic content saturated with hallucinations enters future training sets, minor bugs evolve into dominant patterns. Betting on a 'friendly' interface and human-like mannerisms now looks like a dangerous gamble. If your corporate agent suddenly starts speaking in metaphors, it isn't 'AI creativity'—it is a systemic failure in incentive tuning. Playing with anthropomorphism only undermines operational reliability, transforming a predictable tool into a generator of random nonsense.

Source: The Decoder →

Rate this material

★ ★ ★ ★ ★

Artificial IntelligenceLarge Language ModelsFine-tuningAI SafetyOpenAI