OpenAI: The Risks of Fine-Tuning Open-Weight Models

OpenAI has introduced Malicious Fine-Tuning (MFT), a methodology for estimating the worst-case risks of releasing model weights. Researchers fine-tuned a model to maximize its capabilities in two domains, biological threat creation and cyberattacks, to test how far targeted fine-tuning can push an open-weight model toward dangerous use. The results may serve as a justification for restricting access to the weights of frontier models.

OpenAI is shifting the open-source debate from ideological disputes toward measurable catastrophic risk. In their latest report, the team led by Eric Wallace and Olivia Watkins introduced Malicious Fine-Tuning (MFT), a methodology designed to evaluate "worst-case scenarios" when publishing the weights of frontier models. Rather than theorizing, the researchers deliberately tried to elicit dangerous capabilities from a model dubbed gpt-oss in two critical domains: biology and cybersecurity.

To test biological threats, OpenAI used a reinforcement learning (RL) environment with web access, where the model was trained to generate plans for creating biological hazards. In the cyber domain, the model was placed in an agentic coding environment and tasked with solving Capture the Flag (CTF) challenges. The setup is designed to show how far a released model can be "reprogrammed" for malicious purposes through targeted fine-tuning.

"If the MFT methodology becomes an industry standard, the gap between 'safe' closed models and accessible open-weight alternatives will turn into an unbridgeable chasm."

The findings read as an attempt to build a technical foundation for keeping proprietary systems closed. The MFT version of gpt-oss showed only marginal skill improvements and did not reach a critical risk level; it even lagged behind the o3 model, which itself remains below the "Preparedness High" threshold. Still, the mere existence of such a benchmark changes the rules of the game. OpenAI is effectively creating a filter metric: if a model is too easily susceptible to malicious tuning, its weights should never reach the public domain.

For tech leads and business leaders, the signal is clear: the regulatory noose around powerful, free-to-use models is tightening. OpenAI’s research provides a convenient rationale for keeping cutting-edge developments behind closed doors. The era of uncontrolled frontier model releases seems to be ending, giving way to a regime where openness must be earned by showing that a model cannot readily be fine-tuned into a weapon.

Source: OpenAI Blog
