The Architectural Flaw at the Core of ReAct

The ReAct (Reason + Act) architecture, currently being integrated into nearly every corporate planner, contains a fundamental flaw: it cannot distinguish a legitimate tool output from a malicious instruction embedded in external data. Research by Mohammadreza Rashidi at the AI and Media Analysis Lab proves that any attacker who controls a tool’s return value—be it a calendar entry or file content—can hijack the agent. This indirect injection transforms the feedback loop into an open attack interface, where the model treats untrusted data as a direct command.

This vulnerability effectively erases the line between data and commands, turning the system's response into an external command prompt.

Attack Depth and Model Resilience

Empirical tests on GPT-4o-mini and Claude Haiku across 20 scenarios showed that the success of a breach depends directly on the "depth" of the injection within the execution chain. According to the study, the Attack Success Rate (ASR) for GPT-4o-mini reaches 60% if the injection occurs at the first step, but drops to zero by the fourth or fifth turn. The logic is simple: either the agent completes the task before hitting the "trap," or the natural inertia of the context takes over. Meanwhile, Claude Haiku demonstrated impressive resilience with a 0% ASR across all stages, thanks to a more conservative approach to tool calling and built-in resistance to manipulation.

Attack Success Rate (ASR) at step one: up to 60% for GPT-4o-mini. Claude Haiku resilience: 0% successful breaches in all tests. Iteration dependency: by the 5th step, context inertia effectively neutralizes the risk.

The Power of Framing and the Myth of Limits

Beyond depth, the "framing" of malicious code plays a critical role. Using persona-assignment techniques boosts the success rate from 25% to 75% at the initial injection point. Interestingly, the "turn budget" (iteration limit) offers no protection: the risk remains stable whether you allow the agent three steps or seven. This shatters the illusion that limiting system runtime can serve as a safety fuse.

The Future of Agentic System Security

The industry is blindly granting agents access to email and APIs, trusting that models "won't listen to strangers." However, current security measures ignore the lack of separation between data and instructions within the agentic cycle. Sanitizing just the first tool response could have prevented 67% of successful attacks in the study, yet most modern architectures treat any system response as trusted by default. If your defense strategy relies on the "integrity" of the model rather than architectural isolation, you have effectively turned your internal data into a public command line.

AI AgentsCybersecurityAI SafetyLarge Language ModelsAnthropic