Your AI agents' unit economics are rapidly decoupling from official price lists. While the industry continues to obsess over the cost per million tokens, Claude Sonnet 5 proves that fixed rates have become a mirage for investors. According to an Artificial Analysis report, Sonnet 5 shares the fifth spot on the Intelligence Index with GPT-5.5 (high), scoring 53 points. While the model technically outperforms the old Opus 4.8 flagship on agentic tasks, you are paying a hidden tax for this triumph. Sonnet 5 consumes such a high volume of tokens to perform the same work that, on a per-task basis, it actually costs more than Anthropic's previous top-tier model.
The Tokenization Trap
The real picture is a widening chasm between the showroom price and the final invoice. On paper, Sonnet 5 looks economical: $3 per million input tokens and $15 per million output tokens, compared to $5 and $25 for Opus 4.8. However, Artificial Analysis measurements tell a different story: the average task on Sonnet 5 costs $2.29, while Opus 4.8 completes it for $1.97. Dario Amodei is no stranger to changing the rules mid-game. Recall the Opus 4.7 launch, where a new tokenizer inflated the token count of identical text by 30%. Developer Abhishek Rai recorded increases of up to 1.47x, and community tests across 483 queries confirmed a 37.4% jump in tokens per request.
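To make the gap between list price and invoice concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the rates and per-task costs quoted above, and it assumes (this is not stated in the measurements) that both models use the same input-to-output token mix and that no caching or batch discounts apply. The function names are illustrative.

```python
# List prices in USD per million tokens, as quoted in the article.
SONNET_5_RATES = {"input": 3.00, "output": 15.00}
OPUS_4_8_RATES = {"input": 5.00, "output": 25.00}

# Average cost per task in USD, as attributed to Artificial Analysis.
SONNET_5_TASK_COST = 2.29
OPUS_4_8_TASK_COST = 1.97


def rate_discount(cheap: dict, expensive: dict) -> float:
    """Return how many times cheaper `cheap` is per token than `expensive`.

    Requires the discount to be the same for input and output tokens,
    which holds for the two price lists above (5/3 in both cases).
    """
    input_ratio = expensive["input"] / cheap["input"]
    output_ratio = expensive["output"] / cheap["output"]
    if abs(input_ratio - output_ratio) > 1e-9:
        raise ValueError("discount differs between input and output tokens")
    return input_ratio


def implied_token_multiplier(cheap_cost: float, expensive_cost: float,
                             discount: float) -> float:
    """Estimate how many times more tokens the cheaper-rate model consumed.

    Assumption: both models use the same input/output token mix, so cost
    scales linearly with total token volume.
    """
    return (cheap_cost / expensive_cost) * discount


# Token multiplier at which the cheaper rate stops saving money.
break_even = rate_discount(SONNET_5_RATES, OPUS_4_8_RATES)

# Token multiplier implied by the reported per-task costs.
implied = implied_token_multiplier(SONNET_5_TASK_COST, OPUS_4_8_TASK_COST, break_even)

print(f"Break-even token multiplier: {break_even:.2f}x")
print(f"Implied token multiplier:    {implied:.2f}x")
```

Under those assumptions, Sonnet 5 stays cheaper only while it uses less than about 1.67 times the tokens of Opus 4.8. The reported per-task costs imply a multiplier of roughly 1.94, which is past that break-even point.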
Anthropic consistently camouflages price hikes behind stable rates, while its models "chew through" significantly more tokens to achieve incremental performance gains.
This trend is only accelerating with the shift toward agentic architectures. In the AA-Briefcase and GDPval-AA benchmarks, the new Sonnet 5 runs three times as many loops as its predecessor. At peak performance, it burns 40% more output tokens for the same operation. Ultimately, a project that cost $1.20 on the old model now costs nearly twice as much, despite the per-token rate remaining unchanged.
Limits of the Reasoning Pivot
Despite its token appetite, Sonnet 5 quickly hits a ceiling in complex reasoning. In the CritPt fundamental physics test from Argonne National Laboratory, the model scored a measly 17%. While this is 14 points higher than Sonnet 4.6, it still trails GLM-5.2, Claude Opus, Fable, and high-spec GPT-5.5 configurations. Although progress is evident in Terminal-Bench v2.1 and Humanity’s Last Exam, other fronts have reached a plateau. This suggests that the "reasoning overhead" does not always translate into a proportional increase in intelligence.
For businesses, this shift toward heavy, multi-cycle systems is a questionable bargain. Hidden inflation within the Anthropic ecosystem is becoming an operational risk that makes scaling difficult. We are entering a reality where the price per million tokens is meaningless. Anthropic has maintained a clean price list for public reports, but the model has effectively doubled your bill just to prove its competence. Switching to a "Cost per Task" metric is now the only way to maintain financial control before autonomous agents devour your margins.
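One way to track a "Cost per Task" metric is to group every model call by the task it served and sum the spend per task, rather than per token. The Python sketch below is a minimal illustration of that idea. The `Call` record, the `task_id` field, and the token counts in the sample log are hypothetical. Only the $3 and $15 rates come from the price list discussed above.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Call:
    """One model call made by an agent while working on a task (hypothetical schema)."""
    task_id: str
    input_tokens: int
    output_tokens: int


def call_cost(call: Call, input_rate: float, output_rate: float) -> float:
    """Cost of a single call in USD, with rates given per million tokens."""
    return (call.input_tokens * input_rate
            + call.output_tokens * output_rate) / 1_000_000


def cost_per_task(calls: list[Call], input_rate: float, output_rate: float) -> dict[str, float]:
    """Sum the cost of all calls that belong to the same task."""
    totals: dict[str, float] = {}
    for call in calls:
        totals[call.task_id] = totals.get(call.task_id, 0.0) + call_cost(
            call, input_rate, output_rate
        )
    return totals


# Hypothetical usage log: task-a needed two agent loops, task-b only one.
calls = [
    Call("task-a", input_tokens=200_000, output_tokens=20_000),
    Call("task-a", input_tokens=300_000, output_tokens=30_000),
    Call("task-b", input_tokens=100_000, output_tokens=10_000),
]

per_task = cost_per_task(calls, input_rate=3.00, output_rate=15.00)
average_cost = mean(per_task.values())

for task_id, cost in per_task.items():
    print(f"{task_id}: ${cost:.2f}")
print(f"Average cost per task: ${average_cost:.2f}")
```

Because the per-token rate is the same for every call here, any difference between tasks comes from how many loops ran and how many tokens each loop consumed. That is the quantity the per-million price hides.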