AI agents and their purported "skills" promise sweeping automation, at least on paper. The concept, which gained traction after Anthropic's October 2025 announcement and was quickly amplified by OpenAI and others, holds that agents can tap into packaged specialized knowledge (APIs, workflows, best practices) to solve almost any task. The appeal is undeniable, particularly for platforms like Claude Code or Codex. The reality, however, is proving far messier. Research from UC Santa Barbara, MIT CSAIL, and the MIT-IBM Watson AI Lab, examining 34,000 published "skills," reaches a stark conclusion: under realistic conditions, these enhancements offer minimal help and sometimes actively hurt performance.

The core issue is methodological. Current benchmarks, such as SKILLSBENCH, hand agents precisely curated, task-specific skills, effectively serving the answer on a platter. The researchers highlight a telling example: an agent tasked with identifying flood days at USGS stations was given three skills containing ready-made data-download APIs, the threshold URLs, and even code to pinpoint the required days. As the authors note, this is "almost an instruction manual for solving the task." Real-world agents, by contrast, must sift through large, noisy skill collections, locate the relevant ones, adapt them to the task at hand, and have no guarantee that a suitable skill exists at all.
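
For context, a "skill" in this ecosystem is typically a small directory of instructions and helper files that the agent loads on demand; Anthropic's format centers on a SKILL.md file with YAML frontmatter. The block below is a hypothetical reconstruction of what a skill for the flood task might look like, not the benchmark's actual file: the USGS daily-values endpoint shown is real, but every other detail is illustrative.

```markdown
---
name: usgs-flood-days
description: Identify the days on which a USGS gauge exceeded its flood threshold.
---

# Finding flood days at a USGS station

1. Download the station's daily gauge data from the USGS water services
   API (https://waterservices.usgs.gov/nwis/dv/).
2. Look up the station's flood-stage threshold (the benchmark's skills
   reportedly bundled this URL directly).
3. Return every date whose daily value exceeds the threshold.
```

Handing the agent a file like this reduces the evaluation to recipe-following, which is precisely the researchers' objection.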

The research team collected those 34,000 actual skills from open repositories, then devised six evaluation scenarios of increasing difficulty, ranging from handing agents the exact task-specific skills to making them search the full collection with no hints at all. Three model-and-harness pairings were tested: Claude Opus 4.6 with Claude Code, Kimi K2.5 with Terminus-2, and Qwen3.5-397B-A17B with Qwen Code. The results were sobering: in the hardest scenarios, skill-augmented agents showed only marginal gains over baselines without skills, and weaker models actually performed worse once skills were added, undercutting the idea that rolling skills out everywhere is a universal upgrade.
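
To make the tiered design concrete, here is a minimal Python sketch of how such an evaluation harness might be wired up. Everything in it is an assumption: the condition names only loosely mirror the paper's six tiers, and `run_agent` and `retrieve_skills` are placeholder stand-ins, not the authors' actual code.

```python
import random

# Illustrative difficulty tiers, easiest to hardest (names are invented,
# not the paper's; they only echo its oracle-to-open-search spectrum).
CONDITIONS = [
    "oracle_skills",      # exact task-specific skills handed over
    "oracle_plus_noise",  # correct skills mixed with irrelevant ones
    "top_k_retrieval",    # harness retrieves the k nearest skills
    "agent_search",       # agent queries the skill corpus itself
    "agent_search_cold",  # as above, without being told skills exist
    "no_skills",          # plain baseline, no skills available
]

def retrieve_skills(task: str, corpus: list[str], k: int = 3) -> list[str]:
    """Placeholder retriever: a real harness would use embedding search."""
    return random.sample(corpus, min(k, len(corpus)))

def run_agent(task: str, skills: list[str]) -> bool:
    """Placeholder rollout: a real harness would launch the agent with the
    skills mounted and score it against the task's verifier. Here we just
    simulate an outcome so the sketch runs end to end."""
    return random.random() < 0.5

def evaluate(tasks: list[str], corpus: list[str],
             oracle: dict[str, list[str]]) -> dict[str, float]:
    """Run every task under every condition and report success rates."""
    scores = {}
    for cond in CONDITIONS:
        passed = 0
        for task in tasks:
            if cond == "oracle_skills":
                skills = oracle[task]
            elif cond == "oracle_plus_noise":
                skills = oracle[task] + retrieve_skills(task, corpus)
            elif cond == "top_k_retrieval":
                skills = retrieve_skills(task, corpus)
            else:
                # agent_search*, no_skills: nothing is pre-mounted; in a
                # real harness the agent may search the corpus on its own.
                skills = []
            passed += run_agent(task, skills)
        scores[cond] = passed / len(tasks)
    return scores

if __name__ == "__main__":
    corpus = [f"skill-{i}" for i in range(100)]
    tasks = ["usgs-flood-days", "parse-logs"]
    oracle = {t: [f"{t}-skill"] for t in tasks}
    print(evaluate(tasks, corpus, oracle))
```

The comparison that matters is between the easy end (oracle_skills) and the hard end (agent_search, no_skills): the paper's finding is that the advantage agents show under oracle conditions largely evaporates once they must find and adapt skills themselves.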

For you as a CEO, this study directly challenges the marketing claims around AI agents, which are often predicated on idealized laboratory benchmarks. When making AI investment decisions, assess advertised efficacy critically: focus on actual integration and measurable outcomes within your own business processes rather than on impressive benchmark scores. Blind faith in skills and leaderboards invites unjustified expenditure and deep disappointment; the real value lies in a pragmatic approach and a clear-eyed view of current technological limits.

AI Agents · AI in Business · Automation · AI Investment · Anthropic