IBM Research has introduced VAKRA, a benchmark that serves more as a durability test for AI agents than a mere playground. Rather than evaluating isolated skills, VAKRA forces agents to solve real-world, multi-step tasks within environments that simulate complex corporate settings. The framework assesses their capacity for compositional reasoning while interacting with over 8,000 APIs and databases across 62 domains. In short, it tests whether an agent can chain together 3 to 7 reasoning steps: making structured API calls, extracting data from documents, and strictly following tool-usage protocols.
The creators of VAKRA didn't reinvent the wheel; they simply raised the stakes. For instance, one test category requires chaining business-intelligence APIs across 2,077 test scenarios. In each scenario, an agent must execute between one and twelve calls using tool collections such as SLOT-BIRD and SEL-BIRD. To ensure maximum realism, the benchmark provides a live working environment with local APIs and genuine databases, a significant step up from previous simulation attempts.
The initial VAKRA results are, to put it mildly, discouraging. AI agents that previously excelled in isolated task benchmarks began to stumble when faced with these multi-step scenarios. The benchmark pinpoints where they fail: in compositional reasoning and in managing tools effectively within complex chains.
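One way to see why longer chains are so much harder than isolated tasks is simple compounding. The sketch below is illustrative arithmetic only: it assumes every step in a chain succeeds independently with the same probability, which is a simplification and not something VAKRA measures or reports. The per-step rates are hypothetical, and `chain_success` is a name invented for this example. The chain lengths echo the ranges mentioned above (3 to 7 steps, up to twelve calls).

```python
# Illustrative arithmetic only. Assumption: each step succeeds independently
# with the same probability. The per-step rates are hypothetical, not VAKRA results.

def chain_success(per_step: float, steps: int) -> float:
    """Return the probability that all `steps` independent steps succeed."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


# Print a small grid: hypothetical per-step rates against chain lengths.
for per_step in (0.99, 0.95, 0.90):
    for steps in (1, 3, 7, 12):
        rate = chain_success(per_step, steps)
        print(f"per-step {per_step:.2f}, {steps:2d} steps: {rate:.1%} of chains complete")
```

Under these assumptions, an agent that gets each individual step right 90% of the time would complete a 7-step chain less than half the time (about 48%). Real agents will not behave this neatly, since errors can be correlated or recoverable, but the arithmetic shows why strong single-task scores do not guarantee strong multi-step results.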
What this means for business:
VAKRA demonstrates that today’s AI agents are not yet the "universal soldiers" the corporate world is waiting for. To handle complex enterprise tasks, these models need substantial improvements in reasoning and tool integration. IBM researchers noted that model performance on VAKRA remains low, specifically because of difficulties in navigating complex sequences. This suggests that deploying AI agents directly into sophisticated operational processes, those that require reliable execution of multi-stage workflows, is premature. Real-world use in these scenarios will likely require significant investment in reasoning and tool-integration capabilities, which could delay widespread corporate adoption.