Modern AI agents fail tasks not because of poor search capabilities, but because they are afraid to admit they are confused. Researchers from Tencent Hunyuan and Tsinghua University have introduced DiscoBench, a benchmark that exposes the core flaw in autonomous search: an inability to recognize ambiguity and ask clarifying questions. While industry tests like GAIA or BrowseComp create an illusion of efficiency by feeding models perfectly polished queries, reality serves up chaos.

As soon as an agent encounters an object that fits multiple descriptions or time periods, it doesn't ask for clarification; it simply guesses. The result is a process with flawless syntax but zero value. The creators of DiscoBench tested eleven heavyweights, including Gemini 1.5 Pro, Claude 3.5 Sonnet, and Doubao-1.5-Pro. Even the top performer, Doubao, achieved only 43.1% accuracy on trickier tasks. For Claude 3.5 Sonnet, the situation is even more telling: the model successfully navigates 57% of intermediate steps, but overall accuracy plummets to 39.8%. A single unresolved ambiguity is enough to make the entire reasoning chain collapse like a house of cards.
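
To make the "ask instead of guess" behavior concrete, here is a minimal sketch of a clarification gate. It is not taken from DiscoBench or from any of the tested models; the `Candidate` type, the `resolve` function, and the action dictionary format are all hypothetical names invented for illustration. The idea is that the agent searches only when the query has exactly one referent, and otherwise hands control back to the user.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One entity matching the user's description (hypothetical type)."""
    name: str
    period: str


def resolve(query: str, candidates: list[Candidate]) -> dict:
    """Return a search action only when the query has exactly one referent.

    Assumption: some upstream retrieval step has already produced the
    list of entities that fit the user's description.
    """
    if not candidates:
        # Nothing fits: ask rather than invent a target.
        return {
            "action": "ask_user",
            "message": f"Nothing matches '{query}'. Can you rephrase?",
        }
    if len(candidates) > 1:
        # Several entities fit: list them and let the user choose.
        options = "; ".join(f"{c.name} ({c.period})" for c in candidates)
        return {
            "action": "ask_user",
            "message": f"'{query}' is ambiguous. Which one do you mean: {options}?",
        }
    # Exactly one match: safe to proceed with the search.
    return {"action": "search", "target": candidates[0].name}
```

For example, a query such as "the Treaty of Paris" with candidates dated 1783 and 1898 would yield an `ask_user` action naming both options, while a single candidate would go straight to `search`.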

Key Takeaways from the DiscoBench Study

Models are prone to "action hallucinations": they continue executing tasks even when input data is contradictory.

Traditional benchmarks overestimate agent capabilities by excluding uncertainty from testing scenarios.

Accuracy drops catastrophically when a task requires the model to verify context with the user.

"The real breakthrough in AI transformation lies in a paradigm shift: moving from increasing search power to implementing goal verification."

For businesses, this represents a direct operational risk. Instead of hitting pause to request human input, agents blindly burn through budgets and compute resources exploring hallucinated scenarios. We believe the era of blind execution at any cost must end. If your agent isn't asking questions, it isn't solving your business problem—it’s just imitating productivity on your dime.
