Modern benchmarks create a dangerous illusion of AI competence by testing only short, isolated interactions. In the vacuum of a single prompt, an agent might look like a genius; however, a joint study by Zhejiang University and Ant Group reveals that once these agents are moved to real-world analytics, their performance collapses. The problem isn't the number of steps, but an inability to keep track of the shifting context of an evolving task.
To test this hypothesis, the researchers introduced LongDS-Bench, a stress test based on 68 Kaggle cases involving over 2,200 iterations. The results are sobering: model accuracy drops by nearly 47 percentage points from the beginning to the end of a session. Even industry leaders barely reach a 48.45% average accuracy. The primary failure points are specific kinds of dependency: state updates, rollbacks to previous stages, and merging data points acquired a dozen steps prior. According to LongDS data, "long-horizon" errors account for 52% to 69% of all failures.
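To make the "beginning versus end of a session" comparison concrete, here is a minimal sketch of how such a drop could be measured in percentage points. The function name `accuracy_drop`, the `segment` parameter, and the toy session are all hypothetical and invented for illustration; the study's actual scoring procedure is not described here and may differ.

```python
def accuracy_drop(results, segment=0.25):
    """Return (early_acc, late_acc, drop) in percent / percentage points.

    results: per-turn outcomes in session order, True if the turn was correct.
    segment: fraction of turns treated as the "beginning" and as the "end"
             (an assumption of this sketch, not a value from the study).
    """
    if not results:
        raise ValueError("session has no turns")
    k = max(1, int(len(results) * segment))   # turns in each segment
    early = 100.0 * sum(results[:k]) / k      # accuracy over the first k turns
    late = 100.0 * sum(results[-k:]) / k      # accuracy over the last k turns
    return early, late, early - late


# Toy session, invented for illustration only (not data from LongDS-Bench).
toy_session = [True, True, True, True, False, True,
               False, True, False, True, False, False]

early, late, drop = accuracy_drop(toy_session)
print(f"early={early:.1f}%  late={late:.1f}%  drop={drop:.1f} points")
```

Reporting the drop as a difference of two percentages, rather than as a relative change, keeps it comparable across models that start from different baselines.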
Key Research Findings
Accuracy Halves: As task chains grow more complex, LLM performance degrades rapidly.
The Long-Horizon Gap: Agents lose the logical thread if the gap between dependent steps exceeds 11 moves.
False Correlations: The accumulation of minor distortions in early stages renders the final result entirely useless.
Current agentic frameworks are unfit for end-to-end data analysis without rigid oversight. Delegating complex business logic to them remains a high-risk gamble.
When an agent is forced to juggle changing metrics or test counterfactual hypotheses, its logic falls apart. The average gap between dependent steps in the study was 11.3 turns, a distance at which modern LLMs begin to contradict themselves. Instead of a coherent analytical process, we see the workflow degrade: every new step merely compounds the distortions accumulated before it.
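The notion of a "gap between dependent steps" can be sketched as follows. This assumes a gap is the number of turns between a step and the earliest earlier step it relies on; that definition, the names `dependency_gaps` and `average_gap`, and the toy dependency map are assumptions of this example, not the benchmark's published method or data.

```python
def dependency_gaps(dependencies):
    """Map each dependent turn to its gap.

    dependencies: dict of turn number -> list of earlier turns it relies on.
    The gap is the distance back to the earliest prerequisite turn
    (an assumed definition for this sketch).
    """
    gaps = {}
    for turn, deps in dependencies.items():
        if not deps:
            continue  # independent turn, contributes no gap
        if any(d >= turn for d in deps):
            raise ValueError(f"turn {turn} depends on a non-earlier turn")
        gaps[turn] = turn - min(deps)
    return gaps


def average_gap(dependencies):
    """Mean gap over all turns that have at least one dependency."""
    gaps = dependency_gaps(dependencies)
    return sum(gaps.values()) / len(gaps) if gaps else 0.0


# Toy session, invented for illustration only.
toy_dependencies = {
    1: [],        # initial data load
    2: [1],       # first aggregate
    5: [2, 4],    # merge of two earlier results
    14: [3],      # revisits a metric defined at turn 3
    20: [5, 18],  # rollback to the state produced at turn 5
}

gaps = dependency_gaps(toy_dependencies)
long_range = [turn for turn, gap in gaps.items() if gap > 11]
print(gaps, average_gap(toy_dependencies), long_range)
```

Flagging turns whose gap exceeds a threshold, as `long_range` does here, is one simple way to single out the steps where long-horizon failures would be expected to concentrate.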
Until models learn to manage a dynamic task state rather than just executing momentary commands, they will remain tactical assistants for one-off assignments. It is premature to speak of creating an autonomous analyst capable of handling an entire project from start to finish.