Why AI Fails in Investment Banking: BankerToolBench Insights

The era of autonomous finance has been indefinitely postponed. The BankerToolBench report, co-authored by Handshake AI and McGill University, highlights the gap between marketing hype and the realities of the financial industry. In the study, 500 investment bankers from global firms such as Goldman Sachs and JPMorgan evaluated the output of advanced models, including GPT-5.4 and Claude Opus 4.6. The verdict was sobering: not a single output in the sample passed the 'client-ready' filter.

Rather than relying on synthetic benchmarks, the study tested the models against real-world workflows that typically occupy junior analysts for 5 to 21 hours, ranging from parsing SEC filings to building dynamic Excel models. While half of the bankers surveyed are willing to use AI to generate rough drafts, they categorically reject it as an autonomous agent. The reason lies in fiduciary responsibility: the gap between a 'template' and a final product is too wide, and the risks to a firm's reputation and financial health are potentially fatal.

The drop in performance becomes particularly evident when moving from simple dialogue to applied tasks. According to BankerToolBench data, executing a single complex task can require up to 539 model calls, with 97% of the activity centered on code execution. Market leader GPT-5.4 failed nearly half of the 150 benchmark criteria, while Gemini 2.5 Pro failed to meet a single one. Ironically, Claude Opus 4.6 produces reports that appear visually flawless yet hide fundamental errors in the underlying Excel logic, making them unusable for audit purposes.

In a high-stakes industry, the economics of error make deploying Large Language Models (LLMs) a questionable venture. If a senior manager must verify every single formula to avoid legal consequences, the cost of checking AI output for hallucinations can exceed the cost of employing a human analyst. When required to produce three correct results in a row, GPT-5.4’s success rate plummets to a meager 13%. For the industry, this means one thing: current models are decent brainstorming assistants but catastrophically unreliable employees where mathematical precision is required. Until AI can pass 100% of technical checks, the promised ROI of automation will remain nothing more than a vanity metric in vendor presentations.
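To put the 13% figure in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes that each run succeeds independently with the same probability; that is a simplification made for illustration, not something the report states, and the function names are hypothetical. Under that assumption, a three-in-a-row rate of 13% corresponds to a single-run success rate of roughly one in two.

```python
def per_run_rate_from_streak(streak_rate: float, k: int) -> float:
    """Single-run success rate implied by a k-in-a-row success rate.

    Assumption (ours, not the report's): runs are independent and
    share the same success probability p, so streak_rate = p ** k.
    """
    if not 0.0 <= streak_rate <= 1.0:
        raise ValueError("streak_rate must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return streak_rate ** (1.0 / k)


def streak_rate_from_per_run(per_run_rate: float, k: int) -> float:
    """Probability of k consecutive successes under the same assumption."""
    if not 0.0 <= per_run_rate <= 1.0:
        raise ValueError("per_run_rate must be between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return per_run_rate ** k


# 13% is the three-in-a-row figure quoted in the article.
implied = per_run_rate_from_streak(0.13, 3)
print(f"Implied single-run success rate: {implied:.1%}")

# Illustrative only: how quickly reliability decays with longer streaks.
for k in (1, 3, 5):
    print(k, f"{streak_rate_from_per_run(implied, k):.1%}")
```

The point of the sketch is the compounding: even a model that gets a task right about half the time becomes rarely reliable once several correct results in a row are required.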

Source: The Decoder