The high scores Large Language Models (LLMs) boast on standard benchmarks like Spider or BIRD are a dangerous illusion for the enterprise. While GPT-4o flaunts 82% accuracy on sterile academic datasets, its performance collapses the moment it encounters a real-world corporate data warehouse. A new study from researchers at MIT, Intel, and Harvard, with database pioneer Michael Stonebraker among the authors, argues that the industry's standard Text-to-SQL benchmarks are hopelessly detached from reality. Public benchmarks are built on small, pristine schemas, whereas real data warehouses are architectural minefields: hundreds of tables with cryptic names and undocumented, implicit relationships.
To expose this performance gap, a team led by MIT's Peter Baile Chen introduced BEAVER, the first benchmark constructed entirely from proprietary corporate systems. This is no academic sandbox: 9128 question-SQL pairs were extracted from actual enterprise query logs spanning 19 different domains. The BEAVER results show that models don't struggle with SQL syntax itself, but with five critical bottlenecks: schema discovery, join-key identification, column mapping, context retrieval, and query decomposition. When top-tier agentic frameworks powered by GPT-4 were tested against BEAVER, their accuracy plummeted to just 10.8%.
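The join-key bottleneck is the easiest to picture. Below is a minimal, invented illustration (the tables, columns, and data are hypothetical, not taken from BEAVER): answering even a trivial question requires knowing that `dept_cd` in one table matches `org_cd` in another, a relationship that nothing in the schema metadata declares.

```python
import sqlite3

# Hypothetical warehouse tables with cryptic names. The join key is implicit:
# t_emp_mstr.dept_cd corresponds to t_org_unit.org_cd, but no foreign key,
# naming convention, or documentation says so.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t_emp_mstr (emp_no INTEGER, emp_nm TEXT, dept_cd TEXT);
    CREATE TABLE t_org_unit (org_cd TEXT, org_nm TEXT);
    INSERT INTO t_emp_mstr VALUES (1, 'Ada', 'D01'), (2, 'Lin', 'D02');
    INSERT INTO t_org_unit VALUES ('D01', 'Research'), ('D02', 'Sales');
""")

# Question: "Which department does Ada work in?"
# Correct only if the model infers the undocumented dept_cd = org_cd link:
gold_sql = """
    SELECT o.org_nm
    FROM t_emp_mstr e
    JOIN t_org_unit o ON e.dept_cd = o.org_cd
    WHERE e.emp_nm = 'Ada'
"""
print(conn.execute(gold_sql).fetchall())  # [('Research',)]
```

On public benchmarks such links are usually spelled out as foreign-key constraints; in BEAVER-style warehouses the model has to discover them on its own, which is exactly where accuracy collapses.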
This collapse reveals where models begin to hallucinate systematically. Even when researchers provided them with 'oracle hints' (perfect annotations for every intermediate step), accuracy failed to surpass 30.1%. This suggests the problem isn't missing context, but a fundamental inability to grasp complex business logic and advanced analytical SQL. As the MIT and Intel authors point out, today's 'all-or-nothing' metrics are useless for engineers. BEAVER, by contrast, allows a granular post-mortem: did the query fail because the model missed a join key, or because it lacked the domain context for a specific term?
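To make that contrast concrete, here is a simplified sketch of the two evaluation styles (my own illustration, not BEAVER's actual harness): an all-or-nothing execution match alongside one granular check that asks whether the model even referenced the right tables. All table names and queries below are invented.

```python
import re
import sqlite3
from collections import Counter

def execution_match(conn, candidate_sql, gold_sql):
    """All-or-nothing scoring: the candidate passes only if it returns
    exactly the same multiset of rows as the gold query."""
    rows = lambda sql: Counter(map(tuple, conn.execute(sql).fetchall()))
    return rows(candidate_sql) == rows(gold_sql)

def diagnose_tables(candidate_sql, gold_sql):
    """One granular post-mortem check: did the model find the right tables?
    A miss here implicates schema discovery rather than join or filter logic.
    (Naive regex table extraction; real SQL parsing is more involved.)"""
    tables = lambda sql: set(re.findall(r"(?:FROM|JOIN)\s+(\w+)", sql, re.I))
    return sorted(tables(gold_sql) - tables(candidate_sql))

# Toy warehouse for the demo.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (ord_id INTEGER, cust_cd TEXT, amt REAL);
    CREATE TABLE cust_mstr (cust_cd TEXT, region TEXT);
    INSERT INTO orders VALUES (1, 'C1', 10.0), (2, 'C2', 20.0);
    INSERT INTO cust_mstr VALUES ('C1', 'EMEA'), ('C2', 'APAC');
""")

gold = ("SELECT c.region, SUM(o.amt) FROM orders o "
        "JOIN cust_mstr c ON o.cust_cd = c.cust_cd GROUP BY c.region")
candidate = "SELECT cust_cd, SUM(amt) FROM orders GROUP BY cust_cd"  # skipped the join

print(execution_match(conn, candidate, gold))  # False -- but why?
print(diagnose_tables(candidate, gold))        # ['cust_mstr'] -- schema discovery failed
```

The first function only says the model was wrong; the second says where it went wrong, which is the kind of signal an engineer can actually act on.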
For Chief Data Officers (CDOs) and R&D leaders, the message is clear: the road to autonomous analytics is much longer than the marketing brochures suggest. A roughly 90% failure rate on real-world data means that simple 'wrappers' around standard LLM APIs are currently unfit for mission-critical tasks. To break the deadlock, companies must stop chasing ever-larger context windows and start investing in specialized architectures that can untangle undocumented schemas. Until AI can close the roughly 70-point gap between benchmark and enterprise accuracy, natural-language interfaces for corporate data will remain a laboratory experiment rather than a production tool.