Nearly three decades ago, IBM's Deep Blue defeated Garry Kasparov. Since then, the AI industry has been chasing the same specter of machine supremacy. Chess, Go, Jeopardy!, and GPT-4 composing poetry better than a high school student are all impressive feats, but at what cost? Victories over humans in narrowly defined tasks rarely translate to real-world business contexts. We admire the glossy image of superiority, forgetting that AI in practice is not a checkers tournament but part of a complex operating environment. What matters is not the speed of a single move, but integration into workflows and teamwork.
Even the most advanced contemporary benchmarks, such as SuperGLUE, attempt to simulate complex tasks but suffer from a critical flaw: they evaluate models in a vacuum. Imagine a team of doctors in which one genius can diagnose from scans but cannot explain the findings to colleagues or integrate them into a patient's overall medical history. Or an engineer who writes flawless code but cannot collaborate with other programmers. This is often how AI systems that win academic competitions function. They are assessed on isolated tests, ignoring the reality that in the real world AI must be part of a team, interact with humans, account for uncertainty, and adapt to changing conditions. In practice, even systems billed as 'smarter' than humans can end up consuming more time and resources than they save. A telling example comes from some FDA-approved AI assistants for physicians: instead of saving time, radiologists found themselves deciphering machine outputs that were useful only in isolation, divorced from the rest of the clinical workflow.
The current approach to AI evaluation is a numbers game leading to multi-billion dollar losses. Companies relying on such benchmarks make two critical mistakes. The first is misdirected investment. They procure expensive AI solutions that appear groundbreaking on paper but fail to deliver expected results in reality because they do not fit into existing workflows. Recall how many companies rushed to implement generative models for content, only to end up with a pile of low-utility texts requiring extensive revision. The second mistake is lost opportunity. While some companies spend budgets on 'winners' from outdated tests, others, looking further ahead, are building evaluation systems focused on real human-machine interaction. Researchers from MIT and other universities, for instance, are actively developing methodologies that assess AI in the context of collaborative work with humans, analyzing not just the outcome but also the process of achieving it.
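To make the contrast concrete, here is a minimal sketch of what a workflow-level evaluation could look like, as opposed to an isolated benchmark score. The names (WorkflowRun, evaluate_team) and the specific metrics are illustrative assumptions, not a description of any published methodology; the point is only that the unit of measurement is the whole human-machine workflow, including review time and rework, rather than the model alone.

```python
# Hypothetical sketch: score the workflow, not the model in isolation.
# WorkflowRun, evaluate_team, and the chosen metrics are illustrative assumptions.
from dataclasses import dataclass
from statistics import mean

@dataclass
class WorkflowRun:
    minutes_to_complete: float   # total time, including reviewing and correcting AI output
    revisions_needed: int        # how many times a human had to rework the result
    accepted: bool               # did the result pass final human/business review?

def evaluate_team(runs: list[WorkflowRun]) -> dict:
    """Summarize process metrics, not just end-task accuracy."""
    return {
        "avg_minutes": mean(r.minutes_to_complete for r in runs),
        "avg_revisions": mean(r.revisions_needed for r in runs),
        "acceptance_rate": sum(r.accepted for r in runs) / len(runs),
    }

# Compare the same task done by humans alone vs. a human-AI team.
human_only = [WorkflowRun(42, 1, True), WorkflowRun(55, 2, True)]
human_plus_ai = [WorkflowRun(30, 3, True), WorkflowRun(35, 4, False)]

print("baseline:", evaluate_team(human_only))
print("with AI: ", evaluate_team(human_plus_ai))
# A model that "wins" an isolated benchmark can still lose here if review time
# and revision counts erase the expected gains.
```

In a measurement like this, a benchmark champion can come out behind a less glamorous model that simply fits the team's way of working.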
Business leaders making decisions about AI investments must understand that old metrics are a direct path to failure. When a vendor demonstrates their model's victory over a human in a standard test, you should ask: How does this model integrate into our workflow? How will it interact with our employees? What are the real, measurable business outcomes, not just a percentage of accuracy on an isolated task? Ignoring these questions has already led to billions in losses and eroded trust in AI. If you do not revise your evaluation criteria, you risk being left behind while competitors build practical, functioning AI solutions based on human-centric metrics.