The era of inflated expectations for AI's logical abilities is coming to a close. Researchers are shifting model evaluation to a "Hard Mode" standard, stripping out the hints that were previously embedded directly in the problem statements. According to the report on Discover And Prove (DAP), modern LLMs have learned to mimic reasoning convincingly: they guess the correct final answer in about 80% of cases, yet produce a valid formal proof in Lean 4 less than 10% of the time. In our view, this is a stark illustration of the gap between intuition and genuine logic.
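The distinction matters because in Lean 4 an answer and a proof are different objects. A toy illustration (ours, not taken from the DAP paper): a model can emit a numeric answer with no justification, but credit in Hard Mode requires a statement the kernel actually checks.

```lean
-- A guessed answer is just a value; nothing obliges it to be correct.
def guessedAnswer : Nat := 42

-- A proof obligation is a proposition the Lean kernel must verify.
-- Here a trivial stand-in: the claim is only accepted because `rfl`
-- makes Lean compute both sides and confirm they are equal.
theorem certified : 6 * 7 = 42 := by
  rfl
```

If the model's "intuition" is wrong, the second form fails to compile; the first form fails silently.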
As long as models merely "hallucinate in the right direction," they are useless for verifying critical software or engineering systems. To close this gap, the developers introduced DAP, an agentic framework that replaces built-in hints with a process of deliberate reasoning. As stated in the arXiv preprint, the system first formulates a hypothesis in natural language, subjects it to self-reflection, and only then translates its conclusions into formal code for a prover.
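The loop described above can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: the function names (`propose`, `reflect`, `formalize`, `lean_check`) are illustrative, not the framework's actual API, and the stubs simulate the steps rather than call a real LLM or the Lean toolchain.

```python
# Hypothetical sketch of a DAP-style agentic loop. The helpers below are
# toy stand-ins, NOT the framework's real API: in practice `propose` and
# `reflect` would call an LLM, and `lean_check` would invoke Lean 4.
from typing import Optional

def propose(statement: str, feedback: str = "") -> str:
    # Stand-in for generating a natural-language conjecture.
    return f"conjecture for: {statement} {feedback}".strip()

def reflect(statement: str, hypothesis: str) -> Optional[str]:
    # Stand-in for self-reflection: return a critique, or None if satisfied.
    return None

def formalize(hypothesis: str) -> str:
    # Stand-in for translating the accepted conjecture into Lean 4 source.
    return "theorem toy : 6 * 7 = 42 := by rfl"

def lean_check(proof: str) -> bool:
    # Stand-in for the Lean 4 kernel; here just a trivial sanity check.
    return proof.startswith("theorem")

def prove(statement: str, max_rounds: int = 3) -> Optional[str]:
    hypothesis = propose(statement)                # 1. conjecture in prose
    for _ in range(max_rounds):
        critique = reflect(statement, hypothesis)  # 2. self-reflection
        if critique is not None:
            hypothesis = propose(statement, feedback=critique)
            continue
        proof = formalize(hypothesis)              # 3. translate to Lean 4
        if lean_check(proof):                      # 4. verifier has final say
            return proof
    return None                                    # unverified means no credit
```

The design point is the last step: the natural-language reasoning is only scaffolding, and nothing counts as solved until the external checker accepts the formal artifact.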
The results of this "shock therapy" for algorithms are already visible in benchmarks. DAP became the first framework to solve 36 PutnamBench theorems in Hard Mode. On CombiBench, the system set a new SOTA, raising the bar from 7 to 10 problems solved at Pass@16. To keep the industry on its toes, the authors also released MiniF2F-Hard and FIMO-Hard: re-annotated versions of existing benchmarks from which all auxiliary crutches have been stripped.
For CTOs and software architects, this is an important signal: reliable verification today cannot be achieved with a simple prompt-and-answer cycle; multi-step agentic workflows are required. If your AI is not capable of solving problems in Hard Mode, its vaunted logic is nothing more than an illusion generated by the training set, and using such a tool in real production is simply dangerous.