The problem with AI in serious science isn't a lack of "intelligence," but its blatant fragility when scaling. As soon as it comes to the auto-formalization of long mathematical proofs, systems begin to crumble: context drifts, dependencies get tangled, and attempting to fix a local error triggers a domino effect that destroys the entire logical chain. According to Yuanhe Zhang and a group of researchers from the University of Warwick and the RIKEN institute, the main challenge isn't the complexity of a single lemma, but managing hundreds of interdependent declarations. A model might tweak a definition to plug a hole in one place, silently invalidating all subsequent logic. The output is text that looks formally correct but is mathematically useless.
To solve this, researchers introduced LeanMarathon—a multi-agent system designed for the Lean 4 environment. At the center of the architecture is the "Blueprint," which serves as the single source of truth and the rigid skeleton of the proof. Instead of relying on a lucky prompt, the system coordinates four specialized agents:
Architect: Builds the overall structure and hierarchy of the proof.
Auditor: Conducts critical analysis and verification of logical transitions.
Executor: Handles the direct derivation and proof of specific lemmas.
Editor: Responsible for "repairs" and resolving local contradictions.
This orchestrator turns a fragile, multi-hour generation session into a series of recoverable transactions where every step is verified through adversarial checking.
Results and Efficiency
The approach was tested on four Erdős problems from recent scientific publications. In three autonomous runs, LeanMarathon successfully formalized all seven target theorems with a "nosorry" verdict—meaning no gaps remained in the proofs, covering a total of 258 lemmas. The development team believes this case proves that a reliable AI assistant in science requires more than just "smart" models; it needs rigid architectural frameworks capable of maintaining logic over the long haul.
We are seeing a clear trend: the era of talkative chatbots in serious business and science is ending, giving way to verifiable autonomous systems.
For tech leads, this is a critical signal: if your task requires 100% accuracy over a long duration, forget about scaling context windows or endless prompt tuning. The future lies in strict role separation between agents and the implementation of adversarial auditing. Industrial-grade AI reliability begins where trusting the model's word ends and total verification of every logical link begins.