The current standard for training search agents is much like judging a football match based solely on the final score. Trajectory-level rewards score only the outcome of an entire sequence of actions, so every intermediate step receives the same credit regardless of what it actually contributed. Researchers from the Beijing University of Posts and Telecommunications and Li Auto Inc. point out that this approach cannot identify which specific action led to success and which was a waste of resources. In complex tasks where search chains span dozens of iterations, distributing value uniformly turns the system's logic into an opaque "black box."
To solve this, Yuchen Liu and his team introduced the Graph-Distance Contribution Reward (GDCR) method. Instead of burning through budgets on endless tree sampling and simulations to evaluate every decision, GDCR moves the task into the realm of latent graphs. Here, the knowledge base is represented as an Entity-Relation (ER) graph, and any task becomes a pathfinding mission to a specific "answer node." The value of each step is measured by how much it reduces the graph distance to that node. This turns an agent's chaotic wandering into measurable movement toward a verifiable result.
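To make the idea concrete, here is a minimal sketch of a distance-based step reward. It is an illustration under stated assumptions, not the authors' implementation: the toy graph, the function names, the use of unweighted hop counts as the distance, and the fallback value for unreachable entities are all hypothetical.

```python
from collections import deque

def distances_to_answer(graph, answer):
    """Breadth-first search from the answer node over an adjacency dict.

    Returns the hop count to the answer for every reachable node.
    Assumption: edges are undirected and unweighted.
    """
    dist = {answer: 0}
    queue = deque([answer])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def step_rewards(graph, answer, visited):
    """Reward each step by how many hops closer it brings the agent.

    `visited` is the sequence of entities the agent lands on.
    Positive reward: moved closer. Negative: moved away. Zero: no progress.
    """
    dist = distances_to_answer(graph, answer)
    unreachable = len(graph) + 1  # hypothetical penalty distance for unknown nodes
    rewards = []
    for prev, curr in zip(visited, visited[1:]):
        d_prev = dist.get(prev, unreachable)
        d_curr = dist.get(curr, unreachable)
        rewards.append(d_prev - d_curr)
    return rewards

# Toy entity-relation graph (hypothetical): the answer is "City C".
graph = {
    "Film A": ["Director B", "Actor D"],
    "Director B": ["Film A", "City C"],
    "Actor D": ["Film A"],
    "City C": ["Director B"],
}
trajectory = ["Film A", "Actor D", "Film A", "Director B", "City C"]
rewards = step_rewards(graph, "City C", trajectory)
print(rewards)  # [-1, 1, 1, 1]
```

The detour to "Actor D" is penalized and the return trip merely recovers the lost ground, so the rewards sum to the net distance covered (two hops). A trajectory-level reward would have credited all four steps equally.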
Key takeaways from the new approach
- Shifting from final outcome evaluation to measuring progress at every stage.
- Utilizing entity-relation graphs to provide mathematical justification for actions.
- Reducing computational costs by eliminating redundant sampling.
- Improving the interpretability of autonomous agent operations.
This mechanism was integrated into the Step Advantage Policy Optimization (SAPO) framework, which combines step-by-step graph advantages with final trajectory scoring. According to the authors, benchmark tests across four datasets confirm that this hybrid approach boosts accuracy without the operational overhead typical of sampling-based methods. For business, this marks a long-awaited transition from "luck-based outcomes" to step-by-step accountability.
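The exact SAPO objective is not spelled out here, so the following is only a sketch of one plausible way to blend the two signals. It assumes group-normalized outcome scores (in the style of group-relative policy optimization) added to normalized per-step graph rewards; the function names and the `step_weight` parameter are hypothetical.

```python
from statistics import mean, pstdev

def normalize(values, eps=1e-8):
    """Center and scale a list of numbers (population standard deviation)."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]

def combined_advantages(step_rewards_per_traj, outcome_scores, step_weight=0.5):
    """Blend a trajectory-level advantage with per-step advantages.

    step_rewards_per_traj: one list of step rewards per sampled trajectory.
    outcome_scores: one final score per trajectory (e.g. 1.0 if correct).
    step_weight: hypothetical mixing coefficient for the step-level term.
    """
    # Trajectory-level term: the same value for every step of a trajectory.
    traj_adv = normalize(outcome_scores)

    # Step-level term: normalize across all steps of all trajectories.
    flat = [r for rewards in step_rewards_per_traj for r in rewards]
    mu, sigma = mean(flat), pstdev(flat)

    combined = []
    for rewards, a_traj in zip(step_rewards_per_traj, traj_adv):
        combined.append(
            [a_traj + step_weight * (r - mu) / (sigma + 1e-8) for r in rewards]
        )
    return combined

# Two sampled trajectories: the first succeeds, the second fails after a detour.
advantages = combined_advantages(
    step_rewards_per_traj=[[1, 1], [-1, 1]],
    outcome_scores=[1.0, 0.0],
)
```

In this toy case both steps of the failed trajectory end up with negative advantages, but the step that moved away from the answer is penalized more heavily than the one that moved toward it. That difference is exactly what a purely trajectory-level signal cannot express.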
Judging AI solely by its final answer is a direct path to building expensive, unpredictable systems whose errors are impossible to diagnose. Shifting to graph metrics allows us to optimize the search logic itself, making autonomous workflows in RAG systems transparent and auditable.
You are no longer just getting an answer; you are getting a controlled, cost-effective route to it, where every step is justified by a decreasing distance to the truth rather than a random coincidence.