The era of mindlessly burning resources in massive compute clusters just to squeeze a semblance of logic out of language models is hitting a wall. While OpenAI’s o1 series and DeepSeek-R1 have shown that large-scale reinforcement learning (RL) can unlock a model's analytical capabilities, the price of entry remains prohibitive for most. Standard Group Relative Policy Optimization (GRPO), which became a de facto industry standard following DeepSeek’s success, has turned out to be inefficient in practice: researchers from the Kwaipilot team at Kuaishou found that it squanders samples and struggles when a model is trained across multiple disciplines simultaneously.

A Paradigm Shift in Efficiency

The arrival of SRPO (Two-Staged history-Resampling Policy Optimization) is more than just another technical report; it is a market signal that the paradigm is shifting. The team introduced the SRPO-Qwen-32B model, which matches or exceeds the performance of DeepSeek-R1-Zero-32B while requiring roughly one tenth of the training steps. For any CTO, this invites a radical rethink of the total cost of ownership (TCO) for developing advanced reasoning systems. Competing on GPU count matters less if you can optimize for algorithmic efficiency.

Solving the Interdisciplinary Clash

The primary flaw in standard GRPO is what developers call "cross-domain optimization conflict." When you try to feed a model a mix of mathematical problems and programming code, the algorithm stalls. Mathematics requires long chains of thought (Long CoT), while code typically demands conciseness and direct answers. In a standard training cycle, these data types begin to compete: the model produces mediocre results, and its analytical depth plateaus.

According to the Kwaipilot team, SRPO marks the first time a model trained this way has reached DeepSeek-R1-Zero performance levels in both the math and coding domains simultaneously.

To break this deadlock, Kuaishou moved away from brute-force methods and implemented a two-stage training paradigm: the first stage trains on challenging math problems to build up long-form reasoning, and the second stage introduces code data. This allows the model to integrate programming skills without losing the procedural thinking developed through mathematics. As a result, the system "grows smarter" across several disciplines without the usual trade-off, where gains in one skill lead to degradation in another.

Efficiency Beyond the Batch

Beyond data handling, Kwaipilot engineers fixed a structural flaw in reward calculation. In standard GRPO, each answer's "advantage" is its reward measured relative to the other answers sampled for the same prompt. If every answer in a group receives the same score, the advantage is zero for all of them, and the group contributes nothing to the gradient. You are essentially spending money and time on compute cycles that teach the model nothing. This is a financial sinkhole leading to premature plateaus: the model stops evolving because the data isn't challenging enough and the reward variance is too slim.
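The zero-advantage problem described above can be seen in a few lines. This is a minimal sketch, not the Kwaipilot or DeepSeek implementation: the function name `group_relative_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other rollouts for the same prompt.

    Hypothetical helper: GRPO-style advantage = (reward - group mean) / group std.
    eps (an assumed constant) only guards against division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# A group with mixed outcomes yields a usable learning signal.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout gets the same reward yields all-zero advantages,
# so the compute spent generating those rollouts produces no gradient.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

In the `uniform` case every entry is exactly zero, which is the wasted step the paragraph above describes; in the `mixed` case correct answers get a positive advantage and incorrect ones a negative advantage.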

SRPO fixes this via a history-resampling mechanism, making every training step count toward the final result. Starting from the same base model as DeepSeek's 32B experiment (Qwen2.5-32B), SRPO scored 50 on the AIME24 benchmark and 41.6 on LiveCodeBench. These figures edge out DeepSeek-R1-Zero-32B while using roughly one tenth of the training steps.
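One way to picture history resampling is as a filter applied between epochs. The sketch below assumes the mechanism drops prompts whose most recent rollouts were all correct and keeps everything else, including prompts the model currently fails on every attempt, since those may start producing a signal as the policy improves. The function `resample_by_history`, the `reward_history` dictionary, and the `solved` threshold are hypothetical names, not taken from the SRPO code.

```python
def resample_by_history(prompts, reward_history, solved=1.0):
    """Return the prompts worth keeping for the next epoch.

    Assumption for illustration: a prompt is dropped only when every rollout
    recorded for it in the last epoch reached the `solved` reward.
    Prompts with no recorded history are kept.
    """
    kept = []
    for prompt in prompts:
        rewards = reward_history.get(prompt, [])
        all_correct = bool(rewards) and all(r >= solved for r in rewards)
        if not all_correct:
            kept.append(prompt)
    return kept


# Hypothetical rollout rewards collected during the previous epoch.
history = {
    "easy": [1.0, 1.0, 1.0, 1.0],    # always solved: no signal left
    "medium": [1.0, 0.0, 1.0, 0.0],  # mixed outcomes: informative now
    "hard": [0.0, 0.0, 0.0, 0.0],    # never solved yet: kept for later
}
next_epoch = resample_by_history(list(history), history)
```

Here `next_epoch` retains "medium" and "hard" and discards "easy", so later epochs spend their rollouts on prompts that can still move the gradient.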

This level of efficiency goes a long way toward democratizing the creation of reasoning models. What was once an elite club for tech giants with bottomless budgets is now becoming accessible to mid-sized technology businesses. Kuaishou’s results suggest that a model's logical prowess is a matter of training method, not just the size of the GPU budget. For business, this means the cost of developing specialized, domain-oriented intelligent systems could drop sharply, possibly by an order of magnitude if the reported savings in training steps carry over. The focus is shifting from "who has the most servers" to "who has the most efficient pipeline."
