
Advanced Reinforcement Learning Models
The field of artificial intelligence continues to evolve at an unprecedented pace, with reinforcement learning (RL) at its forefront. Recent advancements have demonstrated the transformative potential of large-scale RL, particularly in enhancing the capabilities of large language models (LLMs).
Notable successes, such as OpenAI’s o1 series and DeepSeek-R1, have showcased the ability of these models to develop sophisticated reasoning skills. However, the methodologies underlying these achievements often remain shrouded in technical complexity, leaving room for further exploration and innovation. While much of the AI community’s focus has been on mathematical reasoning, the challenge of cross-domain generalization remains largely uncharted territory.
Traditional Group Relative Policy Optimization (GRPO) approaches encounter significant hurdles, including performance bottlenecks and inefficient sample utilization, particularly when dealing with mixed-domain datasets. These obstacles hinder the effective scaling of RL methods for LLMs and call for new strategies to overcome them.
In response to these challenges, the Kwaipilot team at Kuaishou has introduced a reinforcement learning framework known as Two-Staged history-Resampling Policy Optimization (SRPO). This approach systematically addresses the training challenges plaguing existing models, offering a fresh perspective on optimizing RL methodologies. The team has publicly shared a comprehensive technical report detailing their training method and open-sourced the SRPO-Qwen-32B model, marking a significant milestone in AI development.
Reinforcement Learning Bottlenecks
During their initial explorations, the Kwaipilot team quickly identified bottlenecks within the standard GRPO algorithm that impeded the model’s performance. Key issues included:
① Cross-Domain Optimization Conflicts: Mathematical and code data pose distinct challenges, with math problems requiring detailed reasoning trajectories and code data leaning towards direct outputs. Mixing these data types led to conflicts, resulting in suboptimal performance in both domains.
② Reduced Training Efficiency: GRPO relies on the variance of non-zero rewards within sampled groups to calculate advantages. When all rollouts in a group yield identical rewards, the advantage collapses to zero and effective gradient contributions diminish, drastically reducing training efficiency (see the sketch following this list).
③ Premature Performance Saturation: Early performance plateaus and reward saturation were observed, attributed to insufficient data quality. A lack of complexity and diversity in training data hindered the model’s ability to develop the in-depth reasoning required for challenging problems.
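To make the efficiency issue in ② concrete, the minimal sketch below computes group-relative advantages by normalizing each rollout’s reward against its group’s mean and standard deviation. The function name and the use of NumPy are illustrative assumptions rather than the team’s implementation; the point it demonstrates is that a group whose rollouts all receive the same reward yields zero advantages and thus no gradient signal.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Compute GRPO-style advantages for one group of rollouts.

    Each rollout's advantage is its reward normalized by the group's mean
    and standard deviation. If every rollout receives the same reward, the
    advantages are all zero and the group contributes no gradient signal.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    std = rewards.std()
    if std < eps:  # identical rewards: no informative signal
        return np.zeros_like(rewards)
    return (rewards - rewards.mean()) / std

# A group where the model solves an easy problem every time:
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))  # -> [0. 0. 0. 0.]

# A group with mixed outcomes still produces useful advantages:
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # -> [ 1. -1.  1. -1.]
```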
To overcome these limitations, the Kwaipilot team devised a two-stage training paradigm, designed to enhance the model’s reasoning capabilities across mathematical and code domains.
Two-Stage Training Paradigm
The Kwaipilot team’s two-stage training approach effectively addresses the inherent response-length conflicts between the mathematical and code domains. This paradigm consists of:
① Stage 1: Eliciting Reasoning Abilities: Focusing exclusively on challenging mathematical data, this stage incentivizes the model to develop reflective pausing, backtracking, and step-by-step decomposition skills. By honing these capabilities, the model gains a strong foundation in reasoning.
② Stage 2: Skill Integration: In this phase, code data is introduced to build upon the reasoning foundation established in Stage 1. This integration enhances coding abilities while progressively strengthening procedural thinking, recursion, and tool-calling capabilities.
Through this methodical training strategy, the model achieves superior results in both mathematical and programming domains, showcasing its potential to seamlessly integrate reasoning skills across diverse tasks.
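As a rough illustration of how such a staged curriculum could be wired up, the sketch below switches the training data source between the two stages. The `plateaued` heuristic, the dataset variables, and the `train_epoch` callback are hypothetical stand-ins, not the Kwaipilot implementation; the report describes the transition as happening once reward growth in the math-only stage levels off.

```python
def plateaued(reward_history, window=5, tol=1e-3):
    """Treat the reward as plateaued when it barely moves over a recent window."""
    if len(reward_history) < window + 1:
        return False
    return abs(reward_history[-1] - reward_history[-1 - window]) < tol

def run_two_stage_training(policy, math_data, code_data, train_epoch, max_epochs=100):
    """Hypothetical two-stage curriculum driver (names and thresholds are illustrative)."""
    reward_history, stage = [], 1
    for _ in range(max_epochs):
        # Stage 1: math only, to elicit reflection and step-by-step reasoning.
        # Stage 2: mix in code data to integrate procedural skills.
        data = math_data if stage == 1 else math_data + code_data
        mean_reward = train_epoch(policy, data)  # one RL epoch; returns mean reward
        reward_history.append(mean_reward)
        if stage == 1 and plateaued(reward_history):
            stage = 2  # switch once math-stage reward growth levels off
    return policy
```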

Analysis of Training Data Strategies
An in-depth analysis of different training data strategies revealed significant insights into response length and benchmark performance:
① Mixed Training: Models trained on a combination of math and code data exhibited limited growth in response length and poor benchmark performance. While math problems elicited reasoning patterns, code problems often resulted in short, direct responses with minimal preliminary analysis.
② Math-Only Training: Focusing solely on mathematical data led to a stable increase in response length and excellent performance on math benchmarks. This approach fostered strong, generalizable reasoning abilities, enabling the model to attempt detailed, step-by-step reasoning even on programming tasks.
③ Code-Only Training: While showing improved performance on code benchmarks, this approach yielded minimal development of explicit reasoning behavior, and responses were noticeably shorter than under math-only training. Code solutions were often generated directly, lacking substantial step-by-step reasoning.
④ Staged Training: The two-stage approach proposed by the Kwaipilot team demonstrated superior results across both the mathematical and programming domains. The model consistently generated detailed reasoning for math problems and structured reasoning patterns for programming tasks. Notably, it exhibited complex behaviors, such as spontaneously utilizing code to assist in mathematical reasoning.

History Resampling in GRPO Training
During the mid-to-late stages of training, the Kwaipilot team observed inefficiencies in the GRPO algorithm: nearly 50% of sampled groups within a batch produced identical rewards. This issue often arose when the model consistently succeeded on easier problems, leading to negligible reward variance and ineffective gradient updates.
To address this, they introduced History Resampling, a technique designed to improve the quality of the gradient signal. By recording the reward outcomes of all rollouts within each epoch, they reconstructed the dataset for the next epoch based on specific criteria:
① Filtering Overly Simple Samples: Samples where all rollouts resulted in correct answers were excluded, as they provide no informative signal for policy improvement.
② Retaining Informative Samples: Samples with diverse outcomes were retained, since they generate positive reward variance and therefore effective gradient signals; samples where every rollout was incorrect were also kept, as these difficult problems can become informative once the policy improves. Compared to the Dynamic Sampling method proposed in DAPO, History Resampling significantly improved computational efficiency and resulted in more stable response-length growth, enhancing the overall training process. A minimal sketch of this resampling rule follows.
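The sketch below expresses this filtering rule in simplified form. The record layout (a list of dicts holding per-rollout rewards) and the binary-correctness assumption are illustrative choices, not taken from the team’s code.

```python
def history_resample(epoch_records):
    """Rebuild the next epoch's dataset from this epoch's rollout rewards.

    epoch_records: list of dicts such as
        {"sample": <prompt/answer pair>, "rewards": [1.0, 0.0, 1.0, ...]}
    Samples whose rollouts were all correct are dropped (no learning signal);
    samples with mixed or all-incorrect outcomes are kept.
    """
    kept = []
    for record in epoch_records:
        if not all(r == 1.0 for r in record["rewards"]):
            kept.append(record["sample"])
    return kept
```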

Data Curation and Quality Techniques
The Kwaipilot team’s meticulous approach to data curation played a pivotal role in their success. They applied heuristic rules to filter out irrelevant URLs and formatting noise from publicly available Code & Math datasets.
Ensuring the completeness of core fields (question and answer ground truth) was crucial to maintaining data quality. For mathematical data, they followed the data cleaning approach of PRIME, excluding multi-part questions, pure proof-based problems, and those requiring image or table understanding. For code data, problems dependent on specific environments, file I/O, or network interactions were excluded, keeping the focus on algorithmic logic.
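A simplified sketch of this kind of heuristic cleaning is shown below. The specific regular expressions and field names are assumptions chosen for illustration; the team’s actual rules are not published at this level of detail.

```python
import re

URL_PATTERN = re.compile(r"https?://\S+")
# Illustrative "formatting noise" patterns: HTML tags and runs of whitespace.
NOISE_PATTERNS = [re.compile(r"<[^>]+>"), re.compile(r"[ \t]{2,}")]

def clean_sample(sample):
    """Apply heuristic cleaning; return None if a core field is missing."""
    question = sample.get("question", "").strip()
    answer = sample.get("answer", "").strip()
    if not question or not answer:  # require complete core fields
        return None
    question = URL_PATTERN.sub("", question)
    for pattern in NOISE_PATTERNS:
        question = pattern.sub(" ", question)
    return {"question": question.strip(), "answer": answer}
```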
Before data ingestion, they conducted correctness verification for both math and code problems, discarding those with incorrect or ambiguous solutions. Subsequently, they assessed the difficulty of each problem, categorizing them into easy, medium, and hard levels based on their pass rate (Pass@k).
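As a rough illustration of pass-rate-based difficulty grading, the sketch below buckets a problem by the fraction of its k sampled solutions that pass verification. The thresholds are hypothetical placeholders; the report does not specify the exact cutoffs.

```python
def difficulty_bucket(num_passed, k, easy_threshold=0.8, hard_threshold=0.2):
    """Classify a problem as easy / medium / hard from its empirical pass rate.

    num_passed: how many of the k sampled solutions were verified correct.
    The thresholds are illustrative, not the published cutoffs.
    """
    pass_rate = num_passed / k
    if pass_rate >= easy_threshold:
        return "easy"
    if pass_rate <= hard_threshold:
        return "hard"
    return "medium"

# Example: 3 of 16 sampled solutions correct -> classified as hard.
print(difficulty_bucket(3, 16))  # -> "hard"
```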
Experimental Results of SRPO Training
The experimental results obtained using the SRPO method highlight the transformative potential of this innovative approach. Key observations included: ① Training Process: The complete reward curve and response length curve during SRPO training illustrated a steady increase in reward and response length.
After the initial reward growth plateaued, the training transitioned into the second stage. Integrating code data did not significantly increase response length, but benchmark results indicated continuous improvement in both mathematical and coding abilities.
② Reasoning Behaviors: The Kwaipilot team identified three reflective patterns: recheck, hesitation, and exploration. They observed a gradual increase in the model’s self-reflection, correction, and backtracking, indicating the emergence of a “self-verification” ability akin to human cognitive processes.
Interestingly, the model spontaneously used program code for verification when solving mathematical problems. It would work out a solution through mathematical reasoning and then proactively write program code to verify its correctness. This ability to leverage procedural thinking for self-correction and multiple attempts further underscores the model’s adaptability and sophistication.
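One simple way to quantify such reflective behavior, sketched below, is to count occurrences of characteristic phrases in sampled responses. The keyword lists and the counting approach are assumptions made for illustration; the report does not publish its exact measurement procedure.

```python
from collections import Counter

# Illustrative keyword groups for the three reflective patterns mentioned above.
REFLECTIVE_KEYWORDS = {
    "recheck": ["recheck", "double-check", "verify again"],
    "hesitation": ["wait", "hmm", "on second thought"],
    "exploration": ["alternatively", "another approach", "let's try"],
}

def count_reflective_patterns(responses):
    """Count how often each reflective pattern appears across sampled responses."""
    counts = Counter()
    for text in responses:
        lowered = text.lower()
        for pattern, keywords in REFLECTIVE_KEYWORDS.items():
            counts[pattern] += sum(lowered.count(kw) for kw in keywords)
    return counts
```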
Conclusion: The SRPO Framework
The introduction of the Two-Staged history-Resampling Policy Optimization (SRPO) framework marks a significant advancement in the field of reinforcement learning. By addressing the challenges associated with traditional GRPO methods and implementing a meticulous data curation process, the Kwaipilot team has paved the way for more efficient and effective AI training methodologies.
As AI continues to evolve, the insights gained from the SRPO approach offer valuable lessons for future developments. By fostering a deeper understanding of cross-domain generalization and optimizing training strategies, we can unlock new possibilities for AI models, equipping them with the reasoning capabilities required to tackle increasingly complex tasks. For those interested in exploring this approach further, the SRPO-Qwen-32B model is available for experimentation on platforms like Hugging Face.
The potential applications of SRPO extend beyond the realms of mathematics and programming, offering exciting opportunities for cross-domain AI advancements in a wide range of fields.
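For readers who want to try the released model, a minimal loading sketch with the Hugging Face transformers library follows. The repository identifier below is an assumption based on the model name mentioned above, so check the team’s Hugging Face page for the exact id, and note that a 32B model requires substantial GPU memory.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Kwaipilot/SRPO-Qwen-32B"  # hypothetical repository id; verify before use
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Prove that the sum of two even integers is even."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```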