Modern LLM agents demonstrate impressive multitasking capabilities, yet they suffer from a specific form of "strategic amnesia." In traditional Reinforcement Learning (RL), models improve from environmental rewards but do not accumulate reusable, general-purpose strategies. For these agents, every new task is a blank slate; without modularity, they reinvent the wheel instead of drawing from a library of ready-made skills. Even Anthropic’s Skill Creator approach, which automates skill generation, remains essentially static: it relies on human intervention and exists in isolation from the evolution of the core policy. Without that synchronization, skill creation and policy optimization can drift apart, and outdated skills can interfere with the agent's updated "brain."
The RL-in-the-Loop Architecture
Researchers have introduced ReSkill, a framework that embeds the skill-creation process directly into the RL training cycle. The architecture leverages the group structure of the Group Relative Policy Optimization (GRPO) algorithm, which samples a group of rollouts for each task and scores every rollout relative to the rest of its group. Instead of running every rollout under identical conditions, ReSkill tests competing versions of the same skill within a single group of rollouts. This allows for a direct, real-time performance comparison of different iterations. Each rollout serves a triple purpose: providing gradients for policy optimization, acting as a diagnostic tool for error detection, and serving as a proving ground for new skills.
ReSkill synchronizes skill evolution with policy training by testing competing versions within a single training cycle.
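To make the group mechanics concrete, here is a minimal Python sketch, not ReSkill's actual code. The advantage formula is the standard GRPO normalization (reward minus group mean, divided by group standard deviation). How rollouts are assigned to skill versions, and the names `group_relative_advantages`, `compare_skill_versions`, `v1`, and `v2`, are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: each reward normalized against its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def compare_skill_versions(rollouts):
    """Mean reward per skill version inside one rollout group.

    `rollouts` is a list of (skill_version, reward) pairs. This pairing is a
    hypothetical layout, not something the source specifies.
    """
    by_version = defaultdict(list)
    for version, reward in rollouts:
        by_version[version].append(reward)
    return {version: mean(rs) for version, rs in by_version.items()}


# One group of four rollouts on the same task, split across two skill versions.
group = [("v1", 0.0), ("v1", 1.0), ("v2", 1.0), ("v2", 1.0)]

# The same rewards feed both the policy gradient signal and the skill comparison.
advantages = group_relative_advantages([reward for _, reward in group])
version_scores = compare_skill_versions(group)
```

In this toy group the single failed rollout gets a negative advantage, while `version_scores` shows `v2` ahead of `v1`, so one batch of rollouts informs both the policy update and the skill comparison.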
Unlike "blind delivery" methods where external knowledge is forced upon a model without proper verification, ReSkill employs an assertion-based diagnostic mechanism. The system analyzes past failures and suggests targeted trigger-based fixes. To manage this growing repertoire of abilities, the framework uses Thompson sampling with adaptive discounting. This mathematical approach balances the exploration of new solutions with the exploitation of proven ones, ensuring the skill library expands at a pace that supports, rather than hinders, the development of the base policy.
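The exploration-exploitation trade-off described above can be sketched with Beta-Bernoulli Thompson sampling over discounted success and failure counts. This is an illustrative sketch under stated assumptions: the article does not say how ReSkill adapts its discount, so the example uses a fixed `discount` factor, and the class name `DiscountedThompsonSelector`, the skill names, and the success rates are all hypothetical.

```python
import random


class DiscountedThompsonSelector:
    """Beta-Bernoulli Thompson sampling with exponentially discounted counts."""

    def __init__(self, skills, discount=0.95, seed=None):
        # Fixed discount (assumption); an adaptive scheme would vary it over time.
        self.discount = discount
        self.successes = {s: 0.0 for s in skills}
        self.failures = {s: 0.0 for s in skills}
        self.rng = random.Random(seed)

    def select(self):
        # Draw one plausible success rate per skill from its Beta posterior
        # (uniform Beta(1, 1) prior) and pick the skill with the highest draw.
        draws = {
            s: self.rng.betavariate(1.0 + self.successes[s], 1.0 + self.failures[s])
            for s in self.successes
        }
        return max(draws, key=draws.get)

    def update(self, skill, succeeded):
        # Decay all evidence first, so results gathered under an older policy
        # count for less than recent ones.
        for s in self.successes:
            self.successes[s] *= self.discount
            self.failures[s] *= self.discount
        if succeeded:
            self.successes[skill] += 1.0
        else:
            self.failures[skill] += 1.0

    def posterior_mean(self, skill):
        a = 1.0 + self.successes[skill]
        b = 1.0 + self.failures[skill]
        return a / (a + b)


# Toy simulation: two competing skill versions with made-up success rates.
true_rates = {"skill_v1": 0.1, "skill_v2": 0.9}
selector = DiscountedThompsonSelector(true_rates, discount=0.95, seed=0)
env = random.Random(1)
picks = {s: 0 for s in true_rates}
for _ in range(500):
    chosen = selector.select()
    picks[chosen] += 1
    selector.update(chosen, env.random() < true_rates[chosen])
```

Because unused evidence decays back toward the uniform prior, the weaker skill is still sampled occasionally. That is the point of discounting here: a skill that was poor under an earlier policy gets another chance once the policy has changed.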
Performance Gains and Constraints
ReSkill outperforms classical memory-based RL methods across multiple domains. The largest gains were recorded on tasks the agent had never encountered before, which suggests that the skills are genuinely transferable rather than overfitted to a training set. The lifecycle of a skill in ReSkill is dynamic: skills are automatically created, tested, refined, and ruthlessly purged if they cease to provide value. This co-evolution prevents "strategic drift," a situation where an agent's knowledge base stagnates while its decision-making logic advances.
The system utilizes GRPO to minimize computational overhead during training. Thompson sampling ensures a mathematical balance between trying new tactics and using reliable ones. Skills are treated as dynamic assets that are deleted if they no longer contribute to the reward goal.
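The create, test, refine, and purge lifecycle can be pictured as a small bookkeeping structure. The sketch below is purely illustrative: the article does not give ReSkill's retirement criterion, so the "lift" measure (mean reward with the skill minus mean reward without it), the `min_trials` and `min_lift` thresholds, and every class and skill name are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class SkillRecord:
    name: str
    version: int = 1
    with_skill: list = field(default_factory=list)     # rewards when the skill was used
    without_skill: list = field(default_factory=list)  # rewards when it was not

    def lift(self):
        """Mean reward with the skill minus mean reward without it."""
        if not self.with_skill or not self.without_skill:
            return None
        with_mean = sum(self.with_skill) / len(self.with_skill)
        without_mean = sum(self.without_skill) / len(self.without_skill)
        return with_mean - without_mean


class SkillLibrary:
    def __init__(self, min_trials=4, min_lift=0.0):
        self.min_trials = min_trials  # hypothetical threshold
        self.min_lift = min_lift      # hypothetical threshold
        self.skills = {}

    def create(self, name):
        self.skills[name] = SkillRecord(name)

    def record(self, name, reward, used_skill):
        rec = self.skills[name]
        (rec.with_skill if used_skill else rec.without_skill).append(reward)

    def refine(self, name):
        # A revised skill starts a new version with fresh statistics.
        rec = self.skills[name]
        self.skills[name] = SkillRecord(name, version=rec.version + 1)

    def prune(self):
        """Remove skills that have enough trials and no positive lift."""
        retired = []
        for name, rec in list(self.skills.items()):
            lift = rec.lift()
            if len(rec.with_skill) >= self.min_trials and lift is not None and lift <= self.min_lift:
                retired.append(name)
                del self.skills[name]
        return retired


library = SkillLibrary()
library.create("retry_on_timeout")
library.create("verbose_logging")
for r_with, r_without in [(1, 0), (1, 1), (1, 0), (0, 0)]:
    library.record("retry_on_timeout", r_with, used_skill=True)
    library.record("retry_on_timeout", r_without, used_skill=False)
for r_with, r_without in [(0, 1), (1, 0), (0, 1), (0, 0)]:
    library.record("verbose_logging", r_with, used_skill=True)
    library.record("verbose_logging", r_without, used_skill=False)

retired = library.prune()
remaining = sorted(library.skills)
```

In this made-up data, `retry_on_timeout` raises the mean reward and stays, while `verbose_logging` lowers it and is retired.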
However, autonomy comes at a cost. While GRPO adds only marginal overhead, the iterative nature of Thompson sampling and the requirement for group rollouts put pressure on computational resources. The system's effectiveness is also tied to the quality of environment rewards; ReSkill thrives where success is clearly and transparently measurable. This represents a major paradigm shift: we are moving from manual knowledge management to systems where the agent is responsible not only for the task at hand but also for writing (and revising) its own operating manual.
The transition from static libraries to self-cleaning skill ecosystems is a necessary step toward autonomous agents capable of mastering heterogeneous environments without a handler. The system's capacity for self-diagnosis via failure analysis makes it viable in the long term, even if it remains dependent on the density of feedback signals from the environment. Ultimately, ReSkill offers an architectural fix for scalability, turning every failure into a building block for future success rather than a reason to reset.