
RAGEN: New AI training method launched by ex-DeepSeek researcher and partners




  • RAGEN is a new system for training reliable AI agents developed by researchers from Northwestern, Microsoft, Stanford, and the University of Washington
  • The team includes Zihan Wang, a former DeepSeek researcher now completing his PhD at Northwestern
  • RAGEN tackles the "Echo Trap" problem where AI agents lose reasoning abilities during reinforcement learning
  • StarPO framework focuses on entire decision-making processes rather than single responses
  • The system is now available as an open-source project on GitHub
  • RAGEN was tested on three symbolic environments: Bandit, Sokoban, and Frozen Lake
  • StarPO-S adds stabilization features that prevent training collapse

Introduction to RAGEN and the Team Behind It

So 2025 was poised to be the year when AI agents would take over the business world. But that hasn't exactly happened yet. Most AI agents are stuck in testing phases and can't handle real-world tasks reliably enough.

But good news is here! A team of smart folks from Northwestern University, Microsoft, Stanford, and the University of Washington has made something called RAGEN. It's a new way to train AI agents that could make them work better for companies. The team includes Zihan Wang, who used to work at DeepSeek and is now finishing his computer science PhD at Northwestern.

What makes RAGEN different? It focuses on helping AI handle back-and-forth conversations where the computer needs to remember stuff, think on its feet, and deal with surprises. This is exactly what businesses need if they want AI to do actual useful work instead of just answering simple questions.

As more companies experiment with AI solutions like Google's Gemini 2.5 Flash, new training methods like RAGEN could help overcome the limitations that keep AI agents from being truly useful in business settings.

The Problem RAGEN Solves: The "Echo Trap" in AI Agent Training

Have you ever wondered why AI assistants sometimes get dumber over time? The RAGEN team did, and they found something they call the "Echo Trap."

Here's what happens: when you first train an AI with reinforcement learning (that's teaching by giving rewards), it starts out making thoughtful, well-reasoned responses. But after a while, it finds shortcuts. The AI discovers that certain words or patterns get rewards more often, so it keeps using them over and over, even when they don't make sense.

Wang explained this in a widely shared post online. He showed that you can actually measure when this happens: the AI's reasoning traces start to disappear, reward variance falls off a cliff, and gradient spikes appear during training.
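
Those warning signs are concrete enough to monitor automatically. Here's a minimal Python sketch of a collapse detector, assuming you log per-batch rewards and gradient norms during training; the function name and thresholds are illustrative, not taken from the RAGEN code.

    import numpy as np

    def detect_echo_trap(reward_history, grad_norm_history,
                         window=50, var_floor=0.05, spike_factor=10.0):
        """Heuristic check for the two measurable collapse symptoms:
        reward variance falling off a cliff and gradient spikes.
        Thresholds here are illustrative, not from the RAGEN paper."""
        rewards = np.asarray(reward_history[-window:])
        grads = np.asarray(grad_norm_history[-window:])
        variance_collapsed = rewards.var() < var_floor
        gradient_spiking = (grads > spike_factor * np.median(grads)).any()
        return variance_collapsed and gradient_spiking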

This "Echo Trap" is a big reason why AI agents aren't ready for serious business use yet. They might look smart at first, but they get stuck in bad habits that make them unreliable. With economic warnings on many business leaders' minds, companies need AI systems they can truly depend on.

How StarPO Works: State-Thinking-Actions-Reward Policy Optimization

The RAGEN team built something called StarPO, which stands for State-Thinking-Actions-Reward Policy Optimization. That's a mouthful! But it's basically a new way to teach AI agents that helps them learn from experience rather than just memorizing stuff.

StarPO works in two main phases that switch back and forth:

  1. Rollout phase: The AI creates complete conversation sequences, including its thinking steps. It's like watching someone work through a math problem while talking out loud about each step.

  2. Update phase: The system scores how well those conversations worked and updates the model using something called "normalized cumulative rewards."

This approach is more stable than other ways of training AI. It also makes it easier to see how the AI is making decisions because you can follow its thinking process.
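
To make the two phases concrete, here's a minimal Python sketch of one rollout-then-update cycle. The env and policy interfaces are stand-ins invented for illustration, and the mean/std normalization shown is one common reading of "normalized cumulative rewards," not necessarily the paper's exact formula.

    import numpy as np

    def starpo_training_step(policy, env, optimizer, batch_size=8):
        # Rollout phase: generate complete multi-turn trajectories,
        # keeping the model's reasoning text alongside each action.
        trajectories = []
        for _ in range(batch_size):
            state, done, traj = env.reset(), False, []
            while not done:
                thought, action = policy.act(state)  # reasoning + chosen action
                state, reward, done = env.step(action)
                traj.append({"thought": thought, "action": action, "reward": reward})
            trajectories.append(traj)

        # Update phase: score whole trajectories, normalize cumulative
        # rewards across the batch, and adjust the policy accordingly.
        returns = np.array([sum(s["reward"] for s in t) for t in trajectories])
        advantages = (returns - returns.mean()) / (returns.std() + 1e-8)
        loss = policy.policy_loss(trajectories, advantages)
        loss.backward()      # standard backprop
        optimizer.step()     # then the optimizer update

Because the whole trajectory is scored, a clever final answer can't rescue sloppy intermediate reasoning, which is the point of optimizing entire decision-making processes rather than single responses.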

The team used AI models called Qwen 1.5 and Qwen 2.5 from Alibaba for their tests. They picked these because anyone can access the full models (they're "open weights") and they're good at following instructions. This made it easier for the researchers to compare results fairly across different tests.

As Nvidia faces challenges from recent US restrictions, open research like RAGEN becomes even more important for advancing AI technology globally.

Testing Environments: Bandit, Sokoban, and Frozen Lake

To test if RAGEN actually works, the team created three special environments. These aren't real-world business problems but simplified tests that help show if the AI is thinking well.

Bandit: This is a one-turn game with chance involved. The AI has to weigh risks and rewards using symbols. For example, the AI might need to choose between a "Dragon arm" and a "Phoenix arm" based on clues about which gives better rewards.

Sokoban: This is a multi-turn puzzle game where moves can't be taken back. The AI has to plan several steps ahead and avoid getting stuck.

Frozen Lake: Another multi-turn game, but with random elements. The AI has to adapt its plan as things change unexpectedly.

These test environments aren't exactly like real business problems, but they help show if the AI can make smart decisions based on reasoning rather than just guessing or using shortcuts.

The Bandit test is particularly interesting. Instead of telling the AI the exact chances of winning with each choice, they use symbols and metaphors. For example, the Dragon might represent "strength" and the Phoenix might represent "hope." The AI has to figure out what these symbols mean for its chances of success.
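
For a sense of what such a symbolic environment might look like in code, here's a toy Python version. The arm names follow the article's example, but the hidden payout probabilities and the interface are made-up assumptions for illustration.

    import random

    class SymbolicBandit:
        # Hidden win probabilities; the agent sees only the symbols.
        ARMS = {"Dragon": 0.7, "Phoenix": 0.4}  # made-up values

        def reset(self):
            return "Choose an arm: Dragon (strength) or Phoenix (hope)."

        def step(self, arm):
            reward = 1.0 if random.random() < self.ARMS[arm] else 0.0
            return None, reward, True  # one-turn game: episode ends at once

    env = SymbolicBandit()
    print(env.reset())
    _, reward, done = env.step("Dragon")
    print(f"reward={reward}, done={done}")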

As businesses face market reactions to tariffs, they need AI systems that can adapt to changing economic conditions - similar to how RAGEN agents learn to adapt in the Frozen Lake environment.



StarPO-S: The Stabilized Training Framework

The original StarPO framework worked okay, but the AI still eventually fell into the Echo Trap. So the team created StarPO-S, which adds three important improvements:

  1. Uncertainty-based rollout filtering: This focuses training on situations where the AI isn't sure what to do. It's like a teacher spending more time on problems a student finds difficult.

  2. KL penalty removal: This technical change gives the AI more freedom to try new approaches instead of sticking too close to what it did before.

  3. Asymmetric PPO clipping: This boosts the importance of successful attempts more than it punishes failures. It's like giving extra credit for great work but being gentle with mistakes (items 1 and 3 are sketched in code after this list).
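
To make items 1 and 3 concrete, here's a minimal Python sketch of what uncertainty-based filtering and asymmetric clipping could look like. The function names, keep fraction, and clip values are illustrative assumptions, not settings from the RAGEN codebase.

    import numpy as np
    import torch

    def filter_uncertain_rollouts(rollout_groups, keep_fraction=0.25):
        """Keep the rollout groups whose rewards vary the most, i.e.
        the situations where the agent is least sure what to do. Each
        group is a list of episode rewards for one starting state."""
        ranked = sorted(rollout_groups, key=np.std, reverse=True)
        return ranked[:max(1, int(len(ranked) * keep_fraction))]

    def asymmetric_ppo_loss(log_probs, old_log_probs, advantages,
                            clip_low=0.2, clip_high=0.28):
        """PPO-style loss with a wider upper clip than lower clip, so
        positive-advantage (successful) trajectories can push the
        policy harder than failures pull it back."""
        ratio = torch.exp(log_probs - old_log_probs)
        clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
        # Standard PPO pessimism: take the smaller of the two objectives.
        return -torch.min(ratio * advantages, clipped * advantages).mean()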

When they put these changes together, something amazing happened: the Echo Trap problem was delayed or eliminated entirely across all three test environments. Wang summed it up simply: "StarPO-S… works across all 3 tasks. Relieves collapse. Better reward."

This breakthrough could be critical as businesses look to deploy AI in challenging economic times. Just as Fed Chair Powell issues warnings about economic uncertainties, companies need AI systems that remain stable and reliable even under pressure.

What Makes Good Agentic AI Models: Key Dimensions

The RAGEN team didn't just build a better training system - they figured out what makes AI agents good at learning in the first place. They found three important factors:

Task diversity: AI needs to see lots of different starting situations. It's like how people learn better when they practice with many different examples, not just the same problem over and over.

Interaction granularity: Allowing the AI to take multiple small actions per turn helps it plan better. Instead of making one big choice, it can break problems down into smaller steps.

Rollout freshness: The training data needs to match what the AI is currently doing. Using old examples from earlier versions of the AI can confuse the learning process.
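
If you were building a similar trainer, those three dimensions might surface as configuration knobs roughly like this sketch; the field names and default values are hypothetical, not from RAGEN.

    from dataclasses import dataclass

    @dataclass
    class AgentTrainingConfig:
        # Task diversity: distinct starting situations sampled per batch.
        initial_states_per_batch: int = 32
        # Interaction granularity: small actions allowed per turn.
        actions_per_turn: int = 5
        # Rollout freshness: regenerate rollouts with the current policy
        # every N updates instead of reusing stale ones (1 = always fresh).
        rollout_refresh_interval: int = 1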

The team also created a demo website on GitHub that shows the full conversations between the AI and its training environment. You can see not just what the AI did, but how it thought about each step. For example, when solving a math problem, you might see the AI first think about "I need to isolate the variable" before giving an answer like "x = 5."
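
Going by that description, a single logged step in such a demo might look roughly like the Python record below; the field names are hypothetical.

    step = {
        "turn": 1,
        "thinking": "I need to isolate the variable.",
        "action": "x = 5",
        "reward": 1.0,
    }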

This kind of transparency could help address concerns raised in cases like Mark Zuckerberg's FTC trial, where algorithmic transparency has become a major regulatory focus.

Real-World Applications and Future of RAGEN

RAGEN isn't quite ready for companies to use right away. The paper shows it works well on simple test problems, but there are still questions about how it would handle real business tasks like processing invoices or helping customers.

One big challenge is that the training still eventually breaks down over very long periods, even with the improvements in StarPO-S. The researchers are still working on ways to keep the AI's reasoning abilities strong over time.

The good news is that RAGEN is now available as an open-source project. Anyone can download it from GitHub at https://github.com/RAGEN-AI/RAGEN. However, at the time of writing, the project didn't have a clear license, which might limit how people can use or share it.

RAGEN represents an important step toward AI that can truly think, plan, and learn from its own actions. As businesses face disruptions like communication platform outages, having more reliable AI agents could provide crucial backup systems and support.

For companies watching market reactions to tariffs and other economic changes, RAGEN-style AI could eventually help analyze complex scenarios and provide more nuanced recommendations than current systems.

Frequently Asked Questions

What does RAGEN stand for?

RAGEN doesn't appear to be an acronym. It's the name of the system for training and evaluating AI agents developed by the research team from Northwestern, Microsoft, Stanford, and the University of Washington.

Who created RAGEN?

RAGEN was created by a collaborative team from Northwestern University, Microsoft, Stanford, and the University of Washington. A key member is Zihan Wang, a former DeepSeek researcher who's currently completing a computer science PhD at Northwestern.

What problem does RAGEN solve?

RAGEN addresses the "Echo Trap" problem where AI agents trained with reinforcement learning initially generate well-reasoned responses but eventually develop shortcuts and repetitive behaviors that degrade performance.

What is StarPO?

StarPO stands for State-Thinking-Actions-Reward Policy Optimization. It's the custom reinforcement learning framework that RAGEN is built on, which focuses on entire decision-making trajectories rather than just one-step responses.

How does StarPO-S improve on the original StarPO?

StarPO-S adds three key improvements: uncertainty-based rollout filtering, KL penalty removal, and asymmetric PPO clipping. These changes help delay or eliminate training collapse and improve performance across different tasks.

Can businesses use RAGEN right now?

While RAGEN is available as an open-source project on GitHub, there are still questions about how well it would transfer to real-world business applications beyond the symbolic test environments used in the research.

What AI models did the researchers use to test RAGEN?

The researchers implemented and tested the framework using fine-tuned variants of Alibaba's Qwen models, including Qwen 1.5 and Qwen 2.5. These models were chosen for their open weights and robust instruction-following capabilities.

How might RAGEN affect the future of AI agents?

RAGEN represents an important step toward developing AI agents that can maintain their reasoning abilities during extended training and use. This could eventually lead to more reliable AI assistants for business applications, similar to how global leaders are bringing innovation to various fields.
