Master AI Agent Quality: Google's New Flywheel Skill

You’ve poured your heart into shipping an AI agent. It’s working beautifully, or so you think. You make a small tweak to a prompt, fixing a user complaint, and it looks perfect on the three examples you test. But then the nagging question hits: did I just inadvertently break ten other things?

This chasm between “looks good on a few examples” and “actually better in production” is the daily reality for anyone building AI agents. Most teams have evaluation cases, and most teams tweak prompts. Yet surprisingly few connect the two with enough rigor to know whether a change improved performance or just gave off a good “vibe.”

The Hidden Failures of AI Agents

The scariest agent failures aren’t the loud, obvious crashes. They’re the subtle ones – agents that seem to be working, delivering confident answers and well-structured plans, but quietly miss the user’s true objective. At Cloud Next ’26, we introduced the concept of agent quality as a three-phase flywheel: Build & Test → Ship & Monitor → Learn & Refine, along with its foundational components.

Today, we’re unveiling the developer-facing path to mastering this flywheel: a specialized skill your coding agent can install and autonomously manage for you. This methodology, with its core AutoRaters, is built on the same principles we use internally at Google to evaluate and enhance our own models and first-party agents, developed in close collaboration with Google DeepMind.

Introducing the Build & Test Skill for Agents

This new skill focuses primarily on the “Build & Test” phase – the rapid iteration loop – and expands it into five concrete stages. Its utility isn’t limited to development: the same stages can be run against production traces, bringing more of the flywheel within reach over time. On the first pass, run all five stages in order; after that, loop through stages 2-5 until your quality targets are met.
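As a rough illustration of that control flow only, the sketch below runs every stage once and then repeats all stages except the first until a target score is reached. The stage callables, the "score" and "cycles" keys, and the max_cycles cap are hypothetical placeholders, not the skill's actual stages or interface.

```python
from typing import Callable, List

Stage = Callable[[dict], dict]  # hypothetical: a stage takes state and returns new state


def run_flywheel(stages: List[Stage], target: float, max_cycles: int = 5) -> dict:
    """Run all stages once, then repeat stages 2..N until the target is met."""
    state: dict = {"score": 0.0, "cycles": 0}

    # First pass: every stage, in order.
    for stage in stages:
        state = stage(state)
    state["cycles"] = 1

    # Later passes: skip stage 1 and loop over the remaining stages.
    while state["score"] < target and state["cycles"] < max_cycles:
        for stage in stages[1:]:
            state = stage(state)
        state["cycles"] += 1
    return state
```

The cap on cycles is a design choice in this sketch so that a target that is never reached cannot loop forever.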

Many failing cases take several iterations before metrics genuinely improve, and the skill enforces that discipline. The optimizer and the evaluator also remain separate: whatever proposes a fix (your coding agent, an automated optimizer, or even you) never grades it. Instead, the Gemini Enterprise Agent Platform GenAI evaluation service scores it independently.

An optimizer that grades its own work learns to game the metric rather than truly improving the agent. Keeping the grader independent is a small architectural choice, but it is what keeps the evaluation objective.
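One way to picture that separation is a loop in which the proposer and the grader are passed in as two unrelated functions. This is a minimal sketch under assumptions: the names improve, propose, and grade are hypothetical, and in the real workflow the grading would be done by the GenAI evaluation service rather than a local function.

```python
from typing import Callable, List, Tuple

Propose = Callable[[str], str]          # hypothetical: returns a revised prompt
Grade = Callable[[str, List[str]], float]  # hypothetical: scores a prompt on fixed cases


def improve(prompt: str, propose: Propose, grade: Grade,
            cases: List[str], rounds: int = 3) -> Tuple[str, float]:
    """Keep a candidate only when the independent grader scores it higher."""
    best_prompt, best_score = prompt, grade(prompt, cases)
    for _ in range(rounds):
        candidate = propose(best_prompt)
        # The proposer never sees or sets this number; only the grader does.
        score = grade(candidate, cases)
        if score > best_score:
            best_prompt, best_score = candidate, score
    return best_prompt, best_score
```

Because the proposer has no access to the scoring function, it cannot report its own success; a candidate survives only if the separate grader says it is better on the same cases.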

In short, the skill combines methodology and orchestration, all running within your coding agent. It selects the most appropriate metric for your goal, invokes the GenAI evaluation service, interprets the verdicts, proposes potential fixes, and provides clear before-and-after comparisons.

It ships in two convenient packages, both leveraging the same GenAI evaluation service. You can choose the one that best fits your development stack:

For ADK and agents-cli users: Install via npx skills add https://github.com/google/agents-cli --skill google-agents-cli-eval
For SDK-driven agents (any framework): Install via npx skills add https://github.com/google/skills --skill agent-platform-eval-flywheel

A Real-World Example: Fixing a Travel Concierge Agent

Let’s walk through a single cycle on a real-world agent. What’s remarkable is how little the developer needs to interact with the evaluation CLI or even name specific metrics. They simply install the skill, articulate a concern in plain language, approve a plan, and review the results. The skill handles the “how,” and its most intelligent decision is determining which metric can even detect the stated failure.

The agent in question is “travel-concierge,” an ADK multi-agent trip planner from google/adk-samples. It manages a working itinerary in session state, which, in this scenario, led to a subtle, specific failure. The developer’s initial prompt to the skill was concise: “I’m worried that the agent does not honor revisions users make mid-conversation, e.g., ‘Change party size to 4’.”

This is the entire interface – no flags, no metric names. The skill’s job is to translate this goal into the correct evaluation. It first analyzes the agent’s code, then proposes a plan. The plan uses two built-in multi-turn AutoRaters and adds a custom rubric targeted at the revision-honoring behavior the developer described, so the same measure can be compared before and after a fix.

Upon approval, the skill orchestrates the User Simulator to synthesize diverse scenarios, then grades the traces with both the built-in raters and its custom “revision_honored” rubric. Only a handful of cases failed, so the skill read the verdicts directly instead of clustering them. The key finding: the agent often performed the right action internally (correct value stored, right tool called) but then contradicted itself by echoing the stale value in its final message to the user.

This highlights the “looks like it’s working” failure in miniature. Nothing crashed, the plan seemed fine, and the agent sounded correct, but the user received incorrect information. The root cause: the agent’s instruction didn’t tell it to cross-reference its final response with the user’s most recent message before sending.
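To make the failure pattern concrete, here is a deliberately simplified, deterministic stand-in for what a revision-honoring check looks for. It is not how the skill's “revision_honored” rubric is implemented; the Trace fields, the regular expression, and the example wording are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class Trace:
    requested_party_size: int  # what the user's latest revision asked for
    stored_party_size: int     # what ended up in session state
    final_message: str         # what the agent told the user


def revision_honored(trace: Trace) -> bool:
    """Pass only if state AND the final message both reflect the revision."""
    if trace.stored_party_size != trace.requested_party_size:
        return False
    # Collect every party size the final message mentions (toy pattern).
    mentioned = {
        int(n)
        for n in re.findall(r"\b(\d+)\s+(?:people|guests|travelers)\b",
                            trace.final_message)
    }
    # The message must state the requested size and no other size.
    return mentioned == {trace.requested_party_size}
```

A trace with the correct value stored but a final message that still says “2 guests” fails this check, which is the case described above: right action internally, stale value in the reply.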

The Solution and the Rerun

The fix was approved: a targeted instruction update for the root agent, adding three sentences that require it to reconcile its final response with the latest user revision. The skill then automatically re-ran the exact same evaluation. This small, targeted intervention moved the “ignored revision” metric from 21% down to 5%, a drop of 16 percentage points.
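A before-and-after comparison is only meaningful when both runs cover the same cases, and an aggregate rate alone can hide cases that newly broke. The sketch below shows one way to compare per-case verdicts; the function name, the pass/fail dictionaries, and the case IDs in the usage note are hypothetical, not output from the skill.

```python
from typing import Dict


def compare_runs(before: Dict[str, bool], after: Dict[str, bool]) -> dict:
    """Compare pass/fail verdicts for the same case IDs across two runs."""
    if not before or before.keys() != after.keys():
        raise ValueError("both runs must cover the same non-empty set of cases")
    n = len(before)
    return {
        "fail_rate_before": sum(not ok for ok in before.values()) / n,
        "fail_rate_after": sum(not ok for ok in after.values()) / n,
        # Cases that failed before and pass now.
        "fixed": sorted(c for c in before if not before[c] and after[c]),
        # Cases that passed before and fail now: the "did I break something" list.
        "regressed": sorted(c for c in before if before[c] and not after[c]),
    }
```

For example, with made-up verdicts where case "a" is fixed but case "d" newly fails, the failure rate is unchanged while the "regressed" list still surfaces "d".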

This illustrates the value of pairing a stable custom measure (like the “revision_honored” rubric) with the adaptive built-in AutoRaters. The built-ins provide a broad health signal, while the custom rubric isolates and quantifies the specific behavior you want to improve.

From Dev to Production: A Continuous Flywheel

The User Simulator is invaluable during development, when real usage is scarce. As an agent matures and serves live traffic, production sessions become the most valuable input. Each one represents a genuine request (from a user, another agent, or an upstream service), and each failure becomes a ready-made test case for the next cycle. The same evaluation stages are then fed by real usage instead of simulation.

The same skill can be pointed at production traces. It skips inference and grades them in place with the same AutoRaters. When online monitors detect quality drift in live traffic, you can feed the failing traces to the same skill and start the familiar eval-fix loop. The result is a continuous flywheel: on demand in development, continuous in production, all graded by the same AutoRaters.
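The idea of grading in place can be sketched as a single evaluation function that runs the agent only when a case has no recorded response. Everything here is an assumption for illustration: EvalCase, run_agent, and grade are hypothetical names, and the real grading is performed by the AutoRaters, not a local callable.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class EvalCase:
    user_input: str
    # Set for production traces; None for simulated or hand-written cases.
    recorded_response: Optional[str] = None


def evaluate(cases: List[EvalCase],
             run_agent: Callable[[str], str],
             grade: Callable[[str, str], bool]) -> List[bool]:
    """Grade every case with the same grader, skipping inference for traces."""
    verdicts = []
    for case in cases:
        if case.recorded_response is not None:
            response = case.recorded_response  # production trace: grade as-is
        else:
            response = run_agent(case.user_input)  # dev case: run inference
        verdicts.append(grade(case.user_input, response))
    return verdicts
```

Using one grader for both paths is what keeps development and production scores comparable in this sketch.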

Today, the skill runs the inner loop on demand and grades production traces as directed. The future direction is to empower it to drive more of the outer loop autonomously, watching monitors, surfacing regressions, and proposing fixes as your traffic patterns evolve. Your agent doesn’t need to be perfect; it simply needs to be improvable.

Source: Google Developers Blog

Kristine Vior
