The agent optimization loop and how we built it in Foundry

Microsoft has built an optimization loop for AI agents that automates the process of fixing quality failures without causing regressions, addressing a key operational challenge teams face when running multiple production agents. The system, now available in Foundry, closes the gap between identifying what is wrong through traces and evaluations and shipping a better version without breaking other scenarios. This matters because manual prompt engineering and guess-and-check fixes become untenable as organizations scale from one to 20 agents across different domains.

Improving agent quality at scale is one of the hardest operational problems teams face once agents are running in production. We’ve been working to close the gap between seeing what’s wrong and shipping a better version without breaking everything else. This post explains the thinking behind a new optimization loop for agents (https://aka.ms/agent-optimizer-foundry-quickstart), what we learned building it, and how you can run it today.

From craft and intuition to traces, evals, and a quality conundrum

If you’re building production agents, you’ve probably walked a version of this path:

You started with prompt engineering. Wrote the system instruction, iterated on it, got the agent to mostly work. You and the model, in a tight feedback loop of “try it, read the output, tweak the prompt.” This phase is craft. It’s intuition-driven, and it gets you surprisingly far.

Then you added traces: OpenTelemetry, App Insights, whatever your stack uses. This is good engineering practice, but it’s also necessary. You couldn’t understand what the agent was actually doing without them. Now you can see the reasoning chain, the tool calls, the decisions. You have visibility.

Then came evaluation. At first, it was vibes: reading traces, gut-checking whether the output felt right. Over time you got more rigorous.
You defined metrics, set guardrails, and established quality bars. Maybe you built a scoring rubric across multiple dimensions (policy compliance, cost-awareness, escalation accuracy). Now you can measure quality, not just feel it. You know your pass rate. You know which scenarios break.

And then you hit the wall.

Let’s say you ship a travel-approver agent. It calls three tools: lookup travel policy, check department budget, and get flight alternatives. It returns an approve, deny, or escalate decision. The first week looks clean. Then finance flags a $4,800 trip that was approved without VP sign-off.

You pull the trace: The tools ran, the loop completed, and the output was confident. The agent just never called the budget-check tool. You find the gap in the instruction, so you add a rule about cost thresholds. Re-run the eval. That case passes. But now the emergency-travel override that used to work flawlessly starts escalating everything. You try a different wording. The emergency case recovers. Two other scenarios regress.

You have the traces. You have the evals. You can see exactly what’s wrong and measure exactly how wrong it is. But fixing it without breaking something else? That’s where you’re stuck. The data you painstakingly collected just sits there while you manually guess-and-check your way through configuration changes.

And it compounds. This might be tolerable for one agent. But if you’re operating five, 10, 20 agents across different domains, each with their own failure modes, their own evaluators, their own regression risks, the manual loop becomes untenable. You can’t individually nurse each agent through prompt revisions and hope nothing else breaks.

You’re not debugging anymore. You’re searching. And you’re doing it without a map.

Reframing the problem

Most teams treat agent improvement like debugging: Find the broken thing, then fix it. But an agent that skips a budget check isn’t “broken” the way a null pointer validation is broken.
Its instruction just doesn’t encode enough constraint for that scenario. There are dozens of possible instruction variants that might fix it, and most of them regress something else.

In traditional software, when a test fails, you know what to fix. The stack trace points at a function. You patch the function, run the test suite, then confirm nothing else broke. With agents, quality failures could live in any of a dozen places: the system instruction, the model, a tool description, a skill definition. There’s no stack trace pointing at the broken line. The problem could be in any of those places, or several at once, and you can’t isolate it the way you’d isolate a bug.

But here’s what you might not have noticed: You already have almost everything you need. Your traces contain the failure signal. Your evaluators contain the quality definition. What’s missing is the loop that connects them. The loop that goes from “I see what’s broken” to “here’s a better configuration, scored against everything, ready to ship.”

We built a system that does for agent configurations what your CI pipeline does for code: Propose a change, score it against the full evaluation suite, and only promote it if quality holds across the board.

If you’ve done hyperparameter tuning, this will feel familiar. The optimizer explores a configuration space the same way a sweep explores learning rates and architectures. The difference is that the search dimensions are instructions, skills, tool definitions, and model selection instead of numeric parameters.

The optimization loop

You already have the pieces:

- An agent running in production (model, instructions, skills, tools)
- Evaluators that score quality across multiple dimensions
- Traces from real usage

The optimizer takes all three as input and runs a four-step loop. Each step is something you’d otherwise grind through manually; the system handles the heavy lifting.

1. The optimizer generates candidates.
It searches across instructions, models, skills, and tool definitions. These aren’t random mutations. A reflector model reads traces from your evaluations, identifies why the agent scored poorly, and proposes targeted changes (more on the reflector shortly; it turned out to be the most important piece of the puzzle).

2. Candidates are scored and ranked. Same evaluators, same dataset, deterministic comparison. Every candidate is measured against the same bar your baseline was. Per-dimension scoring (policy compliance, cost-awareness, routing accuracy) means you can see exactly what improved and what regressed.

3. A developer reviews and decides. The loop isn’t completely autonomous. You look at what changed, why the optimizer proposed it, and whether the improvement is real. If it doesn’t look right, you reject and re-run (optionally with updated evaluators or a different search configuration). If it passes your judgment, you approve. This is deliberate. Automation without oversight compounds errors.

4. The winner ships as the next version. Versioned, reversible, auditable. This updates your agent’s configuration: same model, same tools, better instructions. If the new version underperforms in production, you roll back.

After shipping, production telemetry accumulates: user feedback, reviewer overrides, scenarios your eval set didn’t cover. This signal doesn’t flow directly into the optimizer. It flows into you: your decision to update evaluators, add new test cases, and trigger another optimization run. The optimizer works from your evaluations; production tells you what to measure next.

There’s more to say about how the optimizer explores this space internally: the search techniques, the tradeoffs, how the reflector generates hypotheses. That’s beyond what we can cover here. But one finding from inside the optimizer is worth pulling out.

What actually moves the needle

The optimizer isn’t just randomly mutating prompts.
The central piece is a reflector: a separate model whose only job is to read failing traces and reason about why the agent scored poorly. It then proposes targeted edits for the next round.

Here’s what we found: The quality of that reflector, the model doing the diagnosis, has a disproportionate impact on outcomes. More so than the agent’s own model. More so than tuning other parameters in the search. This held across multiple agent types and domains.

What does that mean concretely? Swapping to a stronger reflector model improved optimization results more than any other single change we could make. The agent could be running gpt-4o or gpt-4.1-mini. It didn’t matter as much as having a reflector that could clearly reason about why something went wrong and what to change about it.

And here’s the implication for how you invest: The meta-cognition layer, the ability to reason about failures, matters more than anything else. Better diagnosis beats better execution. If you’re going to invest in one capability, invest in the quality of your failure analysis.

The engineering behind the reflector (how it reads traces, generates hypotheses, and avoids local optima) is its own story.

The travel-approver: A concrete run

Let’s go back to our earlier travel-approver agent example. Here’s what one optimization run might produce:

The winning candidate was a system-prompt rewrite. Same model, same tools, same skills. Just a better instruction. The optimizer added an explicit cost-threshold rule and an escalation ladder that the baseline lacked.

The $4,800 trip that started this story? The optimized agent calls the budget check, sees the amount exceeds the $3,000 threshold, and routes to VP review. Same scenario, different outcome. The instruction now encodes the constraint explicitly.

When to use this loop, and when to skip it

This loop works better in specific situations. Here’s how to know if it fits yours.
It’s a good fit when:

- You have an agent in production with traces and evaluation data
- Quality issues are cross-cutting: Fixing one thing breaks others
- You’re operating at scale, across multiple agents or ongoing iteration cycles
- The failure mode is at the configuration level: instructions, skills, tool definitions, model selection

It’s probably not the right tool when:

- Your agent is still in early development and you haven’t earned enough traces yet (manual approaches like prompt engineering are still a good path forward)
- The problem is infrastructure: context window too small, tools return bad data, latency
- You have one agent with one failure mode; in that case, just fix it manually
- The task is reasoning-bound (competition math, deep logic chains); here, you need a model upgrade, not instruction optimization

Key takeaways

Here are the four things we’d carry to any system doing this kind of work:

- Quality is a search problem, not a debugging problem. Define what good looks like, search the configuration space, and rank what works. Stop trying to fix one case at a time.
- Invest in diagnosis. The reflector (the model that reasons about why things went wrong) has more impact than any other single lever. Better failure analysis beats better execution.
- Evaluators are the ceiling. Your optimization is only as good as your quality definition. Start with generated approximations, refine with real data. The first version is never the last.
- Keep the human in the loop. The optimizer proposes; the developer decides. Automation without oversight compounds errors.

How we built this in Microsoft Foundry

We packaged this loop into Agent Optimizer (https://aka.ms/agent-optimizer-foundry-quickstart) inside of Foundry Agent Service (https://azure.microsoft.com/en-us/products/ai-foundry/agent-service), available today through the azd CLI.
Here’s what the travel-approver run looks like from your terminal:

    azd ai agent eval init    # generate dataset & evaluator from a one-paragraph description
    azd ai agent eval run     # score the current version (baseline)
    azd ai agent optimize     # search over candidates
    azd ai agent optimize apply --candidate
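
To make the promotion rule from the optimization loop concrete (score every candidate on the same dimensions as the baseline, and only promote when nothing regresses), here is a minimal Python sketch. It is not how Agent Optimizer is implemented: the scores, the candidate names, the `tolerance` parameter, and the `regressions` and `pick_winner` helpers are all hypothetical, and ranking the surviving candidates by mean score is an assumption made purely for illustration.

```python
from statistics import mean

# Hypothetical per-dimension scores in [0, 1]; the numbers are invented for illustration.
baseline = {
    "policy_compliance": 0.70,
    "cost_awareness": 0.55,
    "escalation_accuracy": 0.80,
}

# Hypothetical candidates, each scored on the same dimensions as the baseline.
candidates = {
    "cost-threshold-rule": {
        "policy_compliance": 0.75,
        "cost_awareness": 0.90,
        "escalation_accuracy": 0.60,
    },
    "threshold-plus-escalation-ladder": {
        "policy_compliance": 0.78,
        "cost_awareness": 0.88,
        "escalation_accuracy": 0.82,
    },
}


def regressions(baseline, scores, tolerance=0.0):
    """Return the dimensions where a candidate scores below the baseline."""
    return [dim for dim, base in baseline.items() if scores[dim] < base - tolerance]


def pick_winner(baseline, candidates, tolerance=0.0):
    """Return the highest-mean candidate that regresses on no dimension, or None."""
    safe = {
        name: scores
        for name, scores in candidates.items()
        if not regressions(baseline, scores, tolerance)
    }
    if not safe:
        return None  # nothing is safe to promote; keep the baseline
    return max(safe, key=lambda name: mean(safe[name].values()))


winner = pick_winner(baseline, candidates)
print(winner)  # prints "threshold-plus-escalation-ladder"
```

In this made-up run, the first candidate fixes cost-awareness but regresses escalation accuracy, mirroring the emergency-travel regression described earlier, so it is filtered out before ranking. A developer would still review the surviving candidate before anything ships.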