Functional tests, retrieval tests, and safety checks all passed. Full autonomy still hadn’t been earned.
I had an Azure AI agent that passed every test I wrote for it. Retrieval answers were grounded in the right documents. Function calls returned valid payloads. The test prompts came back handled correctly. The logs were clean. On paper, it was ready to run on its own.
I still refused to let it execute anything without a human signing off first.
That decision had nothing to do with not trusting the model. I’d spent weeks watching it behave well. The reason was older than any of the tooling involved: in an enterprise, a system that takes action is accountable for that action, and accountability is something you design before you switch autonomy on, not after. A green test run tells you the agent can do the work. It does not tell you the business is ready to hand over the keys.
This is a write-up of how I built the agent on Azure, the tests it passed, and the engineering case for keeping a person in the loop anyway. The scenario is fictional but close to the kind of thing many internal platform teams deal with. No real systems, customers, or numbers.
Picture an internal data platform. People constantly request access to things: a restricted Power BI workspace, a finance dataset, a higher permission level on a reporting tool. Every request lands in a queue, and someone on a review team has to read it, find the relevant access policy and the data classification standard, check whether the requester’s role qualifies, and then decide. Approve, reject, or ask for more information.
The manual version of this has the problems you’d expect. Reviews are slow because the right policy is buried in a wiki somewhere. Two reviewers reach different conclusions on near-identical requests. The audit trail is whatever someone remembered to type into a ticket. And the people doing the reviewing are senior, which means they’re expensive and they’d rather be doing almost anything else.
So the agent had a clear job. Read the incoming request. Pull the right policy and classification documents. Summarize what’s being asked. Flag anything missing, like a blank business justification or no listed manager. Recommend the next action. Draft the response. Prepare the actual workflow update, the API call that would grant the access or open an approval task. Then route the meaningful step to a human.
Notice the shape of that list. Everything up to “prepare the action” is analysis. The last step is the only one that changes the state of a real system. That split turned out to be the whole design.
The stack is ordinary Azure, wired with intent. A web app or a Microsoft Teams interface takes the request. Azure API Management sits in front as the entry point, handling auth and throttling. Behind it, an orchestrator coordinates the agent: it calls Azure OpenAI (or the Azure AI Foundry agent service) for reasoning, Azure AI Search for retrieval over a knowledge base stored in Blob Storage or ADLS Gen2, and Azure Functions for any tool calls into the business system API.
The part that matters is what happens after the agent produces a recommendation. Instead of letting it call a write API, the orchestrator drops the proposed action into a human approval queue. Azure Logic Apps drives the approval workflow, posting an Adaptive Card into Teams for a reviewer. Only after approval does a Function execute the prepared action against the business API. Every component emits to Azure Monitor and Application Insights, and every decision lands in an append-only audit store. Microsoft Entra ID handles identity, and Key Vault holds secrets, referenced through managed identity so nothing sensitive lives in code.
[Diagram 1: Azure AI Agent With Human Approval Architecture]
The one-sentence read of this diagram: the agent can think and look things up freely, but the only path to changing a real system runs through a human and a logged execution step.
Within that frame, the agent was capable. It classified each request into a type, retrieved the matching policy and classification documents through Azure AI Search, summarized the relevant context, and detected missing fields before they became someone else’s problem. It recommended a next best action and drafted a response in the reviewer’s voice. It called read-only APIs to enrich its view: who owns this dataset, what’s its classification, is the requester already in the right group.
What it could not do was act. Any write (granting access, changing a role, modifying a record) was only ever prepared. The agent would build the exact payload it thought should run, attach its reasoning, and send the whole package to the approval queue. The write tools were never bound to the model. You can’t accidentally fire a tool that isn’t in the toolset.
That sounds like a small distinction. It’s the difference between an assistant and an actor, and it’s the difference between an incident you can review and one you have to explain.
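Here is a minimal sketch of that separation in plain Python. The tool names, the `Tool` dataclass, and the `prepare_action` helper are all hypothetical and only illustrate the idea: write tools can exist in the codebase, but the set handed to the model is filtered so they are never reachable from it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class Tool:
    name: str
    handler: Callable[[dict], dict]
    writes: bool  # True if the tool changes state in a real system


def lookup_dataset_owner(args: dict) -> dict:
    # Hypothetical read-only lookup; a real one would call the business API.
    return {"dataset": args["dataset"], "owner": "data_owner"}


def grant_access(args: dict) -> dict:
    # Hypothetical write tool. It is registered below but never bound to the model.
    raise RuntimeError("write tools run only in the post-approval executor")


ALL_TOOLS = [
    Tool("lookup_dataset_owner", lookup_dataset_owner, writes=False),
    Tool("grant_access", grant_access, writes=True),
]


def bind_read_only(tools: Iterable[Tool]) -> Dict[str, Tool]:
    """Return the toolset the model is allowed to see: read-only tools only."""
    return {t.name: t for t in tools if not t.writes}


READ_ONLY_TOOLS = bind_read_only(ALL_TOOLS)


def call_tool(name: str, args: dict) -> dict:
    """Dispatch a model-requested tool call; anything not bound fails loudly."""
    if name not in READ_ONLY_TOOLS:
        raise KeyError(f"tool not bound: {name}")
    return READ_ONLY_TOOLS[name].handler(args)


def prepare_action(action: str, args: dict, reasoning: str) -> dict:
    """Build the package for the approval queue. Nothing is executed here."""
    return {
        "proposed_action_payload": {"action": action, **args},
        "reasoning": reasoning,
        "approval_status": "pending",
    }
```

If the model asks for `grant_access`, the dispatcher raises instead of acting, and the only way that action reaches a real system is as a pending package.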
I’m a fan of testing AI systems like you’d test any other production software, plus a few categories that only matter for agents. Here’s what the suite covered.
Functional tests. Did it classify the request correctly and attach the right routing label? Did the summary come out in the expected format with the right fields extracted? A wrong classification quietly sends a case down the wrong policy path, so this got the most cases.
Retrieval tests. Did it pull the correct documents, answer only from approved sources, and cite the source IDs it used? When the context genuinely didn’t cover a request, did it refuse and route to a human instead of filling the gap with a confident guess? Ungrounded answers are the dangerous kind because they read exactly like grounded ones.
Tool and function tests. Did it call the right Azure Function with a valid JSON payload? Did it stay away from write APIs without approval? When an API returned an error, did it handle the failure cleanly rather than retrying into a mess?
Safety and policy tests. Did it keep sensitive fields out of its output? Did it reject unauthorized requests, avoid recommendations the policy didn’t support, and escalate genuinely ambiguous cases instead of picking a side?
Regression tests. I kept a library of past scenarios, around forty recorded cases, and replayed them after every prompt change and every document update to compare expected against actual output. Small prompt edits move behavior in ways you do not predict, and a document update can quietly change what retrieval surfaces.
Observability tests. Did each run log the prompt version, the retrieved document IDs, the tool calls, the eventual approval decision, the latency, and any failure reason? If you can’t see what the agent did, you can’t review it and you can’t improve it.
| Test area | What I validated | Why it mattered |
|---|---|---|
| Classification | Correct request type and routing label | A wrong label routes the case to the wrong policy |
| Retrieval | Grounded answers, cited source IDs, refusal on gaps | A confident ungrounded answer is the worst kind |
| Tool calls | Right function, valid JSON, no unapproved writes | A bad payload to a write API is the expensive failure |
| Safety and policy | No sensitive fields exposed, ambiguous cases escalated | Technically correct is not the same as allowed |
| Regression | Replayed past cases after prompt and document changes | Small edits shift behavior in ways you don’t expect |
| Observability | Logged prompt version, sources, tools, decision, latency | What you can’t see, you can’t review or fix |
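To make the regression and retrieval rows concrete, here is a rough sketch of a replay harness. It is not the suite I ran. The case layout (`request`, `expected`, `retrieved_ids`), the `run_agent` callable, and the choice of compared fields are assumptions made for illustration.

```python
from typing import Callable, Iterable

# Fields compared on replay. Free-text fields such as "summary" are left out
# because wording can vary between runs without the decision changing.
COMPARED_FIELDS = ("classification", "risk_level", "approval_required")


def check_case(expected: dict, actual: dict, retrieved_ids: set) -> list:
    """Return a list of human-readable failures for one recorded case."""
    failures = []
    for field in COMPARED_FIELDS:
        if actual.get(field) != expected.get(field):
            failures.append(
                f"{field}: expected {expected.get(field)!r}, got {actual.get(field)!r}"
            )
    # Grounding check: every cited source must be one retrieval actually returned.
    cited = set(actual.get("retrieved_sources", []))
    unknown = cited - retrieved_ids
    if unknown:
        failures.append(f"cited sources not in retrieval: {sorted(unknown)}")
    return failures


def replay(cases: Iterable[dict], run_agent: Callable[[dict], dict]) -> dict:
    """Replay recorded cases through run_agent; map request_id to its failures."""
    report = {}
    for case in cases:
        actual = run_agent(case["request"])
        failures = check_case(case["expected"], actual, set(case["retrieved_ids"]))
        if failures:
            report[case["request"]["request_id"]] = failures
    return report
```

An empty report means every recorded case still produces the same decision and cites only sources that retrieval surfaced.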
It passed all of it. Consistently. So why not let it run?
Tests check the situations you thought of. That’s their whole nature and their whole limit. I wrote cases for the requests I’d seen and the failure modes I could imagine. Production is a stream of inputs nobody imagined, written by people in a hurry, about edge cases that don’t appear in any document.
Access review is also an interpretation problem, not a lookup problem. A request can be perfectly grounded in the right policy, formatted exactly to contract, and still wrong about intent, because the policy needs judgment that the text alone doesn’t carry. “Access to the quarterly numbers” means one thing in week one and something very different the day before earnings. A model that’s right on the words can still be wrong on the situation.
Then there’s the asymmetry of cost, which is the part that actually settles the argument. A delayed approval costs a reviewer a few minutes. A wrong access grant on a restricted finance dataset costs an incident, a containment effort, a review meeting, and possibly a conversation with someone in compliance who is not having a good day. When the downside is that lopsided, you do not trade minutes for that risk on the strength of a test suite.
Autonomy is earned with operational evidence, not assumed from a passing run. The agent had proven it could do the task. It had not yet proven, in production, over time, against the long tail, that it could be trusted to do the task alone. Those are different claims, and only one of them was supported.
The agent passed my tests, but the business process hadn’t earned full automation yet.
So the agent stops one move short of acting. For anything above low risk, it produces a complete decision package and hands it to a person.
The package the reviewer sees:
The reviewer can approve it as-is, reject it with a reason, edit the recommendation before approving, send it back for more information, or escalate to another team. The point is that a human makes the call with the agent’s full reasoning in front of them, not a black-box yes.
After approval, the machine takes over again. Azure Logic Apps triggers the final workflow. An Azure Function executes the approved payload against the business API. And the audit store records who approved it, when, what changed, and what the execution returned. The agent recommends; a person decides; the system executes and remembers. Four roles, cleanly separated.
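The reviewer's choices can be sketched as a small, explicit state change on the decision package. The status strings other than `approved`, the `reviewed_by` field, and the function name are assumptions for this example. The two rules it encodes come from the flow above: only an approval (as-is or edited) makes a package executable, and a rejection has to carry a reason.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    APPROVE = "approve"
    REJECT = "reject"
    EDIT = "edit"            # edit the recommendation, then approve
    MORE_INFO = "more_info"  # send back to the requester
    ESCALATE = "escalate"    # hand to another team


def apply_decision(package: dict, decision: Decision, reviewer: str,
                   reason: Optional[str] = None,
                   edited_payload: Optional[dict] = None) -> dict:
    """Return a copy of the package updated with the reviewer's decision."""
    if not reviewer:
        raise ValueError("a named reviewer is required")
    updated = dict(package, reviewed_by=reviewer)
    if decision is Decision.APPROVE:
        updated.update(approval_status="approved", approved_by=reviewer)
    elif decision is Decision.EDIT:
        if edited_payload is None:
            raise ValueError("an edit needs the edited payload")
        updated.update(approval_status="approved", approved_by=reviewer,
                       proposed_action_payload=edited_payload)
    elif decision is Decision.REJECT:
        if not reason:
            raise ValueError("a rejection needs a reason")
        updated.update(approval_status="rejected", reason=reason)
    elif decision is Decision.MORE_INFO:
        updated.update(approval_status="needs_info", reason=reason)
    else:  # Decision.ESCALATE
        updated.update(approval_status="escalated", reason=reason)
    return updated
```

The original package is never mutated, so the audit record can store both what the agent proposed and what the reviewer actually approved.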
[Diagram 2: Human-in-the-Loop Approval Flow]
Not every action deserves the same friction. Forcing human approval on a summary is theater, and it trains reviewers to rubber-stamp, which is worse than no review. The trick is matching the control to the risk.
I sort actions into three buckets.
Low risk is anything reversible and contained: summarizing a request, drafting a reply, a read-only lookup, tagging or categorizing, suggesting a next step. If it’s wrong, you notice and you fix it, and nothing in a real system changed. These can be automated early, with logging and the occasional spot check.
Medium risk touches a system but stays inside the building: updating a status, routing to another team, creating a task, sending a customer-facing draft. Wrong here is annoying and visible but rarely a disaster. Sampling-based review fits well: approve some percentage by hand and let the rest flow once you trust the pattern.
High risk is the irreversible or externally-felt set: granting access, changing a role, a policy exception, a financial decision, anything compliance cares about. This is where explicit human approval stays until you have strong operational evidence, and honestly, where some of it stays permanently because the cost of being wrong never drops.
| Action type | Example | Automation level | Approval needed |
|---|---|---|---|
| Low risk | Summarize, draft a reply, read-only lookup, tag a case | Automate first | No, just log and sample |
| Medium risk | Update status, route to a team, create a task, send a draft | Automate with sampling | Sometimes |
| High risk | Grant access, change a role, policy exception, financial or compliance action | Keep manual | Yes, until evidence says otherwise |
The agent assigns the risk level itself as part of its output, and the orchestrator routes on it. The default, when the agent is unsure, is always to escalate. Uncertainty resolves toward a human, never toward action.
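A small sketch of that routing rule, with the fail-safe default made explicit. The function names and the `medium_sample_rate` parameter are assumptions for this example. A rate of 1.0 sends every medium-risk action to a reviewer, which matches the approval stage this project sat in, and lowering it is how sampling-based review could be introduced later.

```python
import hashlib


def in_review_sample(request_id: str, sample_rate: float) -> bool:
    """Deterministically pick a fraction of requests for human review."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return bucket < sample_rate


def route(result: dict, medium_sample_rate: float = 1.0) -> str:
    """Return "automate" or "approval"; anything unclear resolves to approval."""
    risk = result.get("risk_level")
    confident = result.get("confidence_band") == "high"
    if risk == "low" and confident:
        return "automate"
    if risk == "medium" and confident:
        sampled = in_review_sample(result.get("request_id", ""), medium_sample_rate)
        return "approval" if sampled else "automate"
    # High risk, low confidence, or a missing or unrecognized value.
    return "approval"
```

Hashing the request ID keeps the sample stable across retries, so the same request does not flip between paths if it is processed twice.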
A few implementation details that carried the design, with code kept short and clearly simplified. None of this is production-complete, and there are no secrets in it.
The system prompt does the heavy lifting on boundaries. It tells the model to use only the provided sources, return a single JSON object, never call write tools, route high-risk actions to approval, cite source IDs, and refuse when the context doesn’t cover the request.
    You are an access-request reviewer assistant for an internal platform.

    Rules:
    - Use ONLY the sources provided in this request. No outside knowledge.
    - If the sources do not cover the request, set "approval_required": true and recommend routing to a human. Do not guess.
    - Never call write or update tools. You may only read and prepare actions.
    - Cite the source IDs you used in "retrieved_sources".
    - Classify the request and assign "risk_level" of low, medium, or high.
    - Any access grant, role change, or policy exception is high risk.
    - Respond with a single JSON object matching the output contract. No prose.
Retrieval is permission-aware. This is the part teams skip and regret. The Azure AI Search index carries metadata about who each document applies to, and the query filters on the requester’s identity and roles, resolved from Entra ID. The model never sees a document the requesting user shouldn’t surface. RAG without a permission filter is a data leak waiting for the right prompt.
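One way to express that filter is an OData expression using Azure AI Search's `search.in` function over a collection field. The sketch below only builds the filter string. The field name `allowed_roles`, the `|` delimiter, and the helper name are assumptions, and the roles would come from the requester's resolved Entra ID groups.

```python
from typing import Iterable, Optional


def build_role_filter(roles: Iterable[str],
                      field: str = "allowed_roles") -> Optional[str]:
    """Build an OData filter matching documents tagged with any of the roles.

    Returns None when there are no usable roles. The caller should then skip
    the query and route to a human rather than search unfiltered.
    """
    cleaned = sorted({r.strip() for r in roles if r and r.strip()})
    if not cleaned:
        return None
    for role in cleaned:
        if "|" in role:
            raise ValueError(f"role name contains the delimiter: {role!r}")
    # Inside an OData string literal, a single quote is escaped by doubling it.
    value_list = "|".join(cleaned).replace("'", "''")
    return f"{field}/any(r: search.in(r, '{value_list}', '|'))"
```

The resulting string would be passed as the `filter` argument of the search call, alongside the query text, so trimming happens in the index rather than in the prompt.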
The orchestrator ties it together. Authenticate, retrieve allowed context, generate, validate the output against the contract, check the risk, and route. If anything is off (the JSON doesn’t validate, the context is empty, the confidence is low), it bails to a human rather than improvising.
    # Simplified: search, model, and the routing helpers are thin wrappers
    # around the Azure services, not SDK calls.
    def handle_request(req, user):
        context = search.retrieve(            # Azure AI Search
            query=req.text,
            allowed_for=user.roles,           # permission filter on the index
        )
        if not context.documents:
            return route_to_human(req, reason="no_grounding")

        result = model.generate(              # Azure OpenAI / AI Foundry
            system=SYSTEM_PROMPT,
            request=req,
            sources=context.documents,
            tools=READ_ONLY_TOOLS,            # write tools are never bound
        )

        if not is_valid(result, schema=OUTPUT_CONTRACT):
            return route_to_human(req, reason="schema_invalid")

        if result.risk_level == "low" and result.confidence_band == "high":
            return automate(result)           # draft / read-only path
        return create_approval(result)        # medium and high go to a human
The agent’s output follows a strict JSON contract, which makes it cheap to validate, route, and audit. Anything that doesn’t match the schema is treated as a failure, not a maybe.
    {
      "request_id": "REQ-10045",
      "classification": "access_review",
      "summary": "User requested access to a restricted reporting workspace.",
      "retrieved_sources": ["policy-access-002", "workspace-standard-004"],
      "confidence_band": "high",
      "risk_level": "high",
      "approval_required": true,
      "recommended_action": "Route to data owner for approval.",
      "proposed_action_payload": {
        "route_to": "data_owner",
        "reason": "Restricted workspace access requires owner approval."
      },
      "reviewer_options": ["approve", "reject", "revise", "escalate"]
    }
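A contract like this can be checked with nothing but the standard library. The sketch below is one possible validator, not the one from the project. The field types are read off the example above, the allowed values for `confidence_band` are an assumption, and the rule that high-risk output must require approval follows from the system prompt.

```python
import json

# Field names and types taken from the example output above.
REQUIRED_FIELDS = {
    "request_id": str,
    "classification": str,
    "summary": str,
    "retrieved_sources": list,
    "confidence_band": str,
    "risk_level": str,
    "approval_required": bool,
    "recommended_action": str,
    "proposed_action_payload": dict,
    "reviewer_options": list,
}

ALLOWED_VALUES = {
    "risk_level": {"low", "medium", "high"},
    "confidence_band": {"low", "medium", "high"},  # assumed value set
}


def contract_errors(raw: str) -> list:
    """Return every contract violation in the model's raw output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top level must be a JSON object"]

    errors = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            errors.append(f"missing field: {name}")
        elif not isinstance(data[name], expected_type):
            errors.append(f"wrong type for {name}: expected {expected_type.__name__}")
    for name, allowed in ALLOWED_VALUES.items():
        value = data.get(name)
        if isinstance(value, str) and value not in allowed:
            errors.append(f"unexpected value for {name}: {value!r}")
    if data.get("risk_level") == "high" and data.get("approval_required") is False:
        errors.append("high-risk output must set approval_required to true")
    return errors
```

An empty list means the output is safe to route. Anything else goes to a human with the error list attached as the reason.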
Execution is its own gate. The Azure Function that runs the action checks one thing first: did a human approve this? If the status isn’t approved, it does nothing and says so. Then it executes the prepared payload and writes the audit record. No secrets in code, just environment variables and managed identity.
    import json
    import logging
    import os

    import azure.functions as func

    # `audit` (the append-only audit client) and `call_business_api` are
    # placeholders that are not defined in this snippet.


    def main(msg: func.QueueMessage) -> None:
        decision = json.loads(msg.get_body())

        # Gate: do nothing unless a human approved this decision.
        if decision.get("approval_status") != "approved":
            logging.info("Skipping %s, status=%s",
                         decision["request_id"], decision.get("approval_status"))
            return

        payload = decision["proposed_action_payload"]
        api_base = os.environ["BUSINESS_API_BASE"]   # placeholder, no secrets

        # Hypothetical helper standing in for the call to the business API.
        result = call_business_api(api_base, payload)

        audit.write({                                # append-only audit store
            "request_id": decision["request_id"],
            "approved_by": decision["approved_by"],
            "approved_at": decision["approved_at"],
            "executed": payload,
            "outcome": result.status,
        })
And you watch all of it. Each run emits custom events to Application Insights, which means a KQL query gives you the operational picture: how many requests, how many approvals versus rejections, failed tool calls, latency, and how much of the traffic is high risk. This is the data that eventually tells you whether a workflow has earned more autonomy.
    customEvents
    | where timestamp > ago(7d)
    | where name in ("AgentRequest", "AgentApproval", "ToolCall")
    | summarize
        requests       = countif(name == "AgentRequest"),
        approved       = countif(tostring(customDimensions.decision) == "approved"),
        rejected       = countif(tostring(customDimensions.decision) == "rejected"),
        failed_tools   = countif(name == "ToolCall" and tostring(customDimensions.status) == "failed"),
        high_risk      = countif(tostring(customDimensions.risk_level) == "high"),
        p95_latency_ms = percentile(toint(customDimensions.latency_ms), 95)
      by bin(timestamp, 1d)
    | order by timestamp asc
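That query only works if each event carries the dimensions it filters on. Here is a rough sketch of shaping those events on the application side, using the standard library only. The helper names and the `request_id` and `prompt_version` keys are assumptions, and the exporter that ships these records to Application Insights is deliberately left out.

```python
import json
import logging

# Event names match the ones the KQL query filters on.
EVENT_NAMES = {"AgentRequest", "AgentApproval", "ToolCall"}


def build_event(name: str, request_id: str, prompt_version: str, **dimensions) -> dict:
    """Shape one telemetry event; dimensions with a value of None are dropped."""
    if name not in EVENT_NAMES:
        raise ValueError(f"unknown event name: {name}")
    dims = {"request_id": request_id, "prompt_version": prompt_version}
    dims.update({k: v for k, v in dimensions.items() if v is not None})
    return {"name": name, "customDimensions": dims}


def emit(event: dict, logger: logging.Logger = logging.getLogger("agent.telemetry")) -> str:
    """Log the event as one JSON line and return that line."""
    line = json.dumps(event, sort_keys=True)
    logger.info(line)
    return line
```

A failed tool call, for example, would be built with `build_event("ToolCall", "REQ-10045", "v7", status="failed", latency_ms=120)`, where the prompt version `v7` is made up for the example.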
People treat responsible AI as a policy document. In practice it lives in architecture decisions, and this design happens to map cleanly onto Microsoft’s Responsible AI principles without anyone trying to make it.
Accountability is the named approver in every audit record. When an action runs, a person owns it, and you can point to them. Transparency is the reasoning package: the reviewer sees the sources and the logic, not a verdict. Reliability and safety is the fail-safe routing, the refusal on missing context, the agent that escalates when unsure instead of guessing. Privacy and security is the permission-aware retrieval, the least-privilege access, Key Vault, and managed identity, with sensitive fields kept out of model output. Fairness gets a real boost from consistency, because the same policy is applied the same way every time, which is more than you can say for a tired human reviewer at 5pm on a Friday. And there’s a feedback loop: every reject and every edit is signal about where the agent is weak, which feeds the next round of prompt, retrieval, and policy improvements. Microsoft Purview slots in here too when you need data governance and lineage across the documents the agent reads.
The separation between recommendation and execution is the spine of all of it. Keep those two apart and most of responsible AI follows. Collapse them and you’re writing incident reports.
Here’s the thing I’d most want a team to take away. Autonomy isn’t a switch, it’s a ladder, and you climb it as the evidence accumulates.
I think about it in stages. Assistant mode: the agent summarizes and recommends, the human does everything. Draft mode: the agent prepares the reply or payload, the human reviews and submits. Approval mode is where this project sat: the agent prepares the action, a human approves, the system executes. Sampled automation: the agent executes low-risk actions and humans review a sample. Conditional autonomy: the agent executes on its own, but only inside defined thresholds, and anything outside escalates. Narrow full autonomy is the top, and you only reach it for a single well-understood workflow, after enough monitoring, controls, and rollback capability that you’d bet on it.
| Stage | Agent role | Human role | Best fit |
|---|---|---|---|
| 1. Assistant | Summarizes and recommends | Does everything | New workflows, high uncertainty |
| 2. Draft | Prepares the reply or payload | Reviews and submits | Repetitive drafting with review |
| 3. Approval | Prepares the action | Approves, then system executes | Medium and high-risk actions |
| 4. Sampled automation | Executes low-risk actions | Reviews a sample | Low-risk, high-volume work |
| 5. Conditional autonomy | Executes within set thresholds | Handles exceptions | Well-understood actions with clear limits |
| 6. Narrow full autonomy | Executes end to end | Monitors and audits | One narrow workflow with strong rollback |
What I learned over the build comes down to a few things. A passing test suite is not the same as production trust. Agents need boundaries the way any powerful system does. Human approval is not a weakness or a temporary crutch, it’s a control plane, and a good one is the reason you can move faster later. Trust grows through logs, review, feedback, and measured autonomy, in that order. And the safest architecture is not always the slowest one, because most of what the agent does is low risk and can run freely. You’re only gating the small set of moves that can actually hurt.
[Diagram 3: Progressive Autonomy Model]
The failure patterns here are consistent, and most of them come from measuring the wrong thing or moving too fast.
Measuring only answer accuracy and calling it done. Giving the agent write access before it’s earned it. Running RAG without permission filtering, which is how a helpful agent becomes a data exposure. Not logging the retrieved sources, so you can never reconstruct why it said what it said. Not versioning prompts, so a regression has no baseline to compare against. Mixing read tools and write tools in the same toolset instead of keeping them apart. Treating approval UX as an afterthought, which guarantees reviewers stop reading and start clicking. Forgetting rollback and exception handling until the day you need them. And ignoring reviewer feedback, which is the richest training signal you have and the cheapest to collect.
If you’re building something like this on Azure, here’s the gate I’d put in front of any action.
If you can’t check most of these, the agent isn’t ready to act on its own. It might be a fine assistant in the meantime.
The goal was never to slow AI down. The goal is to let it move at the right speed for the risk involved. An Azure AI agent that passes every test is probably ready to assist, recommend, and prepare actions, and that’s a lot of value on its own. Being ready to act alone is a separate, higher bar, and you clear it with operational evidence, not a green test run. Human approval is the practical bridge between the two: it lets a team ship the agent now, watch it earn trust, and open up autonomy deliberately, one workflow and one stage at a time.
“I Built an Azure AI Agent That Passed Every Test. Here’s Why I Still Added a Human Approval Step.” was originally published in Towards AI on Medium.