When Azure AI Studio became Azure AI Foundry in late 2024, I thought everyone I talked to believed it was a case of simply rebranding an existing product with more compelling messaging.
But three months of using Foundry in a production environment for a BFSI reconciliation project proved otherwise. Foundry represents a platform change — a surface for managing models, hosting agents, orchestrating activities, evaluating outputs, securing everything, and connecting with enterprise data sources which used to require different services and integration through separate SDKs.
Nonetheless, many people do not get it. Architects hoping for a full AI platform get a great serving platform with a need for additional architecture — validation logic, governance-grade auditing, human intervention processes — to make it production-ready. In this article, I will describe our architecture and what each Foundry component does.
Azure AI Foundry is not a single tool. It is a platform of ten distinct capabilities. Most teams use two or three of them. The ones getting the most value know all ten — and have made deliberate choices about which to use for each job in their architecture.
Before covering individual components, the organisational structure matters.
One of the biggest mistakes teams make when adopting Azure AI Foundry is treating everything as a single workspace.
That works for experimentation. It breaks in enterprise environments.
The better mental model is: Azure AI Foundry has two levels:
A Hub acts as the centralized governance and infrastructure boundary.
This is where platform teams establish shared capabilities that multiple AI applications can consume.
The top-level resource. Shared compute, shared connections to external data sources, shared security and governance settings. One Hub serves multiple teams and use cases. Think of it as the enterprise AI platform layer — set up once by the platform team, consumed by multiple product teams.
When we first implemented Azure AI Foundry in BFSI reconciliation, the Hub quickly revealed itself as more than a workspace container. It became the governance nucleus — the place where compliance, audit logging, and shared model access are enforced.
Unlike Projects, which are execution sandboxes, the Hub is the control plane:
In practice, this meant our reconciliation and KYC agents could run in separate Projects but still inherit common guardrails from the Hub — ensuring RBI/PMLA compliance without duplicating controls.
A Project represents an individual AI initiative or application. This structure matters for enterprise architects because it maps directly to how organisations want to govern AI: centralised platform capabilities, decentralised product development. The Hub owner controls what data connections and compute tiers are available. The project team controls how they use them.
Projects inherit platform capabilities from the Hub while keeping implementation isolated.
Examples:
Project: KYC Copilot
Owns:
[Your model deployment layer] The Model Catalogue is not a list of available models. It is the contract between AI innovation and enterprise governance.
Applications should never depend on a model. They should depend on capabilities delivered through the platform.
The model catalogue gives access to hundreds of foundation models from OpenAI, Meta (Llama), Mistral, Cohere, Microsoft Phi, and others — all deployable through a single interface without managing separate API subscriptions or infrastructure for each provider. Models are organised by task type, benchmark scores, and licensing terms, making it practical to evaluate multiple models against the same use case before committing to a deployment.
What makes it architecturally valuable is the managed endpoint layer on top. We deploy a model, configure compute, set a version label, and get an endpoint URL. When a new model version is available, we deploy it alongside the existing one, split traffic — 10% to the new version, 90% to the existing — monitor performance metrics for both, and promote or roll back with a configuration change rather than a code deployment.
In a project context, this matters for regulatory traceability as much as operational convenience. Every model serving credit decisions, fraud scores, or KYC verifications needs a deployment record: which model version, deployed when, serving what traffic percentage, against what performance baseline. The model catalogue provides this as built-in infrastructure.
Model catalogue endpoints add approximately 5–10ms latency versus calling Azure OpenAI directly. For most enterprise workloads this is immaterial. For sub-50ms latency requirements — real-time fraud scoring at payment terminals, pricing decisions in the POS critical path — route latency-critical traffic directly to Azure OpenAI endpoints and use the catalogue for model management and evaluation only.
[Orchestration for linear and evaluation workflows]
In production systems, a prompt is not just a piece of text but becomes something much bigger.
It becomes:
Prompt Flow is best understood as: A workflow engine for AI interactions
Instead of writing one large prompt and sending it to a model, you define a sequence of connected steps.
Each step can:
Prompt Flow is Foundry’s workflow orchestration framework — a way to define chains of LLM calls, tool invocations, data retrievals, and conditional logic as a repeatable, testable, deployable flow. It supports both a visual graph editor (drag-and-drop nodes in the Foundry portal) and a code-based YAML definition that can be version-controlled and deployed via CI/CD.
Where Prompt Flow genuinely excels is in two specific jobs: linear document processing workflows and evaluation pipelines. A KYC flow that extracts fields from an identity document, validates against a schema, checks sanctions lists, and returns a structured verification result is a natural Prompt Flow use case — sequential, deterministic, testable against a labelled dataset. An evaluation pipeline that runs groundedness, relevance, and safety checks against every model update before deployment is where Prompt Flow delivers its highest production value.
[Managed agent hosting and execution] Enterprise systems require more than AI demos.
They need AI that can:
Azure AI Agent Service provides the managed infrastructure for running AI agents — handling conversation thread lifecycle, tool registration, function-calling protocol, and agent state without requiring you to build and maintain that infrastructure yourself. An agent is defined by its model, its instructions, and its tool catalogue. The service manages the reasoning loop: the agent reasons, selects a tool, the service invokes it and returns the result, the agent reasons again toward the goal.
The Agent Service eliminates significant boilerplate from agentic system development. Thread management — maintaining the message history and working context for each agent run — is handled automatically. Tool registration uses a standard function definition schema that the service translates into the model’s function-calling format. Retry logic for transient tool failures is configurable at the service level rather than implemented in every tool.
The Agent Service manages thread state in its own store. For regulated financial systems where the agent’s full reasoning chain must be part of the governance audit trail, mirror the thread history to your own Cosmos DB instance with append-only access policies. The Agent Service store is operational — it is not a governance record.
[Automated quality gates for every deployment]
Evaluation is the mechanism that converts subjective AI behavior into measurable engineering quality.
Unlike traditional software, AI systems rarely fail with exceptions.
They fail with:
The purpose is to continuously measure, validate, and improve the performance, safety, and compliance of your enterprise agents.
Core Components:
2. Graders
3. Drift Detection
4. Evaluation Pipelines
5. Governance Dashboard
The evaluation framework is the most underused capability in Azure AI Foundry and the one that makes the most difference to production quality. It provides built-in evaluators — groundedness, relevance, coherence, fluency, safety — and supports custom evaluators implemented as Python functions. The key design is that evaluation runs as a pipeline: define your test dataset, run all evaluators against it, check scores against thresholds, and gate deployment on the results.
Built-in evaluators cover the baseline quality dimensions. Groundedness checks whether the model’s output reflects the retrieved context or is hallucinating content not present in the source. Relevance checks whether the output answers the actual query. Safety checks for harmful, biased, or policy-violating content using Azure Content Safety under the hood.
Custom evaluators are where regulated industries get specific value. For our system, a custom evaluator checks whether the KYC decision output cites the regulatory basis — RBI KYC Master Direction section, PMLA reference — for its conclusion. For a retail pricing system, a custom evaluator checks whether the pricing explanation includes the constraint that was applied and the SHAP features that drove the recommendation.
Integrate the evaluation pipeline into your Azure DevOps or GitHub Actions CI/CD pipeline. Every model version update, every prompt change, every new tool registration triggers an evaluation run. Deployment is blocked if any evaluator falls below its threshold. This turns responsible AI from a governance principle into an automated engineering constraint.
[Domain adaptation for foundation models]
The goal of fine-tuning is not to make the model smarter.
It is to make the model:
Think of it as: Creating a specialized enterprise behavior layer on top of a foundation model.
Fine-tuning in Azure AI Foundry allows you to train a foundation model on your own domain-specific data, adapting it to tasks where a general model underperforms. Foundry supports fine-tuning for selected models in the catalogue — including OpenAI GPT-4o mini, Phi, and Llama variants — with the training job managed as a Foundry pipeline and the resulting fine-tuned model deployed as a managed endpoint alongside base model versions.
The use cases where fine-tuning delivers measurable improvement over prompting alone: document classification tasks with domain-specific terminology (financial instrument type classification, legal clause categorisation), extraction tasks where the output format is highly structured and domain-specific (extracting fields from a particular bank’s account statement format), and latency-sensitive tasks where a smaller fine-tuned model can match a larger base model’s accuracy at significantly lower inference cost.
In BFSI, fine-tuning a smaller Phi model on annotated fraud investigation reports — where the output is a structured risk assessment in a specific format — can produce a model that is faster, cheaper, and more consistent than GPT-4o prompted for the same task. The trade-off: fine-tuned models require retraining when the task or output format changes. Prompting a base model is more flexible. Fine-tuning a specialised model is more performant and cost-efficient at scale.
Fine-tuning on customer financial data requires careful data governance. Training data must be anonymised or pseudonymised before leaving your data boundary. Review your organisation’s data processing agreements and the Azure OpenAI data handling policies before initiating a fine-tuning job on production financial data.
[The RAG foundation inside Foundry] Foundation models know public knowledge.
Enterprises need systems that know:
This is where Azure AI Search becomes the retrieval and grounding layer.
Azure AI Search integration is built into Foundry at the connection level — you register an AI Search index as a Foundry connection, and it becomes available as a retrieval tool in Prompt Flow nodes, as a grounding source for Agent Service agents, and as a data source in the playground for testing retrieval quality before building flows.
Foundry’s AI Search integration supports hybrid retrieval — keyword search combined with semantic vector search and metadata filtering — in a single retrieval call. For enterprise knowledge bases where documents have structured metadata (document type, effective date, jurisdiction, product category), metadata filtering combined with semantic search dramatically improves retrieval precision over pure vector search alone.
In a KYC or credit decisioning context, the AI Search index contains regulatory documents — RBI circulars, SEBI guidelines, internal credit policies, product eligibility criteria. When the agent retrieves context before generating a decision, it filters by document type and jurisdiction, retrieves the semantically closest policy sections, and grounds its output in retrieved text rather than model-parameterised knowledge. The retrieved chunks become part of the audit record — what policy text the agent used to reach its conclusion is logged alongside the conclusion itself.
[Built-in content filtering across inputs and outputs] Azure Content Safety is integrated into Foundry as a configurable filter layer that runs on model inputs and outputs. It classifies content across four harm categories — hate and fairness, sexual content, violence, and self-harm — each with configurable severity thresholds (low, medium, high) and a configurable action (allow, block, flag for review). Jailbreak detection is a fifth category that identifies attempts to bypass the model’s instructions or safety guardrails.
Content Safety Framework
1. Input Guardrails
2. Retrieval Safety
3. Reasoning Safety
4. Execution Safety
5. Output Safety
Metrics & Monitoring
👉 In our architecture, this means:
For enterprise architects the key configuration decisions are threshold sensitivity and action per category. A customer-facing banking chatbot has different threshold requirements than an internal analyst tool. A pricing intelligence system has different requirements than a customer complaint handling agent. Foundry allows per-deployment content safety configuration — each managed endpoint can have different filter settings — rather than a single global policy. Content Safety also exposes an API independently of Foundry, so you can run content checks on inputs before they reach the model and on outputs before they reach the user, as part of your own middleware layer. This matters for enterprise systems where the content safety decision itself needs to be logged — the classification result, the severity score, and the action taken — as part of the audit trail for each interaction.
[Linking Foundry to your enterprise data estate]
The AI itself rarely owns business value.
Value comes from the systems it can interact with.
This is where the Connections Framework becomes the integration layer.
Connections are how Foundry accesses resources outside its own boundary — external APIs, Azure services, data stores, and third-party model providers. A connection stores the authentication credentials and endpoint configuration for a resource, making it available to Prompt Flow nodes, Agent Service tool implementations, and playground sessions without embedding credentials in code or flow definitions.
Connection types available in Foundry include Azure OpenAI (separate resource connections for different deployments or regions), Azure AI Search (the RAG retrieval layer), Azure Blob Storage (for training data, evaluation datasets, and document processing inputs), Azure Content Safety (standalone API calls outside the built-in filter), and custom connections for third-party APIs — sanctions list providers, credit bureau APIs, identity verification services — that agent tools need to invoke.
For Hub-level governance, connections defined at the Hub are inherited by all projects. A platform team can define the approved Azure OpenAI endpoints, the production AI Search index, and the sanctioned third-party APIs at Hub level. Project teams connect to these shared resources rather than creating their own. This is the mechanism for enforcing data residency, API security, and cost governance across multiple AI product teams sharing the same Foundry Hub. [Operational visibility into agent and flow execution]
Foundry’s built-in tracing captures the execution detail of every agent run and Prompt Flow execution — every LLM call, every tool invocation, every retrieved document, every token count, every latency measurement — in a structured trace that is viewable in the Foundry portal and queryable via the tracing API. This is operational observability: debugging, performance profiling, and cost analysis.
For each agent run, the trace shows the agent’s reasoning steps in sequence — what it was asked, what tool it selected, what the tool returned, how it reasoned about that result, what it decided to do next. For Prompt Flow executions, the trace shows each node’s inputs, outputs, and duration. Both are invaluable during development and for diagnosing production anomalies. The architectural distinction that matters for regulated systems: Foundry traces are operational logs, not governance audit records. They are not append-only, they do not capture the full business context of a decision, and they are not designed to satisfy a regulatory query about why a specific customer was declined or approved. For governance purposes, build your own audit log — Cosmos DB with append-only access policy — that records the business decision, the regulatory basis, and the constraint checks alongside the agent output. Use Foundry traces for engineering observability. Use your own audit log for compliance.
[Test before you build] The Foundry playground is a browser-based interface for testing models, flows, and agent configurations interactively before committing to a code implementation. You can send messages to any deployed model or agent, adjust system prompts, switch between model versions, test retrieval queries against a connected AI Search index, and see the full response including token usage and latency — all without writing a line of code.
For architects, the playground’s most useful function is comparative testing: deploy two model versions as separate endpoints, test both against the same set of prompts, and assess output quality differences before committing to an A/B split in production. For teams building RAG systems, the playground’s retrieval testing mode — query an AI Search index and see exactly which chunks are retrieved and how they are ranked — is faster and more insightful than writing test code for the same purpose. The playground is also the fastest way to onboard stakeholders and non-technical reviewers into the evaluation process. A compliance officer reviewing whether a KYC agent’s outputs meet regulatory standards can interact with the agent directly in the playground, test edge cases, and provide feedback — without needing access to a development environment or the ability to read code.
This is the question every team building agentic systems on Azure faces, and the wrong answer costs weeks of rework.
The production pattern for complex enterprise AI: Semantic Kernel for agentic orchestration, Prompt Flow for evaluation pipelines that validate the agent’s output quality before and after every deployment. They are complementary — Semantic Kernel owns runtime behaviour, Prompt Flow owns quality assurance.
Three mistakes worth naming because they are common:
The first: treating Foundry as the complete system. An early KYC implementation had the agent output connecting directly to the onboarding system — no constraint enforcement layer between the agent’s verification result and the downstream action. A sanctions match flag in the agent’s output had no enforcement mechanism preventing auto-approval if the confidence threshold was technically met. Compliance review failed immediately. The constraint enforcement service — an Azure Function that validates every agent output against business rules before any downstream action — is not in Foundry. It is infrastructure you design and own.
The second: using Prompt Flow for a multi-agent coordination workflow. A multi-agent system — a document agent, a verification agent, a risk agent, a synthesis agent — does not map cleanly to Prompt Flow’s static graph structure. We spent two weeks building workarounds before rewriting the orchestration in Semantic Kernel and keeping Prompt Flow only for the evaluation pipeline. The right tool for each job was clear in retrospect. It was not clear at the start of the project.
The third: not activating the evaluation framework until after a production incident. We validated the KYC agent manually against fifty test cases before deployment. Six weeks later, a compliance review found that agent outputs were not consistently citing the regulatory basis for their decisions — a requirement under RBI guidelines that our manual testing had not systematically checked. A custom evaluator for regulatory citation, integrated into the deployment pipeline from day one, would have caught the first non-compliant output before it reached production. Evaluation frameworks feel like overhead until the incident they would have prevented actually happens.
Azure AI Foundry is the most complete AI development and serving platform Microsoft has built. The ten components cover the full lifecycle — from model selection and fine-tuning through orchestration, retrieval, safety, evaluation, and observability — in a way that genuinely reduces the infrastructure burden of building production AI systems on Azure.
The architects getting the most value from it are not the ones who use the most components. They are the ones who know what each component does, have made deliberate choices about which to use for each job in their architecture, and have built the surrounding governance layer — constraint enforcement, audit logging, human review workflow — that Foundry itself does not provide.
Azure AI Foundry: The Architect’s Blueprint for Building Enterprise AI at Scale was originally published in Towards AI on Medium, where people are continuing the conversation by highlighting and responding to this story.