Did you just get the task to "make an AI assistant", and you mainly do .NET? Same. I'm also one of those people who rolls their eyes at a lot of the AI hype, and the internet is full of articles where every confident tutorial contradicts the previous one.
So instead of publishing one more "definitive guide" that will age badly in two weeks, here's the version I wish I'd found: what my team actually decided to build, the wrong turns we took on the way, and the code that finally made it click, written for people who know C# but have never built an AI feature.
This is not a best-practices sermon. It's more like: here's the problem, here's what we nearly built, here's what annoyed me, and here's what we landed on.
I work on an app that helps kids learn through interactive 3D models (a heart, a cell, a volcano) in the browser, AR, and VR. We're adding Cori, an assistant you can talk to about whatever model is currently on screen.
"Rotate the heart left." → it rotates
"Why is this chamber bigger?" → it explains
That means Cori has to talk and act at the same time, and both of those outputs have to reach a live 3D viewer.
That, for me, is the actual problem. Not "which model is smartest?" Not "which SDK has the coolest demo?" The interesting part is how the output gets to the client.
Because once you stop thinking about the model as the product and start thinking about delivery as the problem, the architectural decisions get a lot clearer.
Our stack, for context: the backend is .NET with Wolverine + Marten on PostgreSQL and the frontend is Svelte. Fair warning: this is not exactly the most well-paved road in AI land. A lot of AI tooling assumes Python or TypeScript first, and .NET support often arrives later, half-finished, or not at all. So if you're on a similar stack, you're probably not picking from polished examples; you're cutting the path by hand.
Before the story, here are the four words every AI article uses as if everybody was born knowing them.
- Tool: a function you describe to the model, such as `Rotate` or `SearchContent`. Mid-response, it can ask for one of those functions to be called. It does not run your C# code itself; it asks, and your code executes it.
- Agent: in Microsoft Agent Framework, this is the `AIAgent` type.

That's the whole glossary. Enough vocabulary to survive the rest of the article without having to alt-tab every two minutes.
The first decision had nothing to do with streaming or transport. It came from plain distrust.
My biggest fear was not "will the model be smart enough?" It was tying the whole codebase to one vendor SDK, one framework, one opinionated stack, and then getting stranded the moment the ecosystem changed direction, which, in AI, it absolutely will.
So the first rule became simple: talk to abstractions. Use generic interfaces in application code, and keep the actual provider (OpenAI, Deepgram, whoever wins this month) behind DI where it belongs.
Pleasant surprise: .NET actually gives you this now. **Microsoft.Extensions.AI** is basically the `ILogger` pattern, but for AI:
- `IChatClient`: provider-neutral chat / LLM access
- `IEmbeddingGenerator`: embeddings for vector or semantic search
- `ITextToSpeechClient`: text-to-speech
- `Microsoft.Extensions.VectorData`: vendor-neutral vector store abstractions

That means the provider stays a registration detail:
// Register once: OpenAI hidden behind the generic IChatClient.
builder.Services.AddKeyedSingleton<IChatClient>("CoriAI", (sp, _) =>
    sp.GetRequiredService<OpenAIClient>()
        .GetChatClient("gpt-4o-mini")
        .AsIChatClient()
        .AsBuilder()
        .UseFunctionInvocation()
        .Build());
And consuming it is boring in exactly the right way:
public sealed class Summarizer([FromKeyedServices("CoriAI")] IChatClient chat)
{
    public async Task<string> OneLiner(string topic, CancellationToken ct)
    {
        var reply = await chat.GetResponseAsync(
            $"Explain {topic} in one sentence.",
            cancellationToken: ct);
        return reply.Text;
    }
}
Nothing in that class knows or cares whether OpenAI is behind it. That is the point.
Full honesty: when we built the first version of Cori, we did not follow the neat abstraction rule I just described. We wired the code straight to Semantic Kernel, using its types directly and leaning on a bunch of APIs politely labeled `[Experimental]`.
Then Microsoft merged the world around it, Microsoft Agent Framework showed up, Semantic Kernel stopped looking like the future, and the framework we had invested time into became "the old path" while we were still building.
Which meant we had to rewrite more code than I would like to admit.
That experience is exactly why I'm now stubborn about the abstraction layer. Part 0 is not hindsight wisdom from a calm architect on a mountain. It's the bruise talking.
The new pipeline is built on `Microsoft.Extensions.AI` specifically because the next time Microsoft changes direction (and there is always a next time) I want that change to be a DI swap, not a weekend of find in files and quiet swearing.
Once the "brain" side was sorted, the real question became transport: how do the AI's words and actions actually reach the client?
Cori produces two very different kinds of output: semantic output (text, tool calls, and state) and audio (mic audio up, voice down).
And at that point, every .NET developer has the same reflex: we already have a real-time transport, just use SignalR for all of it.
On paper, it looked perfect. One connection. One typed client. Browser and Unity both covered.
             ┌───────── ONE SignalR hub ─────────┐
client ◄────►│ text + tool calls + state         │ ← semantic
             │ mic audio up / voice down         │ ← audio
             └───────────────────────────────────┘
Then we wrote down what that would actually mean building:
start → args → end
That is a surprising amount of custom transport code.
And none of that code is the product. It is just plumbing we invented for ourselves, in a format we would have to maintain forever.
That was the first important moment: we realized we were about to spend serious effort hand-building a private AI event protocol when maybe, just maybe, someone had already done that part for us.
The reason we switched was the same instinct as before: I really did not want to own a custom message format for the next several years.
Microsoft Agent Framework (MAF) ships with support for AG-UI, an open streaming protocol from the CopilotKit world. And annoyingly enough, it already defines almost the exact event shape we were preparing to invent by hand.
Turning the agent into a streaming endpoint is basically one line:
var agent = app.Services.GetRequiredKeyedService<AIAgent>("Cori");
app.MapAGUI("/cori", agent);
From the frontend side, it is just an HTTP POST and a streamed response:
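const res = await fetch("/cori", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  // Simplified payload for illustration; the AG-UI run input also carries
  // fields such as thread/run IDs and a messages array.
  body: JSON.stringify({ message: "Rotate the heart left" }),
});
const reader = res.body!.getReader();
// decode each event/data line as it arrives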
The response comes back as Server-Sent Events: a long-lived HTTP response that keeps writing labeled lines until the run is finished.
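To make the "decode each event/data line as it arrives" step concrete, here is a minimal sketch of an incremental SSE parser. It assumes the server uses standard SSE framing (`event:` and `data:` field lines, with a blank line closing each event). The names `SseEvent` and `createSseParser` are made up for this example and are not part of AG-UI or any SDK.

```typescript
// One parsed Server-Sent Event: the event name plus the joined data payload.
interface SseEvent {
  event: string;
  data: string;
}

// Incremental parser: feed it decoded text chunks in arrival order.
// Chunks may end mid-line, so unfinished text is buffered until its newline arrives.
function createSseParser(onEvent: (e: SseEvent) => void): (chunk: string) => void {
  let buffer = "";
  let eventName = "message"; // default event type in the SSE format
  let dataLines: string[] = [];

  return (chunk: string): void => {
    buffer += chunk;
    let newline = buffer.indexOf("\n");
    while (newline !== -1) {
      // Strip an optional trailing \r so CRLF streams parse the same way.
      const line = buffer.slice(0, newline).replace(/\r$/, "");
      buffer = buffer.slice(newline + 1);

      if (line === "") {
        // A blank line terminates the current event.
        if (dataLines.length > 0) {
          onEvent({ event: eventName, data: dataLines.join("\n") });
        }
        eventName = "message";
        dataLines = [];
      } else if (line.startsWith("event:")) {
        eventName = line.slice(6).trim();
      } else if (line.startsWith("data:")) {
        // The format allows one optional space after the colon.
        dataLines.push(line.slice(5).replace(/^ /, ""));
      }
      // Comment lines (starting with ":") and other fields are ignored in this sketch.
      newline = buffer.indexOf("\n");
    }
  };
}
```

In the browser you would decode each chunk from `reader.read()` with `new TextDecoder().decode(value, { stream: true })` and pass the resulting string to the parser. The buffering matters because a network chunk can end in the middle of a JSON payload.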
And the event stream is exactly the sort of thing we needed:
event: TEXT_MESSAGE_CONTENT data: {"delta":"Sure, rotating "}
event: TEXT_MESSAGE_CONTENT data: {"delta":"the heart now..."}
event: TOOL_CALL_START data: {"name":"Rotate"}
event: TOOL_CALL_ARGS data: {"direction":"LEFT","degrees":45}
event: TOOL_CALL_END data: {}
event: RUN_FINISHED data: {}
That is the whole magic trick.
The text types itself into the UI. Then the model decides to call `Rotate`. Then the 3D viewer reacts. Words and actions travel together on one stream, and the streaming protocol, event lifecycle, and session mechanics are not our responsibility anymore.
That was the first time the architecture started to feel sane.
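On the client, "text types itself, then the viewer reacts" comes down to folding the event stream into a small piece of state. The sketch below uses the simplified event shapes from the listing in this post; real AG-UI payloads carry more fields (IDs, for example), so treat `CoriEvent`, `RunState`, and `applyEvent` as illustrative names rather than protocol types.

```typescript
// Event shapes mirror the simplified stream shown in this post (an assumption).
type CoriEvent =
  | { event: "TEXT_MESSAGE_CONTENT"; data: { delta: string } }
  | { event: "TOOL_CALL_START"; data: { name: string } }
  | { event: "TOOL_CALL_ARGS"; data: Record<string, unknown> }
  | { event: "TOOL_CALL_END"; data: Record<string, unknown> }
  | { event: "RUN_FINISHED"; data: Record<string, unknown> };

interface ToolCall {
  name: string;
  args: Record<string, unknown>;
}

interface RunState {
  text: string;              // assistant text accumulated so far
  pending: ToolCall | null;  // tool call currently being streamed
  completed: ToolCall[];     // tool calls ready to hand to the 3D viewer
  finished: boolean;
}

const initialRunState: RunState = { text: "", pending: null, completed: [], finished: false };

// Pure reducer: returns the next state for one incoming event.
function applyEvent(state: RunState, evt: CoriEvent): RunState {
  switch (evt.event) {
    case "TEXT_MESSAGE_CONTENT":
      return { ...state, text: state.text + evt.data.delta };
    case "TOOL_CALL_START":
      return { ...state, pending: { name: evt.data.name, args: {} } };
    case "TOOL_CALL_ARGS":
      if (state.pending === null) return state; // args without a start: ignore
      return {
        ...state,
        pending: { ...state.pending, args: { ...state.pending.args, ...evt.data } },
      };
    case "TOOL_CALL_END":
      if (state.pending === null) return state;
      return { ...state, pending: null, completed: [...state.completed, state.pending] };
    case "RUN_FINISHED":
      return { ...state, finished: true };
  }
}
```

The UI binds to `text`, and the viewer executes whatever lands in `completed`. Keeping this a pure function makes it easy to replay a recorded stream in a test.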
The agent registration stays pleasantly small:
builder.Services.AddKeyedSingleton<AIAgent>("Cori", (sp, _) =>
    sp.GetRequiredKeyedService<IChatClient>("CoriAI")
        .AsAIAgent(new ChatClientAgentOptions
        {
            ChatOptions = new() { Instructions = CoriSystemPrompt.Base },
            Tools = [ /* Rotate, Zoom, SearchContent, ... */ ]
        }));
And grounding it in our own curriculum is just a context provider:
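public sealed class ContentSearchProvider(IHybridContentSearch search) : AIContextProvider
{
    // Called before each turn; returns extra instructions grounded in our curriculum.
    public override async ValueTask<AIContext> ProvideAIContextAsync(
        InvokingContext context,
        CancellationToken ct)
    {
        // Use the latest user message as the search query.
        var q = context.RequestMessages.LastOrDefault(m => m.Role == ChatRole.User)?.Text;
        if (string.IsNullOrWhiteSpace(q))
        {
            return new AIContext(); // nothing to search for on this turn
        }

        var hits = await search.SearchAsync(q, topK: 3, ct);
        return new AIContext
        {
            Instructions = $"Use this curriculum if relevant:\n{Format(hits)}"
        };
    }
}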
Then you attach it to the agent and MAF calls it before each turn:
.AsAIAgent(new ChatClientAgentOptions
{
    ChatOptions = new() { Instructions = CoriSystemPrompt.Base },
    AIContextProviders = [ new ContentSearchProvider(search) ],
})
That matters because it keeps Cori anchored in our own educational content instead of free-associating from half-remembered internet knowledge.
Here is the important catch: AG-UI only carries text and JSON. No binary audio.
At first that sounds like a limitation. In practice, it turned out to be the insight.
Because audio was always going to be its own problem anyway.
So the real choice was never "SignalR or AG-UI?" The real choice was:
Do we hand-build the text channel on top of SignalR, or take AG-UI for free and solve audio separately, which we were going to have to do either way?
Once we phrased it like that, the argument mostly ended itself.
This was the actual architectural decision.
We never found one ready-made approach that gave us everything (text streaming, tool calls, state, audio, cross-platform friendliness, decent developer ergonomics) in one clean package.
So instead of forcing everything through one pipe, we split the problem into two planes.
    ┌─────────────── Browser / Unity VR client ───────────────┐
    │ AG-UI (POST + SSE)                    Audio (WebSocket) │
    └───────┬────────────────────────────────────────┬────────┘
text/tools/ │ POST turn / SSE reply                  │ mic up / voice down
state       ▼                                        ▼
┌────────────────────────┐              ┌──────────────────────────┐
│ MapAGUI("/cori",agent) │              │ Audio gateway (no agent) │
│ AIAgent - TEXT ONLY    │◄─────────────┤ speech→text, voice down  │
│ tools = 3D commands    │  transcript  │ cancel on barge-in       │
└────────────────────────┘              └──────────────────────────┘
Keeping the agent text-only mattered more than expected.
Because once the agent no longer cares whether the input came from typing, speech-to-text, browser chat, or VR, the same brain can serve all of them. The channel becomes an integration detail instead of something smeared through the entire design.
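A small sketch of what "the agent no longer cares where the input came from" can look like on the client: typed chat and finished speech transcripts are normalized into the same turn request before anything is posted. The `UserInput` and `TurnRequest` shapes and the `isFinal` flag are assumptions made for this example, not part of our actual client or of AG-UI.

```typescript
// Hypothetical input shapes: typed chat and speech-to-text results.
type UserInput =
  | { kind: "typed"; text: string }
  | { kind: "transcript"; text: string; isFinal: boolean };

// Same simplified body as the fetch example in this post.
interface TurnRequest {
  message: string;
}

// Returns the request to POST to the agent, or null when there is nothing to send yet.
function toTurnRequest(input: UserInput): TurnRequest | null {
  // Partial transcripts are still changing, so they never start a turn.
  if (input.kind === "transcript" && !input.isFinal) return null;
  const message = input.text.trim();
  return message.length > 0 ? { message } : null;
}
```

Everything after this function is identical for keyboard, microphone, browser, and VR, which is the property the split is meant to protect.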
This is not a fairy-tale architecture with no cost.
Splitting the planes means coordinating two channels. Barge-in gets trickier. The Unity story for AG-UI still feels less proven than I'd like. And the .NET AG-UI host is preview enough that version pinning is not optional.
Still, the trade-off felt worth it.
Owning an open protocol is better than owning a private one. POST/SSE and WebSocket also pass more easily through school networks than anything that smells like WebRTC or UDP, which matters a lot more in education than flashy diagrams on conference slides.
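One way to reason about barge-in across two channels is to tag every turn with an ID and drop anything that arrives for a turn the user has already interrupted. This is a sketch of the idea, not our implementation; `TurnGate` and `deliverIfCurrent` are hypothetical names.

```typescript
// Tracks which turn is "live". Starting a new turn invalidates all earlier ones.
class TurnGate {
  private current = 0;

  // Call when a new user turn starts (including a barge-in); returns that turn's id.
  begin(): number {
    this.current += 1;
    return this.current;
  }

  // True only for output that belongs to the most recent turn.
  isCurrent(turnId: number): boolean {
    return turnId === this.current;
  }
}

// Delivers a text event or audio chunk only if its turn has not been interrupted.
function deliverIfCurrent<T>(
  gate: TurnGate,
  turnId: number,
  chunk: T,
  deliver: (c: T) => void,
): boolean {
  if (!gate.isCurrent(turnId)) return false; // stale output from an interrupted turn
  deliver(chunk);
  return true;
}
```

Both planes can share one gate: the SSE handler and the audio playback path check the same turn ID, so an interruption silences text and voice together even if the server is still sending.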
I'm not going to fake certainty here. The text side feels understandable now. Voice still does not.
Text is nice because it is turn-based and bounded. One request, one response. Voice is continuous, latency-sensitive, messy, and full of edge cases: users interrupting, classroom noise, weird s, headsets with personality disorders, and all the other things reality likes to contribute.
What we have right now is a shape, not a finished answer.
Conceptually, it looks like this:
mic ─► speech-to-text ─► transcript ─► FE ─► MapAGUI run ─► text + tool calls
FE ─► audio channel ─► text-to-speech ─► voice down
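For the second leg (FE to text-to-speech), one plausible approach is to forward text sentence by sentence instead of waiting for the full reply, so speech can start while the model is still writing. The sketch below is an assumption about how that could work, not a description of what we shipped; `createSentenceChunker` is a hypothetical helper.

```typescript
// Collects streamed text deltas and emits complete sentences as soon as they close.
function createSentenceChunker(onSentence: (sentence: string) => void) {
  let buffer = "";
  // A sentence boundary here is ., !, or ? followed by whitespace.
  const boundary = /[.!?]\s+/;

  return {
    push(delta: string): void {
      buffer += delta;
      let match = boundary.exec(buffer);
      while (match !== null) {
        const end = match.index + 1; // keep the punctuation mark
        const sentence = buffer.slice(0, end).trim();
        if (sentence.length > 0) onSentence(sentence);
        buffer = buffer.slice(match.index + match[0].length);
        match = boundary.exec(buffer);
      }
    },
    // Call when the run finishes to emit whatever text is left.
    flush(): void {
      const rest = buffer.trim();
      buffer = "";
      if (rest.length > 0) onSentence(rest);
    },
  };
}
```

The boundary rule is deliberately naive: abbreviations such as "e.g." followed by a space will split early. That is usually tolerable for speech output, but it is a known limit of this sketch.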
Because "the technically correct real-time media stack" and "the thing that behaves on random school devices, Macs, standalone VR headsets, and mystery tablets" are not always the same thing.
WebRTC is great in the right environment. Our environment is not the right environment often enough.
So the pragmatic approach is simpler: capture raw PCM on the client and send it over a plain WebSocket. It costs more bandwidth, but it is easier to reason about, easier to debug, and much more predictable across devices.
Sometimes the fancy option is right. Sometimes the right option is the one a tired developer can actually support in a school deployment without becoming a part-time audio detective.
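To put "raw PCM over a plain WebSocket" and "costs more bandwidth" into concrete terms, here is a minimal sketch of the conversion step and the arithmetic. Browser audio APIs typically hand you 32-bit float samples in the range -1 to 1. The 16 kHz mono, 16-bit format used in the usage note is an assumption (a common speech-recognition input format), not something fixed by our design.

```typescript
// Converts float samples (-1..1) to 16-bit signed little-endian PCM bytes.
function floatTo16BitPcm(samples: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range samples cannot wrap around.
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    // Negative values scale to -32768, positive values to 32767.
    const value = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(i * 2, Math.round(value), true); // true = little-endian
  }
  return buffer;
}

// Bytes per second for uncompressed PCM.
function pcmBytesPerSecond(sampleRateHz: number, channels: number, bytesPerSample: number): number {
  return sampleRateHz * channels * bytesPerSample;
}
```

With the assumed format, `pcmBytesPerSecond(16000, 1, 2)` is 32,000 bytes per second, or 256 kbit/s per speaking client. That is noticeably more than a compressed codec would use, but small enough to be a reasonable price for a stream you can inspect byte by byte. Each converted buffer can go out as a binary frame with `socket.send(buffer)`.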
The obvious concern with this design is hop count: speech-to-text, a trip through the frontend, the agent run, and then text-to-speech.
That sounds slow on paper.
But in practice, for turn-based conversation, the added delay measured under a second. That is not zero, but it is also not enough to make the interaction feel broken.
So we are deliberately not optimizing that path yet.
If seamless speech-to-speech becomes a hard requirement, that decision probably changes.
This is the honest backlog:
So no, this is not the satisfying part of the blog post where everything is solved and there is triumphant orchestral music in the background.
Voice is still where the dragons are.
If I had to compress the whole thing into a few lessons, it would be these:
- `Microsoft.Extensions.AI` outlives whatever framework is fashionable this quarter.

That last point is probably my favorite. In AI work, there is always pressure to make the "brain" feel magical. Most of the time, the better move is to make it boring, predictable, and easy to swap around.
On the surface, this post is about a .NET team building an assistant for a 3D learning product.
But underneath that, it is really about something more familiar: trying to add a new capability without letting the hype blow up the architecture.
The decisions we've landed on so far are these:
- Provider access behind `Microsoft.Extensions.AI`
- A text-only `AIAgent` exposed over AG-UI through `MapAGUI`.

Each of those will probably become its own follow-up article because each one hid more detail than expected.
There is still a lot of implementation left. None of this is "battle-tested and perfect." It is "this is the architecture we chose, this is why we chose it, and this is the shape of the problems we are still solving."
Which, honestly, is probably more useful than another article pretending the road was smooth.
If you are a .NET developer trying to build your first AI feature, that is the main thing I would want to pass on: you do not need to start by finding the smartest model. You need to find the seam in your system, protect your abstractions, and avoid writing infrastructure you do not actually want to own.