How a .NET dev built an AI assistant

A .NET developer building an AI assistant for an interactive 3D learning app describes the architectural decisions behind Cori, an assistant that can talk and act simultaneously in a live 3D viewer. The team used Microsoft.Extensions.AI abstractions to keep the codebase vendor-neutral, avoiding lock-in to any single AI provider.

Did you just get the task to “make an AI assistant”, and you mainly do .NET? Same.

I’m also one of those people who rolls their eyes at a lot of the AI hype, and the internet is full of articles where every confident tutorial contradicts the previous one. So instead of publishing one more “definitive guide” that will age badly in two weeks, here’s the version I wish I’d found: what my team actually decided to build, the wrong turns we took on the way, and the code that finally made it click, written for people who know C# but have never built an AI feature.

This is not a best-practices sermon. It’s more like: here’s the problem, here’s what we nearly built, here’s what annoyed me, and here’s what we landed on.

I work on an app that helps kids learn through interactive 3D models (a heart, a cell, a volcano) in the browser, AR, and VR. We’re adding Cori, an assistant you can talk to about whatever model is currently on screen.

- “Rotate the heart left.” → it rotates
- “Why is this chamber bigger?” → it explains

That means Cori has to talk and act at the same time, and both of those outputs have to reach a live 3D viewer.

That, for me, is the actual problem. Not “which model is smartest?” Not “which SDK has the coolest demo?” The interesting part is how the output gets to the client. Because once you stop thinking about the model as the product and start thinking about delivery as the problem, the architectural decisions get a lot clearer.

Our stack, for context: the backend is .NET with Wolverine + Marten on PostgreSQL and the frontend is Svelte. Fair warning: this is not exactly the most well-paved road in AI land. A lot of AI tooling assumes Python or TypeScript first, and .NET support often arrives later, half-finished, or not at all. So if you’re on a similar stack, you’re probably not picking from polished examples; you’re cutting the path by hand.

Before the story, here are the four words every AI article uses as if everybody was born knowing them.
- … Rotate or SearchContent. Mid-response, it can ask for one of those functions to be called. It does not run your C# code itself; it asks, and your code executes it.
- … AIAgent type.

That’s the whole glossary. Enough vocabulary to survive the rest of the article without having to alt-tab every two minutes.

The first decision had nothing to do with streaming or transport. It came from plain distrust. My biggest fear was not “will the model be smart enough?” It was tying the whole codebase to one vendor SDK, one framework, one opinionated stack, and then getting stranded the moment the ecosystem changed direction, which, in AI, it absolutely will.

So the first rule became simple: talk to abstractions. Use generic interfaces in application code, and keep the actual provider (OpenAI, Deepgram, whoever wins this month) behind DI where it belongs.

Pleasant surprise: .NET actually gives you this now. Microsoft.Extensions.AI is basically the ILogger pattern, but for AI:

- IChatClient: provider-neutral chat / LLM access
- IEmbeddingGenerator: embeddings for vector or semantic search
- ITextToSpeechClient: text-to-speech
- Microsoft.Extensions.VectorData: vendor-neutral vector store abstractions

That means the provider stays a registration detail:

    // Register once: OpenAI hidden behind the generic IChatClient.
    builder.Services.AddKeyedSingleton<IChatClient>("CoriAI", (sp, _) =>
        sp.GetRequiredService<OpenAIClient>()
          .GetChatClient("gpt-4o-mini")
          .AsIChatClient()
          .AsBuilder()
          .UseFunctionInvocation() // lets the client run tool calls the model asks for
          .Build());

And consuming it is boring in exactly the right way:

    public sealed class Summarizer([FromKeyedServices("CoriAI")] IChatClient chat)
    {
        public async Task<string> OneLiner(string topic, CancellationToken ct)
        {
            var reply = await chat.GetResponseAsync(
                $"Explain {topic} in one sentence.", cancellationToken: ct);
            return reply.Text;
        }
    }

Nothing in that class knows or cares whether OpenAI is behind it. That is the point.
Full honesty: when we built the first version of Cori, we did not follow the neat abstraction rule I just described. We wired the code straight to Semantic Kernel, using its types directly and leaning on a bunch of APIs politely labeled Experimental. Then Microsoft merged the world around it, Microsoft Agent Framework (MAF) showed up, Semantic Kernel stopped looking like the future, and the framework we had invested time into became “the old path” while we were still building. Which meant we had to rewrite more code than I would like to admit.

That experience is exactly why I’m now stubborn about the abstraction layer. Part 0 is not hindsight wisdom from a calm architect on a mountain. It’s the bruise talking. The new pipeline is built on Microsoft.Extensions.AI specifically because the next time Microsoft changes direction (and there is always a next time) I want that change to be a DI swap, not a weekend of find-in-files and quiet swearing.

Once the “brain” side was sorted, the real question became transport: how do the AI’s words and actions actually reach the client? Cori produces two very different kinds of output: semantic (text, tool calls, state) and audio.

And at that point, every .NET developer has the same reflex: we already have a real-time transport, just use SignalR for all of it. On paper, it looked perfect. One connection. One typed client. Browser and Unity both covered.

                 ┌──────── ONE SignalR hub ────────┐
    client ◄────►│  text + tool calls + state      │ ← semantic
                 │  mic audio up / voice down      │ ← audio
                 └─────────────────────────────────┘

Then we wrote down what that would actually mean building, right down to a start → args → end event lifecycle. That is a surprising amount of custom transport code. And none of that code is the product. It is just plumbing we invented for ourselves, in a format we would have to maintain forever.

That was the first important moment: we realized we were about to spend serious effort hand-building a private AI event protocol when maybe, just maybe, someone had already done that part for us.
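To give a sense of scale, here is roughly what one slice of that plumbing looks like on the client: reassembling a streamed tool call from start → args → end messages. This is a hypothetical sketch, not code from our project. The message shape (`kind`, `callId`, `name`, `chunk`) and the `createToolCallAssembler` name are invented for illustration.

```javascript
// Hypothetical hand-rolled envelope (an assumption for this sketch):
//   { kind: "start" | "args" | "end", callId, name?, chunk? }
// Returns a handler that buffers argument fragments per call and invokes
// onCall(name, args) once the matching "end" message arrives.
function createToolCallAssembler(onCall) {
  const pending = new Map(); // callId -> { name, json }

  return function handle(msg) {
    switch (msg.kind) {
      case "start":
        pending.set(msg.callId, { name: msg.name, json: "" });
        break;
      case "args": {
        const call = pending.get(msg.callId);
        if (!call) throw new Error(`args for unknown call ${msg.callId}`);
        call.json += msg.chunk; // arguments may arrive as JSON fragments
        break;
      }
      case "end": {
        const call = pending.get(msg.callId);
        if (!call) throw new Error(`end for unknown call ${msg.callId}`);
        pending.delete(msg.callId);
        // Only parse once the call is complete; partial JSON is not valid JSON.
        onCall(call.name, call.json ? JSON.parse(call.json) : {});
        break;
      }
      default:
        throw new Error(`unknown message kind: ${msg.kind}`);
    }
  };
}
```

And that is only the happy path for one message family. Text deltas, state updates, errors, and cancellation would each need the same treatment, on both ends of the connection.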
The reason we switched was the same instinct as before: I really did not want to own a custom message format for the next several years. MAF ships with support for AG-UI, an open streaming protocol from the CopilotKit world. And annoyingly enough, it already defines almost the exact event shape we were preparing to invent by hand.

Turning the agent into a streaming endpoint is basically one line:

    var agent = app.Services.GetRequiredKeyedService<AIAgent>("Cori");
    app.MapAGUI("/cori", agent);

From the frontend side, it is just an HTTP POST and a streamed response:

    const res = await fetch("/cori", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ message: "Rotate the heart left" }),
    });
    const reader = res.body.getReader(); // decode each event/data line as it arrives

The response comes back as Server-Sent Events: a long-lived HTTP response that keeps writing labeled lines until the run is finished. And the event stream is exactly the sort of thing we needed:

    event: TEXT_MESSAGE_CONTENT
    data: {"delta":"Sure, rotating "}

    event: TEXT_MESSAGE_CONTENT
    data: {"delta":"the heart now…"}

    event: TOOL_CALL_START
    data: {"name":"Rotate"}

    event: TOOL_CALL_ARGS
    data: {"direction":"LEFT","degrees":45}

    event: TOOL_CALL_END
    data: {}

    event: RUN_FINISHED
    data: {}

That is the whole magic trick. The text types itself into the UI. Then the model decides to call Rotate. Then the 3D viewer reacts. Words and actions travel together on one stream, and the streaming protocol, event lifecycle, and session mechanics are not our responsibility anymore. That was the first time the architecture started to feel sane.

The agent registration stays pleasantly small:

    builder.Services.AddKeyedSingleton<AIAgent>("Cori", (sp, _) =>
        sp.GetRequiredKeyedService<IChatClient>("CoriAI")
          .AsAIAgent(new ChatClientAgentOptions
          {
              ChatOptions = new() { Instructions = CoriSystemPrompt.Base },
              Tools = [ /* Rotate, Zoom, SearchContent, ... */ ]
          }));

And grounding it in our own curriculum is just a context provider:

    public sealed class ContentSearchProvider(IHybridContentSearch search) : AIContextProvider
    {
        public override async ValueTask<AIContext> ProvideAIContextAsync(
            InvokingContext context, CancellationToken ct)
        {
            // Use the most recent user message as the search query.
            var q = context.RequestMessages.LastOrDefault(m => m.Role == ChatRole.User)?.Text;

            // Nothing to search for (e.g. no user message yet): add no extra context.
            if (string.IsNullOrWhiteSpace(q))
                return new AIContext();

            var hits = await search.SearchAsync(q, topK: 3, ct);
            return new AIContext
            {
                Instructions = $"Use this curriculum if relevant:\n{Format(hits)}"
            };
        }
    }

Then you attach it to the agent and MAF calls it before each turn:

    .AsAIAgent(new ChatClientAgentOptions
    {
        ChatOptions = new() { Instructions = CoriSystemPrompt.Base },
        AIContextProviders = [new ContentSearchProvider(search)],
    })

That matters because it keeps Cori anchored in our own educational content instead of free-associating from half-remembered internet knowledge.

Here is the important catch: AG-UI only carries text and JSON. No binary audio. At first that sounds like a limitation. In practice, it turned out to be the insight. Because audio was always going to be its own problem anyway.

So the real choice was never “SignalR or AG-UI?” The real choice was: do we hand-build the text channel on top of SignalR, or take AG-UI for free and solve audio separately, which we were going to have to do either way? Once we phrased it like that, the argument mostly ended itself.

This was the actual architectural decision. We never found one ready-made approach that gave us everything (text streaming, tool calls, state, audio, cross-platform friendliness, decent developer ergonomics) in one clean package. So instead of forcing everything through one pipe, we split the problem into two planes.
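On the text plane, most of the client work is reading that event stream. Here is a minimal sketch, assuming the events arrive in the simplified event:/data: shape shown earlier in this post, with a blank line between events. The helper names `createSseParser` and `readAgentStream` are mine, not part of AG-UI or any SDK, and a real client may prefer an existing AG-UI client library.

```javascript
// Incremental SSE parser: feed it decoded text chunks, and it calls
// onEvent(eventName, parsedJson) for every complete block.
// Assumptions: blocks end with a blank line, data is JSON, and line endings
// are LF or CRLF (bare CR line endings are not handled in this sketch).
function createSseParser(onEvent) {
  let buffer = "";

  return function feed(chunk) {
    // Normalize after concatenating so a CRLF split across chunks still works.
    buffer = (buffer + chunk).replace(/\r\n/g, "\n");

    let end;
    while ((end = buffer.indexOf("\n\n")) !== -1) {
      const block = buffer.slice(0, end);
      buffer = buffer.slice(end + 2);

      let event = "message"; // SSE default when no event: line is present
      const data = [];
      for (const line of block.split("\n")) {
        if (line.startsWith("event:")) event = line.slice(6).trim();
        else if (line.startsWith("data:")) data.push(line.slice(5).trim());
      }
      if (data.length > 0) onEvent(event, JSON.parse(data.join("\n")));
    }
  };
}

// Reads a fetch() response body and forwards each parsed event to onEvent.
async function readAgentStream(body, onEvent) {
  const feed = createSseParser(onEvent);
  const decoder = new TextDecoder();
  const reader = body.getReader();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries.
    feed(decoder.decode(value, { stream: true }));
  }
}
```

With that in place, the UI side is a switch on the event name: append the delta for TEXT_MESSAGE_CONTENT, forward the arguments to the 3D viewer for TOOL_CALL_ARGS, and stop the spinner on RUN_FINISHED.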
        ┌──────────────── Browser / Unity VR client ────────────────┐
        │ AG-UI (POST + SSE)                 Audio WebSocket        │
        └────────┬─────────────────────────────────┬────────────────┘
     text/tools/ │ POST turn / SSE reply           │ mic up / voice down
     state       ▼                                 ▼
    ┌─────────────────────────┐              ┌──────────────────────────┐
    │ MapAGUI("/cori", agent) │              │ Audio gateway (no agent) │
    │ AIAgent: TEXT ONLY      │◄─transcript──┤ speech→text, voice down  │
    │ tools = 3D commands     │              │ cancel on barge-in       │
    └─────────────────────────┘              └──────────────────────────┘

That last point mattered more than expected. Because once the agent no longer cares whether the input came from typing, speech-to-text, browser chat, or VR, the same brain can serve all of them. The channel becomes an integration detail instead of something smeared through the entire design.

This is not a fairy-tale architecture with no cost. Splitting the planes means coordinating two channels. Barge-in gets trickier. The Unity story for AG-UI still feels less proven than I’d like. And the .NET AG-UI host is preview enough that version pinning is not optional.

Still, the trade-off felt worth it. Owning an open protocol is better than owning a private one. POST/SSE and WebSocket also pass more easily through school networks than anything that smells like WebRTC or UDP, which matters a lot more in education than flashy diagrams on conference slides.

I’m not going to fake certainty here. The text side feels understandable now. Voice still does not.

Text is nice because it is turn-based and bounded. One request, one response. Voice is continuous, latency-sensitive, messy, and full of edge cases: users interrupting, classroom noise, weird pauses, headsets with personality disorders, and all the other things reality likes to contribute.

What we have right now is a shape, not a finished answer.
Conceptually, it looks like this:

    mic ─► speech-to-text ─► transcript ─► FE ─► MapAGUI run ─► text + tool calls
    FE ─► audio channel ─► text-to-speech ─► voice down

Because “the technically correct real-time media stack” and “the thing that behaves on random school devices, Macs, standalone VR headsets, and mystery tablets” are not always the same thing. WebRTC is great in the right environment. Our environment is not the right environment often enough.

So the pragmatic approach is simpler: capture raw PCM on the client and send it over a plain WebSocket. It costs more bandwidth, but it is easier to reason about, easier to debug, and much more predictable across devices. Sometimes the fancy option is right. Sometimes the right option is the one a tired developer can actually support in a school deployment without becoming a part-time audio detective.

The obvious concern with this design is hop count. That sounds slow on paper. But in practice, for turn-based conversation, the added delay measured under a second. That is not zero, but it is also not enough to make the interaction feel broken. So we are deliberately not optimizing that path yet. If seamless speech-to-speech becomes a hard requirement, that decision probably changes.

This is the honest backlog:

So no, this is not the satisfying part of the blog post where everything is solved and there is triumphant orchestral music in the background. Voice is still where the dragons are.

If I had to compress the whole thing into a few lessons, it would be these:

- Microsoft.Extensions.AI outlives whatever framework is fashionable this quarter.

That last point is probably my favorite. In AI work, there is always pressure to make the “brain” feel magical. Most of the time, the better move is to make it boring, predictable, and easy to swap around.

On the surface, this post is about a .NET team building an assistant for a 3D learning product.
But underneath that, it is really about something more familiar: trying to add a new capability without letting the hype blow up the architecture.

The decisions we’ve landed on so far are these:

- Microsoft.Extensions.AI
- AIAgent exposed over AG-UI (MapAGUI)

Each of those will probably become its own follow-up article because each one hid more detail than expected.

There is still a lot of implementation left. None of this is “battle-tested and perfect.” It is “this is the architecture we chose, this is why we chose it, and this is the shape of the problems we are still solving.” Which, honestly, is probably more useful than another article pretending the road was smooth.

If you are a .NET developer trying to build your first AI feature, that is the main thing I would want to pass on: you do not need to start by finding the smartest model. You need to find the seam in your system, protect your abstractions, and avoid writing infrastructure you do not actually want to own.