{"slug": "things-i-learned-building-my-first-multi-agent-ai-system-on-azure-nvidia", "title": "Things I learned building my first multi-agent AI system on Azure + NVIDIA", "summary": "A developer built a multi-agent customer support system on Azure AI Foundry and NVIDIA NIM, learning twelve hard-won lessons. Key findings include that token cost varies 5-10x by model, verbatim caching is useless for natural language, and OpenTelemetry requires explicit instrumentation and version pinning. The developer also discovered that NVIDIA's Nemotron models output reasoning in a separate field and that short-lived scripts drop traces without an atexit flush.", "body_md": "I recently built a multi-agent customer support system on Azure AI Foundry and NVIDIA NIM. First time doing anything like this. Made four predictions upfront about what would happen. Three of them were wrong.\n\nHere is what I actually learned.\n\n**1. \"Tokens\" is not a unit of cost**\n\nIt is a unit of work. The price per unit of work varies by 5-10x depending on which model did the work. I was tracking total token count across both the small 9B model and the large 49B model as if they cost the same. They do not. Total tokens went up in the optimized version. Cost in dollars probably went down. I was measuring the wrong thing the whole time.\n\n**2. A verbatim hash cache on natural language traffic deflects ~0% of queries**\n\nI predicted 25-40% cache deflection. The actual number was 0%. Every query in my test set was a unique string, so the hash-based cache never had a single chance to fire. A verbatim cache is not a simpler version of a semantic cache. It is a different thing entirely. If your workload is natural language, build semantic similarity caching from day one, not as an upgrade later.\n\n**3. 
configure_azure_monitor() does not capture OpenAI SDK calls by default**\n\nYou need to install and initialize opentelemetry-instrumentation-httpx explicitly:\n\npip install opentelemetry-instrumentation-httpx==0.61b0\n\nfrom opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor\n\nHTTPXClientInstrumentor().instrument()\n\nWithout this, your App Insights Logs will show customMetric and performanceCounter entries (CPU, memory) but nothing about what your agent actually did.\n\n**4. Pin your OpenTelemetry versions or everything breaks**\n\nInstalling opentelemetry-instrumentation-httpx without version pinning pulled in opentelemetry-api 1.42.1. But azure-monitor-opentelemetry-exporter needs opentelemetry-api==1.40. The conflict is silent until things start misbehaving. Pin everything to the 0.61b0 / 1.40.0 line:\n\npip install \"opentelemetry-api==1.40.0\" \"opentelemetry-instrumentation==0.61b0\" \"opentelemetry-instrumentation-httpx==0.61b0\" \"opentelemetry-semantic-conventions==0.61b0\" \"opentelemetry-util-http==0.61b0\"\n\nThen run pip check to confirm no broken requirements.\n\n**5. Short-lived Python scripts exit before OTel's batch exporter fires**\n\nOpenTelemetry batches traces and sends them every few seconds in the background. If your script finishes before that timer fires, the traces are dropped silently. Not delayed. Gone. Add an atexit flush:\n\nimport atexit\n\nfrom opentelemetry import trace\n\ndef _flush():\n\n    provider = trace.get_tracer_provider()\n\n    if hasattr(provider, \"force_flush\"):\n\n        provider.force_flush()\n\natexit.register(_flush)\n\nThis flushes buffered traces when the interpreter exits normally, so they are pushed out before the process ends.\n\n**6. Nemotron Nano and Super put output in reasoning_content, not content**\n\nOn short prompts, both models spend their token budget on internal reasoning and never produce a content field. 
It comes back as None. Calling msg.content.strip() then crashes with AttributeError.\n\nAlways extract text like this:\n\ntext = (\n\nmsg.content\n\nor getattr(msg, \"reasoning_content\", None)\n\nor \"(no response)\"\n\n)\n\nThis applies everywhere you read model output: classifiers, answer functions, test scripts, all of it.\n\n**7. The NVIDIA model name in the catalog is not the API model string**\n\nnvidia/nemotron-nano-9b-v2 returns 404. The actual API string has a double prefix:\n\nnvidia/nvidia-nemotron-nano-9b-v2\n\nGo to build.nvidia.com, open the model card, click the Python tab, and copy the model= value directly. Do not guess from the catalog name.\n\n**8. max_tokens=10 does not work for reasoning models doing classification**\n\nI set max_tokens=10 for my classifier call, expecting a one-word label back. Nemotron Nano spent all 10 tokens on its reasoning trace and never produced a label. content came back as None. Set at least max_tokens=100 for any call to a reasoning model, even for simple classification tasks.\n\n**9. Routing decisions need their own log line**\n\nI built a router, ran 81 queries through it, got a full benchmark result, and still cannot tell you with confidence what it routed where. The per-category tables in my benchmark were grouped by ground-truth label, not by what the router actually decided. These are not the same thing.\n\nLog the routing decision explicitly on every query, separate from everything else.\n\n**10. Graceful degradation cannot be tested in a sequential benchmark**\n\nI built a downshift mechanism that triggers when the reasoning model's rolling p95 latency exceeds 4000ms. It requires 20 samples in the window before it can activate. My entire eval set had 12 reasoning queries. With only 12 samples available, the mechanism could never trigger, and that was settled before I ran a single query.\n\nTo test saturation behavior you need either a much larger dataset or a dedicated concurrent load test, not a sequential single-pass benchmark.\n\n**11. 
Homebrew Python 3.12 has a libexpat conflict on some macOS versions**\n\npython3.12 -m venv .venv fails with:\n\nImportError: Symbol not found: _XML_SetAllocTrackerActivationThreshold\n\nHomebrew's Python was compiled against a newer libexpat than what macOS ships. The fix is pyenv:\n\nbrew install pyenv\n\npyenv install 3.12.13\n\npyenv local 3.12.13\n\npython -m venv .venv\n\nEverything works after that.\n\n**12. The operational layer is harder than the model layer**\n\nEvery mistake in this list is something I built, measured, or configured wrong. None of them are problems with Azure AI Foundry or NVIDIA NIM. The models worked. The platforms worked. The gaps were all in how I instrumented, tested, and measured the system around them.\n\nThat is probably the most useful thing I learned.\n\nFull benchmark results and write-up on my blog:\n\n[https://sachin.magonus.com/2026/06/17/multi-agent-poc-benchmark-foundry-nvdia-nim-results/](https://sachin.magonus.com/2026/06/17/multi-agent-poc-benchmark-foundry-nvdia-nim-results/)\n\nFramework post (the architecture and predictions before I ran anything):\n\n[https://sachin.magonus.com/2026/01/16/multi-agent-framework-foundry-nvidia/](https://sachin.magonus.com/2026/01/16/multi-agent-framework-foundry-nvidia/)", "url": "https://wpnews.pro/news/things-i-learned-building-my-first-multi-agent-ai-system-on-azure-nvidia", "canonical_source": "https://dev.to/sachin_magon/things-i-learned-building-my-first-multi-agent-ai-system-on-azure-nvidia-325p", "published_at": "2026-06-29 21:15:09+00:00", "updated_at": "2026-06-29 21:48:52.931242+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-agents", "developer-tools", "ai-infrastructure"], "entities": ["Azure AI Foundry", "NVIDIA NIM", "Nemotron Nano", "OpenTelemetry", "Azure Monitor", "OpenAI", "App Insights", "HTTPXClientInstrumentor"], "alternates": {"html": 
"https://wpnews.pro/news/things-i-learned-building-my-first-multi-agent-ai-system-on-azure-nvidia", "markdown": "https://wpnews.pro/news/things-i-learned-building-my-first-multi-agent-ai-system-on-azure-nvidia.md", "text": "https://wpnews.pro/news/things-i-learned-building-my-first-multi-agent-ai-system-on-azure-nvidia.txt", "jsonld": "https://wpnews.pro/news/things-i-learned-building-my-first-multi-agent-ai-system-on-azure-nvidia.jsonld"}}