When MCP tools break, isolate the server before blaming the agent

A developer debugging an MCP (Model Context Protocol) integration found that an LLM agent's self-report of unavailable tools was misleading. The agent claimed that new stdio MCP servers were not connected, but the real issue was a malformed test invocation, not server failure. The developer advises isolating the server layer and using direct connectivity checks like the MCP Inspector instead of relying on the agent's narration.

You add a new MCP server, or a new tool to an existing one. You ask the agent to use it and it says the tool isn't available — or worse, it confidently reports that "only the HTTP server is connected" and the new stdio servers aren't live. The instinct is to blame the MCP server or bounce the runtime. Resist it. The agent's narration about its own toolset is not ground truth, and a malformed test invocation will happily manufacture a fake "MCP unreachable" symptom. Here's the broke → tried → fixed of a debugging session that burned time on exactly this. We'd just added a second stdio MCP server and a new tool. To smoke-test it, we hand-typed a quick probe at the agent runtime and asked it to use the new capability. The agent came back with this: Those tools aren't in my live toolset right now — only the Home Assistant HTTP/SSE server is connected. I'll read the files directly instead. It then shelled out to read files off disk to answer the question — being resourceful, which made it look like the tools genuinely weren't wired up. Two new stdio servers, both apparently invisible to the session, while the one HTTP/SSE server was fine. That sure reads like the stdio MCP servers aren't connecting. That self-report is the trap. Treating "the agent says the tools aren't available" as a diagnosis — instead of a symptom — is what sent us down the wrong path. There are two failure modes here and they look identical from the agent's chair: The agent can't tell these apart for you, and it will narrate either one as "tools not available." Worse, an LLM describing its own capabilities is itself a confabulation risk — it pattern-completes a plausible story "only the HTTP server is live" around the fact that some tools didn't show up. That story is not evidence about server health. So the discipline is: don't ask the agent whether the server is up. Ask the server. Isolate the layer before you touch anything. We took the self-report at face value. The agent said the stdio servers weren't connected, so we assumed they weren't connected. This is the root mistake every later dead-end inherits from — we let the agent's narration stand in for a connectivity check. It is not one. "Tools didn't load? Refresh them — restart the runtime." So we bounced the long-running agent gateway daemon, the shape of which is: generic shape — restart the long-running agent runtime/gateway <runtime gateway restart Then re-ran the same probe. No change: Still only the Home Assistant server is connected — the other tools aren't in my toolset. Restarting a daemon to "refresh tools" is a guess, not a diagnosis. If the server were genuinely stale you'd want evidence of that first . We had none — we were rebooting on a hunch, and the hunch was wrong. It also cost real time: a gateway restart drops every in-flight session. We re-ran the hand-typed probe a few more times, tweaking the wording, still getting "tools not available." At this point it looks airtight: server's been restarted, agent still can't see the tools, therefore the server must be broken. Every signal pointed at the MCP server — and every signal was coming through the same malformed probe, so they were all the same false negative wearing different hats. Stop reasoning through the agent. Isolate the layer , then reproduce the exact production invocation. Spawn the server fresh, outside any agent session, and ask it to enumerate its tools. Most runtimes ship a connectivity check that does exactly this; the vendor-neutral tool is the MCP Inspector https://modelcontextprotocol.io/docs/tools/inspector . If your runtime has a built-in check generic shape — it spawns the server fresh, bypassing the session : <runtime mcp test <server Or talk to the server directly with the MCP Inspector — no runtime involved at all. The CLI mode is perfect for a connectivity assertion: list the tools a stdio server actually exposes, runtime out of the picture npx @modelcontextprotocol/inspector --cli \ <your-server-launch-command --method tools/list npx @modelcontextprotocol/inspector with no --cli opens the browser UI at http://localhost:6274 — connect, open the Tools tab, click List Tools . The output settles it immediately: Connected. server: signalk → 7 tools discovered: read sensor, battery state, get route, get active alarms, ... server: vessel knowledge → 4 tools discovered: find equipment, get equipment, ... Both servers connect and enumerate every tool. The servers were fine the entire time — through all three dead-ends. That one command would have saved the gateway restart and every ad-hoc re-probe. If the server lists its tools standalone, the bug if any is in how the session loads them , not the server. So stop hand-typing variants and run the agent the way production runs it — same skill/profile selector, same flags, same transport. Our production path a voice/MQTT bridge launches the agent in oneshot mode with a namespaced skill selector and the query passed via the query flag. The malformed probe had used a bare, non-namespaced selector and the wrong flag — which loads a different toolset or none of the stdio toolsets . The shape that matters: WRONG — ad-hoc probe: bare/non-namespaced selector, wrong flag. Loads a different profile; stdio toolsets never attach. <runtime chat -m <model -z "use the new tool" RIGHT — the exact production invocation: namespaced skill selector profile/<skill + the real query flag, oneshot. <runtime chat -Q -s profile/<skill -q "use the new tool" Run the right one and the agent calls the tools on the first try — no restart, no file-reading workaround: tool get active alarms → none active tool find equipment "Bellmarine" → bellmarine-ddw-10 ... rated temperature zones: ... The fix was a one-line change to the invocation. The server never moved. The one rule that would have short-circuited all of it: don't restart a daemon to "refresh tools" before the server is proven stale. Prove it with a standalone mcp test / Inspector run first. The order is the whole lesson: server standalone → invocation exact production form → only then the runtime. Don't bounce the daemon until the first two are ruled out. This came out of wiring MCP servers into a ship's-computer agent for an all-electric charter catamaran — SignalK, vessel-knowledge, and tide/weather tools behind a local LLM. When the agent said it couldn't see them, the servers had been fine all along; the probe was wrong. The MCP servers behind it are open source: github.com/sailingnaturali/signalk-mcp https://github.com/sailingnaturali/signalk-mcp .