# Making a fleet of self-hosted LLM agents trustworthy

> Source: <https://dev.to/defilan/making-a-fleet-of-self-hosted-llm-agents-trustworthy-49e4>
> Published: 2026-06-14 18:26:35+00:00

Originally published at llmkube.com/blog/making-self-hosted-llm-agents-trustworthy. Cross-posted here for the dev.to audience.

Running a single local LLM node is a solved problem. You write an InferenceService, the operator schedules it, llama.cpp or MLX serves it, and you get an OpenAI-compatible endpoint. We have been doing that for months.

Running a *fleet* of them is where it stops being easy. My fleet is heterogeneous on purpose: CUDA pods in the cluster, and Apple Silicon Macs sitting off-cluster on the homelab network, each one running two separate agents (one for inference, one for the agentic coding harness). The day I shipped 0.8.4 to that fleet, I learned exactly how it does not scale.

I updated each Mac by hand. The control plane had no idea what version any agent was running. And the launchd reload I used to restart an agent was a silent no-op on an already-loaded service, so the old binary kept running while I believed I had updated it. I found that out by hand-inspecting a process tree. Three machines made it annoying. Thirty would make it impossible, and the whole pitch for sovereign, on-prem AI is that you run a lot more than three.

So the last stretch of work on LLMKube was not about a faster runtime or a bigger model. It was about making the fleet *trustworthy*: able to update itself safely, and unable to lie to the control plane about its own state. Here is what that took.

The fix is a new cluster-scoped CRD, `AgentRelease`, and a self-update path in the agents themselves. You describe the release you want once, the operator rolls it out, and the agents pull and apply it. The design borrows directly from prior art that already solved this for Kubernetes nodes: Rancher's system-upgrade-controller, k0s autopilot's per-platform SHA-256 staging, and Teleport's outbound-only poll model.
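To make the pull side concrete, here is a minimal sketch of the decision an agent could make on each outbound poll: given a release, do nothing unless it targets this agent, is approved, differs from the running version, and has an artifact for this platform. The `Release` and `Artifact` types, the `plan_update` helper, and the `system/machine` platform key format are all hypothetical names for illustration, not LLMKube's actual schema or API.

```python
import platform
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Artifact:
    url: str      # where to download the binary from
    sha256: str   # expected hex digest of the download


@dataclass(frozen=True)
class Release:
    # Hypothetical in-memory view of a release; field names are assumptions.
    agent: str
    version: str
    approved: bool
    artifacts: Dict[str, Artifact]  # platform key -> artifact


def platform_key(system: Optional[str] = None, machine: Optional[str] = None) -> str:
    """Build a key like 'darwin/arm64' for the current (or given) platform."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    return f"{system}/{machine}"


def plan_update(release: Release, agent: str, running_version: str,
                key: str) -> Optional[Artifact]:
    """Return the artifact to fetch, or None if there is nothing to do."""
    if release.agent != agent:
        return None   # release is for a different agent
    if not release.approved:
        return None   # nothing moves without an explicit approval
    if release.version == running_version:
        return None   # already on the requested version
    return release.artifacts.get(key)  # None if no build for this platform
```

Because the agent only ever reads the release and decides locally, nothing in this shape requires the control plane to open a connection to the node.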

The properties that make it safe to leave running:

- `AgentRelease` names the agent, the version, and the per-platform artifacts (URL plus SHA-256). Nothing moves until a human flips `approved: true`. The approved CR is the trust anchor.
- The agent flips the `current` symlink atomically, and keeps a `previous` symlink for a one-command rollback. A bad checksum leaves the running version untouched.

The end state is that a release I cut becomes a one-line `kubectl apply` and an approval, instead of an afternoon of SSH. I proved the whole loop on a live node: publish a version, apply the `AgentRelease`, watch it sit at `AwaitingApproval`, approve it, and watch the node drain, download, verify, flip, restart onto the new binary, and report back, the rollout closing out at `Succeeded`. The first one is still a manual hop (an agent on the old, unaware binary cannot update itself to the version that teaches it how), but every release after that is hands-off.
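The verify and flip steps are the part where a mistake would strand a node, so here is a rough sketch of how they can be done safely on a POSIX filesystem. It checks the SHA-256 before touching anything, swaps `current` by renaming a temporary symlink over it, and records the old target in `previous`. The directory layout (`versions/<version>/agent`) and the function names are assumptions for illustration; draining, downloading, and restarting are left out, and this is not LLMKube's actual code.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str) -> str:
    """Hex SHA-256 of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def point_symlink(link: str, target: str) -> None:
    """Repoint `link` at `target` without a window where `link` is missing."""
    tmp = link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)
    os.replace(tmp, link)  # rename over the old symlink; atomic on POSIX


def activate(root: str, version: str, staged_file: str, expected_sha256: str) -> bool:
    """Install a staged download as `version` and flip `current` to it."""
    if sha256_of(staged_file) != expected_sha256:
        return False  # bad checksum: the running version is left untouched
    version_dir = os.path.join(root, "versions", version)
    os.makedirs(version_dir, exist_ok=True)
    # Assumes the staged file lives on the same filesystem as `root`.
    os.replace(staged_file, os.path.join(version_dir, "agent"))
    current = os.path.join(root, "current")
    if os.path.islink(current):
        # Remember where we came from so rollback is a single call.
        point_symlink(os.path.join(root, "previous"), os.readlink(current))
    point_symlink(current, version_dir)
    return True


def rollback(root: str) -> bool:
    """Point `current` back at whatever `previous` refers to."""
    previous = os.path.join(root, "previous")
    if not os.path.islink(previous):
        return False
    point_symlink(os.path.join(root, "current"), os.readlink(previous))
    return True
```

The design choice worth noting is the ordering: the checksum gate comes first, and `current` is only ever replaced by a rename, so a crash at any point leaves either the old link or the new one, never neither.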

An auto-updating fleet that lies about its health is worse than a manual one. So alongside the update path, a batch of less glamorous reliability work, the "trustworthy fleet" milestone, had to land.

Specs are validated at `kubectl apply` time, so an invalid spec is rejected at the door instead of failing confusingly three steps later when a task gets dispatched to it.

None of these are features you would put on a billboard. They are the difference between a demo and something you would leave pointed at production hardware.
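As an illustration of what rejecting a spec at the door can look like, here is a small validation sketch. It uses the release fields described earlier (agent, version, per-platform URL plus SHA-256) as a stand-in, because which resources and rules LLMKube actually validates at admission is not spelled out here. The field names, the `validate_release_spec` helper, and the https-only rule are all assumptions.

```python
import re
from typing import List
from urllib.parse import urlparse

# A SHA-256 digest is 64 lowercase hex characters.
SHA256_HEX = re.compile(r"[0-9a-f]{64}")


def validate_release_spec(spec: dict) -> List[str]:
    """Return a list of human-readable errors; empty means the spec is accepted."""
    errors = []
    if not spec.get("agent"):
        errors.append("agent: required")
    if not spec.get("version"):
        errors.append("version: required")
    artifacts = spec.get("artifacts") or {}
    if not artifacts:
        errors.append("artifacts: at least one platform is required")
    for key, artifact in artifacts.items():
        url = urlparse(artifact.get("url", ""))
        # Requiring https is an assumption made for this sketch.
        if url.scheme != "https" or not url.netloc:
            errors.append(f"artifacts[{key}].url: must be an https URL")
        if not SHA256_HEX.fullmatch(artifact.get("sha256", "")):
            errors.append(f"artifacts[{key}].sha256: must be 64 lowercase hex characters")
    return errors
```

Collecting every error instead of stopping at the first means one rejected `kubectl apply` tells you everything that needs fixing.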

Here is the honest build-in-public bit, and the reason I trust this work more than I would trust a green test suite alone.

When I ran the very first live self-update against a real node, it did not engage. The agent logged that self-update was disabled because it was "not running from a managed install root", which was wrong: it was running from exactly that root. The detection compared the running binary's resolved path against the literal `current/` symlink path, but resolving the binary's path followed the symlink to the real versioned directory, so the two could never match. The unit test had passed for two reasons: it fed the check an unresolved path that never happens in production, and it cached its answer once, forever, so it could not have noticed anyway. The feature had quietly disabled itself on every real install, and only dogfooding the actual rollout surfaced it. The fix was small. Finding it required running the thing for real.
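The mismatch is easy to reproduce in a few lines. The sketch below builds a throwaway install root with a `current` symlink, then runs a check shaped like the buggy one (resolved binary path against the literal `current/` path) next to one that resolves both sides before comparing. The function names, the `versions/1.0.0` layout, and the specific repair are assumptions for illustration; the actual fix in LLMKube may look different.

```python
import os
import tempfile
from typing import Tuple


def is_managed_buggy(resolved_exe: str, root: str) -> bool:
    # Compares an already-resolved path against the unresolved current/ path.
    # Resolution followed the symlink, so the prefix can never match.
    return resolved_exe.startswith(os.path.join(root, "current") + os.sep)


def is_managed_fixed(exe: str, root: str) -> bool:
    # Resolve both sides, then ask whether the binary lives under the
    # directory that current/ points at.
    resolved_exe = os.path.realpath(exe)
    resolved_current = os.path.realpath(os.path.join(root, "current"))
    return os.path.commonpath([resolved_exe, resolved_current]) == resolved_current


def demo() -> Tuple[bool, bool]:
    """Return (buggy result, fixed result) for a binary launched via current/."""
    with tempfile.TemporaryDirectory() as tmp:
        root = os.path.realpath(tmp)  # the temp dir itself may sit behind a symlink
        version_dir = os.path.join(root, "versions", "1.0.0")
        os.makedirs(version_dir)
        open(os.path.join(version_dir, "agent"), "w").close()
        os.symlink(version_dir, os.path.join(root, "current"))
        launched = os.path.join(root, "current", "agent")
        # What a process typically sees for its own path: symlinks resolved.
        resolved = os.path.realpath(launched)
        return is_managed_buggy(resolved, root), is_managed_fixed(launched, root)
```

A unit test that passes `launched` straight into the buggy check would go green, which is the trap: the input has to be the resolved path that production actually produces.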

Then there was the end-to-end test. I wrote it specifically to catch install-path bugs that unit tests cannot see, and it caught one on its first CI run: a task reached `Scheduled` and then stalled, because the agent was watching one namespace while the task lived in another. The scheduler assigned the work; the agent never saw it. That is exactly the class of bug a real apiserver surfaces and a mock does not. The test earned its place before it had even merged.

I am not going to pretend the rest of the cycle was clean either. Pinning a webhook's TLS certs the simple way tripped a CI script that had been quietly passing a giant blob through an environment variable, which works on macOS and dies on Linux. A glob model pattern that routed correctly one way compiled to a literal that matched nothing the other way, while reporting itself healthy. Every one of these passed review or local checks and got caught by the next layer: a full lint, a real cluster, an adversarial second look. That layering is the point. The goal was never zero bugs. It was no bug that survives to a node you cannot reach.
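The glob bug is worth a tiny illustration, because the failure is silent by construction. In the sketch below the same pattern string is compiled two ways: as a glob, and as an escaped literal. The literal compiles fine and raises nothing, it simply never matches a real model name. A cheap guard is to check a compiled pattern against the known model names before reporting healthy. The model names and helper functions are made up for the example and are not LLMKube's routing code.

```python
import fnmatch
import re
from typing import Iterable, Pattern


def compile_as_glob(pattern: str) -> Pattern[str]:
    """Treat '*' and '?' as wildcards, the way the pattern was meant."""
    return re.compile(fnmatch.translate(pattern))


def compile_as_literal(pattern: str) -> Pattern[str]:
    """Escape everything, so '*' only matches an actual asterisk."""
    return re.compile(re.escape(pattern) + r"\Z")


def pattern_is_live(compiled: Pattern[str], known_models: Iterable[str]) -> bool:
    """Health guard: a routing pattern that matches no known model is suspect."""
    return any(compiled.match(name) for name in known_models)
```

With hypothetical models `qwen3-coder-30b` and `llama-3.1-8b`, the glob form of `qwen3-*` passes the guard and the literal form fails it, which turns a quiet misroute into a visible health failure.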

It is tempting to think the hard problem in self-hosted AI is the inference: the quantization, the GPU memory, the tokens per second. Those are hard, and we spend plenty of time there. But the thing that actually keeps people on a managed cloud is not raw capability. It is that someone else runs the fleet. Updates land, dead nodes get pulled, bad config gets rejected, and you do not think about any of it.

If sovereign AI is going to be a real alternative and not a hobby, it has to offer that same "do not think about it" property while keeping the data and the models on hardware you own. A fleet you have to babysit by hand is not sovereign in any way that matters; it is just someone else's operational burden moved onto you. The work in this post is the unglamorous half of closing that gap: a fleet that updates itself safely, tells the truth about its own health, and refuses to accept a configuration that would break it.

That is the control plane I want for local AI at scale. It is in LLMKube now, it is open source, and it caught its own bugs on the way in.

*LLMKube is a Kubernetes operator for self-hosted LLM inference: CUDA, Apple Silicon Metal, multi-GPU, and a heterogeneous fleet under one control plane. Apache 2.0, github.com/defilantech/LLMKube.*
