How I made AI agents safe to run on real infrastructure


[ARTICLE · art-21255] src=dev.to ↗ pub=2026-06-04T07:07Z topic=ai-agents verified=true sentiment=↑ positive


A developer built Cmdop, an agent-to-infrastructure platform that runs LLM agents on live systems, and solved the reliability problem by implementing a verification loop that scores every tool call before, during, and after execution. The system treats agent output as something to verify, score, and constrain rather than execute directly, using structured-output contracts, full tracing, and side-effect scoring to catch actions that complete tasks while also deleting critical data. This approach enabled the developer to safely widen agent autonomy from human-in-the-loop to autonomous operation by measuring reliability through evaluation data rather than relying on model confidence.



Everyone can get an LLM agent to do something impressive in a demo. Far fewer can get one to act on live infrastructure — your machines, terminals, files, deployments — without occasionally doing something catastrophic.

That gap is the whole problem. And it's not a model problem. The model is the easy part now. The hard part is making an agent's actions trustworthy enough that you'd let it run unattended against systems that matter.

I built Cmdop around exactly this problem. Here's the architecture and, more importantly, the reliability loop that turned it from "a demo that works once" into a runtime I actually trust.

Cmdop is an agent → gRPC → server → SDK platform. Agents act on remote machines through a multi-agent runtime: they hand off to each other, call tools, and operate under human-in-the-loop control where it matters. Developers integrate the primitives directly through Node.js, Python, and React SDKs, and the whole thing runs thousands of concurrent agent sessions over persistent gRPC/WebSocket streams.

None of that is the interesting part. Plenty of systems can route an LLM's output to a shell. The interesting part is what happens around every single tool call.

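A plausible-looking action is the most dangerous thing an agent produces. The commands rm -rf ./logs and rm -rf /logs look almost identical and differ by a catastrophe: the first removes a logs directory under the current working directory, the second targets /logs at the filesystem root. An agent that's "usually right" is, on real infrastructure, a system that will eventually, and confidently, take down production.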

So I stopped treating agent output as something to execute and started treating it as something to verify, score, and constrain — before, during, and after execution.

Every tool call in Cmdop runs through the same loop:

Structured-output contract, validated before execution. The agent doesn't emit free text that I parse hopefully. It emits a structured contract — intended action, parameters, expected effect — and that contract is validated against a schema before anything runs. Malformed or out-of-policy calls never reach the system.
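To make the contract idea concrete, here is a minimal sketch of a pre-execution check. It is not Cmdop's actual code: the action names, the SCHEMA table, the ToolCall fields, and the /srv/app policy root are all assumptions chosen for illustration. It checks the shape of the call first, then applies one policy rule (paths must stay inside an allowed root), which is the rule that separates ./logs from /logs.

```python
import posixpath
from dataclasses import dataclass

# Hypothetical schema: action name -> required parameters and their types.
SCHEMA = {
    "delete_path": {"path": str, "recursive": bool},
    "read_file": {"path": str},
}
ALLOWED_ROOT = "/srv/app"  # assumed policy boundary for this sketch


@dataclass(frozen=True)
class ToolCall:
    action: str
    params: dict
    expected_effect: str


def validate(call: ToolCall, cwd: str = ALLOWED_ROOT) -> list[str]:
    """Return violations; an empty list means the call may proceed."""
    spec = SCHEMA.get(call.action)
    if spec is None:
        return [f"unknown action: {call.action}"]
    errors = []
    for name, expected_type in spec.items():
        if name not in call.params:
            errors.append(f"missing parameter: {name}")
        elif not isinstance(call.params[name], expected_type):
            errors.append(f"wrong type for parameter: {name}")
    for name in sorted(set(call.params) - set(spec)):
        errors.append(f"unexpected parameter: {name}")
    if not call.expected_effect.strip():
        errors.append("expected_effect must not be empty")
    path = call.params.get("path")
    if isinstance(path, str):
        # Resolve lexically (no filesystem access) against the working directory.
        target = posixpath.normpath(posixpath.join(cwd, path))
        if target != ALLOWED_ROOT and not target.startswith(ALLOWED_ROOT + "/"):
            errors.append(f"path outside allowed root: {target}")
    return errors


safe = ToolCall("delete_path", {"path": "./logs", "recursive": True}, "remove rotated logs")
risky = ToolCall("delete_path", {"path": "/logs", "recursive": True}, "remove rotated logs")
print(validate(safe))   # []
print(validate(risky))  # ['path outside allowed root: /logs']
```

The path check here is purely lexical, so it does not follow symlinks; a real policy layer would likely need to resolve paths on the target machine as well.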

Full trace, logged. Prompt, tool call, result, latency, and which retry/failover path it took — all captured. You cannot improve what you cannot see, and agent failures are subtle: the run "succeeded" but did the wrong thing.
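A trace record for the fields listed above might look roughly like the following. This is a sketch under assumptions: TraceRecord, run_traced, and the "primary" / "retry" / "failover" labels are hypothetical names, and the sink is a plain list of JSON lines standing in for whatever log store is actually used.

```python
import json
import time
from dataclasses import asdict, dataclass


@dataclass
class TraceRecord:
    prompt: str
    tool_call: dict
    result: str
    ok: bool
    latency_ms: float
    path: str  # assumed labels: "primary", "retry", or "failover"


def run_traced(prompt, tool_call, execute, sink, path="primary"):
    """Run execute(tool_call) and append one JSON trace line to sink."""
    start = time.perf_counter()
    try:
        result, ok = str(execute(tool_call)), True
    except Exception as exc:
        # Failures are traced too; the caller decides whether to retry.
        result, ok = f"{type(exc).__name__}: {exc}", False
    latency_ms = (time.perf_counter() - start) * 1000
    record = TraceRecord(prompt, tool_call, result, ok, latency_ms, path)
    sink.append(json.dumps(asdict(record)))
    return record
```

Recording failures in the same format as successes keeps the later scoring step simple, because every run yields exactly one record regardless of outcome.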

Scored on the axes that matter. Not "did it return 200." Each run is scored on tool-call validity, task success, and — the one everyone forgets — unintended side-effects. The side-effect score is what catches the agent that completed the task and deleted something it shouldn't have.
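One way to picture the side-effect axis is as a diff between state before and after the run, minus the changes the contract said to expect. The sketch below assumes state can be summarized as a set of paths, which is a simplification; RunScore and score_run are hypothetical names, not part of any real SDK.

```python
from dataclasses import dataclass


@dataclass
class RunScore:
    call_valid: bool
    task_success: bool
    unintended: frozenset  # changes nobody asked for

    @property
    def clean(self) -> bool:
        return self.call_valid and self.task_success and not self.unintended


def score_run(call_valid: bool, before: set, after: set,
              expected_removed=frozenset(), expected_added=frozenset()) -> RunScore:
    """Score one run from before/after snapshots of observed paths."""
    removed = before - after
    added = after - before
    # The task succeeded if every expected change actually happened.
    task_success = expected_removed <= removed and expected_added <= added
    # Anything that changed beyond the expectation is an unintended side-effect.
    unintended = (removed - expected_removed) | (added - expected_added)
    return RunScore(call_valid, task_success, frozenset(unintended))


before = {"/srv/app/logs/a.log", "/srv/app/data/db.sqlite"}
after = set()
score = score_run(True, before, after, expected_removed={"/srv/app/logs/a.log"})
print(score.task_success)       # True
print(sorted(score.unintended)) # ['/srv/app/data/db.sqlite']
print(score.clean)              # False
```

The example run is exactly the case described above: the task completed, the call was valid, and the run still fails the score because a database file disappeared along with the logs.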

Guardrails + automatic retry/failover where evals expose brittleness. When the eval data showed a step was fragile, I didn't just log it — I added a structured guardrail and an automatic retry/failover path. Brittle steps became contained steps.
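A contained step could be sketched as a guardrail check followed by bounded retries and a single failover. Everything named here (run_with_failover, the guard callback returning a list of violations, the retry count, the backoff) is an assumption for illustration, and blind retries are only reasonable for calls that are safe to repeat.

```python
import time


def run_with_failover(call, primary, fallback, guard, retries=2, backoff_s=0.0):
    """Return (result, path) where path is 'primary', 'retry', or 'failover'."""
    violations = guard(call)
    if violations:
        # The guardrail blocks the call before any execution is attempted.
        raise PermissionError("; ".join(violations))
    for attempt in range(retries + 1):
        try:
            return primary(call), ("primary" if attempt == 0 else "retry")
        except Exception:
            time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    try:
        return fallback(call), "failover"
    except Exception as exc:
        raise RuntimeError("primary and failover both failed") from exc
```

Returning the path taken alongside the result lets the trace record which route produced the answer, so a step that only ever succeeds on failover shows up in the eval data instead of hiding behind a success.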

The payoff wasn't a dashboard. It was autonomy I could widen safely.

Before the loop, every meaningfully risky action needed a human in front of it, because I had no principled way to know which actions were safe to let run. After the loop, I had data: which tool calls were reliably valid, which tasks completed cleanly, which steps produced side-effects. That let me move actions from "human-in-the-loop required" to "autonomous" one measured step at a time, instead of guessing.
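The promotion decision itself can be expressed as a small rule over per-action eval statistics. The sketch below is illustrative only: ActionStats, the 200-run minimum, and the 99% thresholds are assumed values, not figures from Cmdop.

```python
from dataclasses import dataclass


@dataclass
class ActionStats:
    runs: int
    valid: int             # runs whose tool call passed validation
    succeeded: int         # runs that completed the task
    side_effect_runs: int  # runs with any unintended side-effect


def autonomy_mode(stats: ActionStats, min_runs: int = 200, min_rate: float = 0.99) -> str:
    """Decide whether an action type still needs a human in front of it."""
    if stats.runs < min_runs:
        return "human-in-the-loop"  # not enough evidence yet
    if stats.side_effect_runs > 0:
        return "human-in-the-loop"  # any observed side-effect blocks promotion
    if stats.valid / stats.runs < min_rate or stats.succeeded / stats.runs < min_rate:
        return "human-in-the-loop"
    return "autonomous"


print(autonomy_mode(ActionStats(500, 500, 498, 0)))  # autonomous
print(autonomy_mode(ActionStats(50, 50, 50, 0)))     # human-in-the-loop
print(autonomy_mode(ActionStats(500, 500, 500, 1)))  # human-in-the-loop
```

Treating a single side-effect as disqualifying is a deliberately conservative choice for this sketch; a real policy might weigh severity or recency instead.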

That's the real lesson, and it generalizes far beyond Cmdop: agent autonomy is earned through evaluation, not granted by confidence. The teams shipping agents into production aren't the ones with the best prompts — they're the ones who instrumented and scored agent behavior until they knew, with data, where autonomy was safe.

The interesting frontier in agentic AI right now isn't bigger models — it's the reliability layer: evals, guardrails, observability, the harness around the agent. That's the part that decides whether agents stay demos or become infrastructure. It's also, conveniently, the part I find most interesting to build.

If you're working on agent reliability, agent platforms, or making AI safe to run against real systems — I'd love to compare notes. I'm Mark K. (Igor Korotin), a Principal Product Architect / Technical CPO building applied-AI platforms. More at cmdop.com and djangocfg.com, code at github.com/markolofsen.


Read original on dev.to → dev.to/markin/how-i-made-ai-agents-safe-to-run-o…

