# OpenTelemetry Tells You What Your Agent Did. Not Whether It Was OK.

> Source: <https://dev.to/michaeltuszynski/opentelemetry-tells-you-what-your-agent-did-not-whether-it-was-ok-1gmo>
> Published: 2026-06-30 22:00:49+00:00

OpenTelemetry's GenAI conventions will tell you your agent called Claude, spent 1,843 input tokens, took 900 milliseconds, and returned without an error. They will not tell you the answer cited zero sources, that the loop spun nineteen times before it gave up, or that the model never saw the guardrail that was supposed to stop it. Those are the facts that decide whether an agent is safe to run unattended. No standard layer captures them.

So I built a small one. [ballast](https://github.com/michaeltuszynski/ballast) sits on top of OpenTelemetry: OTel tells you what happened; ballast tells you whether it was acceptable.

OTel already owns the telemetry substrate: provider, model, token counts, latency, status. That problem is solved, and solved as a standard. ballast doesn't touch it. What it adds is the reliability layer, expressed as `ballast.*` attributes and events riding on the same `gen_ai.*` spans.
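As a rough illustration of that split, the attributes on a single span might look like the object below. The `gen_ai.*` keys follow the OTel GenAI semantic conventions and `ballast.trace.completeness` is named elsewhere in this post. Every other `ballast.*` key, and every value, is a hypothetical placeholder rather than ballast's actual schema.

```js
// Illustrative only: one span's attributes, grouped by which layer owns them.
const spanAttributes = {
  // Telemetry substrate, owned by the OTel GenAI conventions.
  // (Newer semconv versions name the first key gen_ai.provider.name.)
  'gen_ai.system': 'anthropic',
  'gen_ai.request.model': 'claude-sonnet-4-5',
  'gen_ai.usage.input_tokens': 1843,
  'gen_ai.usage.output_tokens': 212, // made-up value for the example
  // Reliability layer. Only the first key below appears in this post.
  'ballast.trace.completeness': 'partial',
  'ballast.guardrail.evidence.passed': false, // hypothetical key
  'ballast.loop.iterations': 19, // hypothetical key
};

// Split a flat attribute bag by namespace prefix.
function splitByNamespace(attrs) {
  const out = { telemetry: {}, reliability: {} };
  for (const [key, value] of Object.entries(attrs)) {
    if (key.startsWith('gen_ai.')) out.telemetry[key] = value;
    else if (key.startsWith('ballast.')) out.reliability[key] = value;
  }
  return out;
}

const layers = splitByNamespace(spanAttributes);
```

The point of the sketch is that both layers share one span: nothing in the reliability half requires a second trace or a second exporter.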

You instrument an existing call by wrapping it. Nothing about your stack changes:

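``` js
import { wrap, evidenceGuardrail } from '@michaeltuszynski/ballast';

const answer = await wrap(
  // Span identity: operation name, provider, and model.
  { name: 'gen_ai.chat', system: 'anthropic', model: 'claude-sonnet-4-5' },
  async (ctx) => {
    // callYourModel is a placeholder for whatever model call you already make.
    const res = await callYourModel();
    // Record token counts and cost on the span.
    ctx.setUsage(res.inputTokens, res.outputTokens, res.costUsd);
    // Attach the evidence guardrail's result for the response text.
    ctx.guardrail(evidenceGuardrail(res.text));
    return res.text;
  },
);
```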

`wrap` opens a real OTel span, lets you record usage and reliability results onto it, and exports a protocol-conformant record to a `runs.jsonl`. Then `ballast runs` reads it back.
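Going by the `.jsonl` extension, the store is presumably JSON Lines: one JSON object per line. Here is a minimal sketch of reading such text back, assuming a made-up record shape with `name` and `attributes` fields. The real record format and the real `ballast runs` output are defined by the repo, not by this snippet.

```js
// Parse JSON Lines text into an array of records, skipping blank lines.
function parseJsonl(text) {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line));
}

// Hypothetical summary row per record; field names are assumptions.
function summarize(records) {
  return records.map((r) => ({
    name: r.name,
    model: r.attributes['gen_ai.request.model'],
    completeness: r.attributes['ballast.trace.completeness'] ?? 'unknown',
  }));
}

// Two sample records separated by a blank line, standing in for a file.
const sample = [
  JSON.stringify({
    name: 'gen_ai.chat',
    attributes: {
      'gen_ai.request.model': 'claude-sonnet-4-5',
      'ballast.trace.completeness': 'partial',
    },
  }),
  '',
  JSON.stringify({
    name: 'gen_ai.chat',
    attributes: { 'gen_ai.request.model': 'claude-sonnet-4-5' },
  }),
].join('\n');

const rows = summarize(parseJsonl(sample));
```

In Node, the text would come from something like `fs.readFileSync('runs.jsonl', 'utf8')` instead of the inline sample.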

The first design had ballast defining its own trace schema — provider, model, tokens, the works. I had a second model review the spec before I wrote a line of code, and it caught the mistake in one paragraph: OpenTelemetry already standardizes all of that. Reinventing it would have put ballast in a fight it can't win against a convention with a working group behind it.

So the protocol got rebuilt on the OTel GenAI semantic conventions, and ballast's surface shrank to the one thing nobody standardizes: reliability semantics. That review is why the repo exists in the shape it does. The lesson generalizes — the substrate is rarely the greenfield you assume it is.

ballast is narrow, and staying narrow is the point.

It's not an agent framework. No chains, no memory, no tool execution, no orchestration. Bring your own runtime — Claude Code, the raw SDK, LangChain — and wrap the calls. The moment a reliability layer grows an orchestration engine, it stops being a reliability layer.

It's not a tracing backend. If you only need raw LLM telemetry, use OpenTelemetry, Langfuse, or OpenLLMetry directly. ballast emits OTel; it doesn't replace your collector.

And it doesn't pretend to see everything. Wrapping arbitrary agent code means hidden retries, streaming partials, and tool calls can slip past the instrumentation. A reliability layer that reports an incomplete trace as complete is worse than no layer: it manufactures confidence. So every span carries a `ballast.trace.completeness` flag, and each adapter declares what it can actually observe. "Partial" is a first-class answer.
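One way to picture that declaration, as a sketch under assumptions: an adapter lists the kinds of events it can observe, and completeness follows from whether everything a run depends on was observable. The adapter names, the capability names, the `complete`/`partial` values, and `deriveCompleteness` are all hypothetical. Only the `ballast.trace.completeness` attribute name comes from the text above.

```js
// Hypothetical capability declarations for two adapters.
const adapters = {
  rawSdk: { observes: ['model_call', 'tool_call', 'retry'] },
  blackBoxCli: { observes: ['model_call'] },
};

// A trace counts as complete only if the adapter can observe every
// event kind the run depends on; otherwise it is reported as partial,
// along with the kinds that could not be seen.
function deriveCompleteness(adapter, requiredKinds) {
  const observable = new Set(adapter.observes);
  const missing = requiredKinds.filter((kind) => !observable.has(kind));
  return {
    'ballast.trace.completeness': missing.length === 0 ? 'complete' : 'partial',
    missing,
  };
}

const required = ['model_call', 'tool_call', 'retry'];
const fromSdk = deriveCompleteness(adapters.rawSdk, required);
const fromCli = deriveCompleteness(adapters.blackBoxCli, required);
```

Reporting the `missing` list alongside the flag keeps "partial" actionable: a reader can see which blind spots to account for before trusting the trace.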

The contracts-guardrails-bounded-loops discipline isn't theoretical. It's what kept agent platforms I've run in production from drifting — the difference between an agent that ships a clean statement of work and one that quietly invents a clause nobody catches until a customer does. ballast is that discipline pulled out of internal tooling and rebuilt as something standards-based and small enough to drop into anyone's stack.
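To make "guardrails" and "bounded loops" concrete, here is a minimal sketch. It is not ballast's implementation: `hasCitation`, `runBounded`, and the URL-based citation check are hypothetical stand-ins for whatever checks a real deployment would use.

```js
// Hypothetical guardrail: passes only if the text cites at least one URL.
function hasCitation(text) {
  return /https?:\/\/\S+/.test(text);
}

// Call `step` until the guardrail passes or the iteration budget runs out.
// The loop can never run more than `maxIterations` times.
function runBounded(step, { maxIterations }) {
  for (let i = 1; i <= maxIterations; i++) {
    const text = step(i);
    if (hasCitation(text)) {
      return { ok: true, iterations: i, text };
    }
  }
  // Budget exhausted: report the failure instead of spinning further.
  return { ok: false, iterations: maxIterations, text: null };
}

// A stub agent step that only cites a source on its third attempt.
const succeeded = runBounded(
  (i) => (i < 3 ? 'No sources yet.' : 'See https://example.com/spec'),
  { maxIterations: 5 },
);

// A stub agent step that never cites anything.
const gaveUp = runBounded(() => 'Trust me.', { maxIterations: 19 });
```

A real agent step would be asynchronous; the bound works the same way with `await` inside the loop. Either way, the iteration count and the guardrail outcome are the facts worth recording on the span.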

This is the MVP: a TypeScript SDK, the protocol, a local JSONL store, and a CLI viewer. The Python SDK and eval-as-gates — running a prompt across several models and gating on the result — are the next slices, and the schema already carries them.
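Since eval-as-gates hasn't shipped, the following is only a guess at the general shape of the idea: run one prompt through several models and pass the gate only if every result clears a check. `evalGate`, the stub models, and the check are all hypothetical and say nothing about how ballast will implement it.

```js
// Hypothetical gate: the prompt passes only if every model's output
// satisfies `check`. Per-model results are kept for inspection.
function evalGate(prompt, models, check) {
  const results = models.map((model) => {
    const output = model.run(prompt);
    return { model: model.name, passed: check(output) };
  });
  return { passed: results.every((r) => r.passed), results };
}

// Stub models standing in for real providers.
const stubModels = [
  { name: 'model-a', run: () => 'See https://example.com/doc for details.' },
  { name: 'model-b', run: () => 'Trust me.' },
];

// Example check: the output must contain a URL.
const citesUrl = (text) => /https?:\/\/\S+/.test(text);

const gate = evalGate('Summarize the spec.', stubModels, citesUrl);
```

Here the gate fails because one model returned an uncited answer, which is the behavior a gate exists to surface before a prompt change ships.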

The repo is [MIT, thirty tests, built on OTel](https://github.com/michaeltuszynski/ballast). Clone it, run `npm run example`, and watch a span land in `ballast runs`. Then wrap one of your own calls and see what your traces haven't been telling you.