Minimum viable AI observability: what to set up after shipping your first AI feature

wpnews.pro

What is AI observability? #

** AI observability** (often used interchangeably with

LLM observability and

LLM analytics) is the practice of monitoring what's happening inside the AI features of your app: the prompts you send, the responses you get back, how long calls take, how much they cost, and whether the outputs are any good.

Traditional APM tools (Datadog or New Relic weren't built for this. They can tell you your API call to OpenAI succeeded with a 200, but they can't tell you the response was a hallucination, or that your prompt is costing 3x what it should be.

AI observability fills that gap.

The core building blocks are:

Traces– the full record of an LLM call (or chain of calls) including inputs, outputs, model, parameters, and timing** Cost and token tracking**– how much each call costs, broken down by model, user, or feature** Evaluations (evals)– ways to score thequalityof outputs, not just whether they succeededPrompt management**– versioning and iterating on prompts without re-deploying code** Feedback loops**– capturing user signals (thumbs up/down, retries, abandonment) on AI outputs

You don't need all of these on day one. Most MVPs need two.

Why does my MVP need observability at all? #

Three reasons, in order of how painful they are when you skip them:

AI features burn money and can be sneaky about it–trust us,. A single retry loop bug or a prompt that doubled in length can drain a daily API budget overnight. You won't see this in your traditional product analytics, but you'll see it in your monthly bill.we knowAI failures look like working code. Your function returned a string. The string is wrong. There's no exception, no error log, no red flag in monitoring – just a user reading nonsense. You need a way to catch this that isn't "wait for a tweet dunking on your chatbot."You can't improve what you can't see. When a user says "the AI did something weird," "weird" is not a debuggable input. Without traces, you're guessing.

The good news is that the first version of AI observability is easy to set up, and the free tier of most tools will cover you well past your first thousand users.

What to set up on day one #

Three things, all of which can be wired up in an afternoon (especially if you're using PostHog, since the Wizard can do all the heavy lifting for you).

Install PostHog with one command

Paste this into your terminal and make AI do all the work.

1. Tracing every LLM call

The foundation of everything else. For each call, capture the full input (prompt + context + parameters), the full output, the model used, the latency, and the token counts.

This is the data you'll use to debug literally everything else: cost questions, latency questions, quality complaints, weird outputs.

If you remember nothing else from this guide, set up tracing.

Most LLM observability tools handle this with a small SDK wrapper around your OpenAI or Anthropic client. Initialize the wrapped client once, use it like the normal SDK, and every call is logged automatically.

2. Cost tracking

Once tracing is set up, cost tracking comes nearly for free – your tool calculates cost per call from the token counts and model. What you actually need to do is set up alerts so you know when something goes sideways.

At minimum:

An alert when daily spend exceeds your expected baseline by some multiplier (for example, 2x your 7-day rolling average)
A breakdown by model, so you can spot when something is calling gpt-5.5

instead of`gpt-5.4-mini`

for example - A breakdown by user, so one power user (or one abuse case) doesn't tank your margins

Take it from us: the risk of overpaying isn't hypothetical.

When the PostHog team dug into the cost of our own AI agent (the PostHog Wizard), we found it was burning tokens on three things we hadn't expected: re-reading files it had already read, redoing work after context compactions, and spawning subagents that didn't reliably terminate. The only reason we caught it was because the cost data was sitting right there in AI observability. Without it, we would've just paid the oversized bill and moved on.

3. Error capture

LLM calls fail in more interesting ways than regular API calls. Beyond the usual suspects (rate limits, timeouts, network errors), you'll see things like:

Tool-call failures– the agent tried to call a tool, the tool errored or returned malformed data, and the loop may or may not have recoveredAgent loops that don't terminate– the model keeps calling itself, spawning subagents, or refusing to exit** Structured output parsing failures**– the model returned valid text but invalid JSON when you expected structured data** Token/context limit errors**– the prompt or conversation overflowed, often silently** Content policy violations**– the provider blocked the response, your code needs to handle it gracefully

You want all of these surfaced. If you already have error tracking set up for the rest of your app, make sure LLM errors are flowing into it.

If you don't, add it. The combination of traces + errors + logs is how you debug AI features without losing your mind. Errors tell you something broke; traces tell you what the agent was doing when it broke; logs (tool calls, intermediate state, decisions) tell you why.

What to add later #

The following are where most MVP-stage teams over-invest. These are all legitimate practices, they just might not be the right thing to spend your time on when you have less than a thousand users and one AI feature.

Complex eval suites

You don't need 15 evaluation criteria across five different dimensions of quality, or an ensemble of LLM judges cross-checking each other. You probably don't even need automated evals yet.

Evals are essential eventually, but at the MVP stage they tend to be premature. Why? Because:

You don't have enough traffic to make automated scoring statistically meaningful
You don't know what "good" looks like for your product yet, so any rubric you write could be wrong
You can do better than evals at this scale by reading your outputs. Spend 30 minutes a week looking at random traces. You'll catch more than any rubric will.

Add automated evals once you have enough generations that you can't read them all (usually a few hundred per day).

When you're ready, our beginner's guide to testing AI agents walks through what a minimal eval suite actually looks like – a small dataset from real user queries and recent bugs, one or two cheap code-based evaluators, one LLM-as-a-judge for a subjective criterion, and a regular trace review ritual.

A/B testing prompt variants

Prompt experimentation is great, we just shipped it in beta – you can A/B test up to 10 prompt versions side by side, with cost, latency, and eval pass rate as built-in metrics.

There's even an agent prompt you can paste into Cursor or Claude Code that wires the whole experiment up for you; we're already running our own experiments on it.

That said: it's a feature to grow into, not a feature to start with. Prompt experimentation is pointless before you have enough traffic to detect a difference between variants. If you have 50 users a day, a 10% lift in quality won't reach statistical significance for months.

Until you have meaningful traffic: just ship the better-feeling prompt and move on. Track outputs so you can revert if something breaks. Come back to it once you're seeing a few hundred generations a day and you have a real hypothesis worth testing.

Heavy prompt management infrastructure

If only one person on the team writes prompts, and you ship prompt changes through your normal code deploys, you don't need a separate prompt management system yet. A prompts.py file in your repo is fine.

Prompt management becomes valuable once (a) multiple people are editing prompts, (b) you want non-engineers to iterate on prompts without a deploy, or (c) you want to version and roll back prompts independently of code. None of those usually apply on day one.

Custom-built observability infra

Don't do it. You'll spend a month building something half as good as what already exists, then spend the rest of your valuable free time maintaining it instead of shipping your product.

You can always migrate later, but realistically you won't need to (also, migrations normally suck, so save yourself the headache).

There are free tiers everywhere – PostHog gives you 100K LLM events/mo free, Langfuse is free to self-host, Datadog includes 40K LLM spans/mo on its free tier... use one of them instead.

For a deeper dive into free and self-hostable options, see our comparison of [the best open source LLM observability tools].

Drift detection, governance dashboards, AI red-teaming pipelines

All cool features. All things you do not need at MVP stage.

File these mentally under "stuff to revisit when you have a real AI ops function."

A stage-by-stage cheat sheet #

As your product matures, the bar for observability rises. Here's the rough order to layer things in.

Stage 1: You have meaningful traffic (a few hundred generations a day) #

Add automated evals. Now you can't read every trace, and you need a way to spot regressions. Start with deterministic, code-based evals where you can – simple checks like "the agent called the right tool," "no forbidden keyword appeared in the output," "no error was raised," or Levenshtein distance against an expected response. They're cheap, fast, and don't add an LLM dependency to your test pipeline. Only reach for an LLM-as-a-judge when the criterion is genuinely subjective enough that code can't capture it (tone, helpfulness, hallucination detection). Run both on a sample of production traces.

Add user feedback collection. Even a simple thumbs up/down on AI outputs is gold for debugging. Pair that signal with the trace and you can quickly find patterns in what users hate – PostHog surveys handle this in-app and tie responses back to the same user.

** Add session replay.** Watching how users actually react to AI outputs reveals things no metric will. Did they retry? Edit? Copy-paste? Close the tab? See our guide on

the best session replay toolsif you don't already have this.

Stage 2: You have multiple AI features and a small team #

Move evals online. Your offline evals only cover inputs you defined. Online evals run against a sample of live production traces (5-10% is plenty) so you catch things your test set never imagined. Use cheap code-based evaluators for everything you can, and cheaper LLM judges for the subjective stuff.

Start a trace review ritual. The PostHog teams working on AI features run a weekly Traces Hour where they look at flagged traces (low feedback scores, error spikes, support tickets, outlier clusters) and turn the patterns into new evals. This is where the next generation of evaluators comes from.

Connect LLM data to product analytics. "Users who hit our AI feature retain at 2x" is the kind of insight that justifies the whole investment. To do this, your LLM observability needs to share a user model with your product analytics – which is one reason bundled tools tend to win at this stage.

Consider prompt management (if it fits). Once more than one person edits prompts, or you want non-engineers to iterate without a deploy, prompt management starts earning its keep. Many teams never need this – a prompts.py

file in your repo is fine if you're shipping prompt changes through code anyway.

Stage 3: You're scaling #

Add drift monitoring. Models change (especially if you're on the latest version of an API). Inputs change. Your evals need to keep up. Track quality scores over time and alert on regressions.

Add cost optimization workflows. Things like routing simpler requests to cheaper models, caching responses where you can, and identifying users or prompts that are eating budget.

Add governance. If you're selling to enterprise, you'll need things like PII redaction, data residency, audit logs, and human-in-the-loop review queues for high-risk outputs. This is where dedicated tools start to earn their keep.

The right tool for "day one" should also support most of "later" without needing to migrate – otherwise you're going to do this all again in six months.

For a deeper comparison of tools available, see our roundup of the best open source LLM observability tools. And if you're earlier in your product journey, our guide to the best analytics stack for vibe-coded apps covers the whole analytics stack, not just the AI parts.

Why PostHog works well for this #

A quick pitch, since this is our blog after all:

PostHog's AI observability covers all of the above in one place, with a free tier (100K events/mo) generous enough to get you well past MVP. You get:

with a small SDK wrapper around OpenAI, Anthropic, and other providersTracingCost, latency, and token tracking by model, user, and featurewith LLM-as-a-judge and human reviewEvaluationsalongside traces, so AI-specific failures (tool calls, parsing errors, loop termination) show up next to the calls that caused themError trackingwith versioning, fetch-at-runtime, and rollbackPrompt management– the same user IDs flow through, so you can correlate AI usage with retention, conversion, and churnConnected product analyticsalongside traces, so you can see exactly what the user was doing when the model misbehavedSession replayAn that lets you query traces, costs, and evals directly from Claude Code or Cursor while you're iteratingMCP server

If you want to take it for a spin, you can start free – no credit card needed. And if you'd rather skip the manual setup entirely, the PostHog Wizard handles instrumentation for you.

Paste one command into Cursor or Claude Code and it detects your framework, installs the right SDK, wraps your client, and gets traces, cost data, and errors flowing in a few minutes.

Install PostHog with one command

Paste this into your terminal and make AI do all the work.

Frequently asked questions #

How is AI observability different from regular APM? #

APM tools tell you whether your code is running and how fast. They can see that you made an API call to OpenAI and got a response back. They can't see what the model said, whether it was any good, or how much that specific call cost.

AI observability adds the missing layer: the content of the calls, the quality of the outputs, and the economics of LLM usage.

Do I need evals on day one? #

No. At MVP scale, manual review beats automated evals. Open your trace list once a day, read 10-20 random outputs, take notes. You'll learn more about what "good" means for your product in a week of doing this than you would from any rubric.

Once you can't keep up with the volume, that's the signal to set up automated evals.

How much should I expect to spend on observability as a solo founder? #

For most MVPs: nothing; even at a few thousand users, you can usually stay on a free or cheap plan. The thing that costs money is the LLM calls themselves – which is exactly why you want cost tracking from day one.

What if I'm just calling OpenAI or Anthropic directly without a framework? #

That's actually the easiest case. Most observability tools have a wrapper around the OpenAI/Anthropic SDK that captures everything automatically. You don't need LangChain or any other framework to get full observability.

Do I need separate tools for AI observability and product analytics? #

You can have separate tools, but you'll lose the most interesting insights – like whether your AI feature actually drives retention or conversion.

Bundled platforms (PostHog being the obvious example) share a user model across both, so you can ask questions like "do users who use the AI feature stick around longer?" without stitching data sources together.

When should I switch from a free tier to a paid plan? #

When you hit one of three triggers:

Your event volume exceeds the free tier
You need a feature that's gated to paid plans (often SSO, EU residency, longer retention)
You need support SLAs to ship to enterprise.

Otherwise, free tier might work fine well into post-PMF.

What's the difference between observability and evaluation? #

Observability is about what happened – the traces, costs, latency, errors. It's descriptive. Evaluation is about whether what happened was good – scoring outputs against quality criteria. It's prescriptive. You need observability before evaluation makes sense: you can't score outputs you can't see.

Subscribe to our newsletter

Product for Engineers

Read by 100,000+ founders and builders

We'll share your email with Substack

PostHog is an all-in-one developer platform for building successful products. We provide[product analytics],[web analytics],[session replay],[error tracking],[feature flags],[experiments],[surveys],[AI Observability],[logs],[workflows],[endpoints],[data warehouse],[CDP], and an[AI product assistant]to help debug your code, ship features faster, and keep all your usage and customer data in one stack.

source & further reading

posthog.com — original article Karpathy's Autoresearch found a 3-year-old bug in our query engine (and improved performance by 11%) PostHog will train AI models with your data (opt-in by default) The best analytics tool stack for vibe-coded apps