Minimum viable AI observability: what to set up after shipping your first AI feature

PostHog has released a guide for small teams and solo founders on setting up minimum viable AI observability after shipping their first AI feature. The guide recommends starting with just three components (tracing every LLM call, tracking costs, and capturing errors) rather than implementing full enterprise-grade monitoring systems. User feedback, evals, and the rest are layered in later as traffic grows. This approach helps teams catch costly bugs and quality issues without over-engineering their observability stack.

A lot of "LLM observability" content out there is written for teams running production AI at scale: enterprise governance, fancy eval pipelines, dedicated AI platform engineers. Useful if you're there, but wildly over-engineered if you just shipped your first AI feature on a Tuesday.

This guide is for the other half: the small teams, vibe coders, and solo founders who have a couple of agents or LLM calls in their app, no data team, and no idea where to start with AI observability.

The answer isn't to set up everything, but to set up the minimum that keeps you from getting blindsided, skip the stuff that won't pay off yet, and add depth as your product earns it. Here's what that looks like in practice.

What is AI observability?

AI observability (often used interchangeably with LLM observability and LLM analytics) is the practice of monitoring what's happening inside the AI features of your app: the prompts you send, the responses you get back, how long calls take, how much they cost, and whether the outputs are any good.

Traditional APM tools like Datadog (/blog/best-datadog-alternatives) or New Relic weren't built for this. They can tell you your API call to OpenAI succeeded with a 200, but they can't tell you the response was a hallucination, or that your prompt is costing 3x what it should. AI observability fills that gap.

The core building blocks are:

- Traces: the full record of an LLM call or chain of calls, including inputs, outputs, model, parameters, and timing
- Cost and token tracking: how much each call costs, broken down by model, user, or feature
- Evaluations (evals): ways to score the quality of outputs, not just whether they succeeded
- Prompt management: versioning and iterating on prompts without re-deploying code
- Feedback loops: capturing user signals (thumbs up/down, retries, abandonment) on AI outputs

You don't need all of these on day one. Most MVPs need only the first two.
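To make the first two building blocks concrete, here is a minimal sketch of what a trace record might hold and how cost can fall out of token counts. Everything in it is an assumption for illustration: the `Trace` fields, the `traced_call` helper, the model names, and the prices are hypothetical, and `fake_model` stands in for a real provider client. Real observability SDKs do this for you.

```python
import time
from dataclasses import dataclass, field

# Hypothetical prices in USD per 1M tokens; real prices vary by provider and model.
PRICES_PER_MILLION = {
    "small-model": {"input": 0.50, "output": 1.50},
    "large-model": {"input": 5.00, "output": 15.00},
}


@dataclass
class Trace:
    """One LLM call: inputs, output, model, parameters, timing, tokens, cost."""
    model: str
    prompt: str
    params: dict
    output: str
    latency_s: float
    input_tokens: int
    output_tokens: int
    user_id: str
    feature: str
    cost_usd: float = field(init=False)

    def __post_init__(self) -> None:
        # Cost is derived from the token counts and the model's price table.
        price = PRICES_PER_MILLION[self.model]
        self.cost_usd = (
            self.input_tokens * price["input"]
            + self.output_tokens * price["output"]
        ) / 1_000_000


def traced_call(call_model, *, model, prompt, user_id, feature, **params) -> Trace:
    """Run one model call and return a Trace describing it."""
    start = time.perf_counter()
    output, input_tokens, output_tokens = call_model(model, prompt, **params)
    latency_s = time.perf_counter() - start
    return Trace(
        model=model, prompt=prompt, params=params, output=output,
        latency_s=latency_s, input_tokens=input_tokens,
        output_tokens=output_tokens, user_id=user_id, feature=feature,
    )


def fake_model(model, prompt, **params):
    """Stand-in for a provider client; counts whitespace-separated words as tokens."""
    text = f"echo: {prompt}"
    return text, len(prompt.split()), len(text.split())


trace = traced_call(
    fake_model, model="small-model", prompt="hello there world",
    user_id="u1", feature="chat", temperature=0.2,
)
```

Because `user_id`, `feature`, and `model` travel with every record, the per-user and per-model breakdowns are just group-bys over a list of these.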
Why does my MVP need observability at all?

Three reasons, in order of how painful they are when you skip them:

- AI features burn money and can be sneaky about it (trust us, we know: /blog/optimizing-agent-cost). A single retry loop bug or a prompt that doubled in length can drain a daily API budget overnight. You won't see this in your traditional product analytics, but you'll see it in your monthly bill.
- AI failures look like working code. Your function returned a string. The string is wrong. There's no exception, no error log, no red flag in monitoring, just a user reading nonsense. You need a way to catch this that isn't "wait for a tweet dunking on your chatbot."
- You can't improve what you can't see. When a user says "the AI did something weird," "weird" is not a debuggable input. Without traces, you're guessing.

The good news is that the first version of AI observability is easy to set up (/wizard), and the free tier of most tools will cover you well past your first thousand users.

What to set up on day one

Three things, all of which can be wired up in an afternoon, especially if you're using PostHog, since the Wizard (/wizard) can do all the heavy lifting for you.

1. Tracing every LLM call

The foundation of everything else. For each call, capture the full input (prompt + context + parameters), the full output, the model used, the latency, and the token counts. This is the data you'll use to debug literally everything else: cost questions, latency questions, quality complaints, weird outputs. If you remember nothing else from this guide, set up tracing.

Most LLM observability tools handle this with a small SDK wrapper around your OpenAI or Anthropic client. Initialize the wrapped client once, use it like the normal SDK, and every call is logged automatically.

2. Cost tracking

Once tracing is set up, cost tracking comes nearly for free: your tool calculates cost per call from the token counts and model. What you actually need to do is set up alerts (/docs/alerts) so you know when something goes sideways. At minimum:

- An alert when daily spend exceeds your expected baseline by some multiplier (for example, 2x your 7-day rolling average)
- A breakdown by model, so you can spot when something is calling gpt-5.5 instead of gpt-5.4-mini, for example
- A breakdown by user, so one power user or one abuse case doesn't tank your margins

Take it from us: the risk of overpaying isn't hypothetical. When the PostHog team dug into the cost of our own AI agent, the PostHog Wizard (/blog/optimizing-agent-cost), we found it was burning tokens on three things we hadn't expected: re-reading files it had already read, redoing work after context compactions, and spawning subagents that didn't reliably terminate. The only reason we caught it was because the cost data was sitting right there in AI observability (/ai-observability). Without it, we would've just paid the oversized bill and moved on.

3. Error capture

LLM calls fail in more interesting ways than regular API calls. Beyond the usual suspects (rate limits, timeouts, network errors), you'll see things like:

- Tool-call failures: the agent tried to call a tool, the tool errored or returned malformed data, and the loop may or may not have recovered
- Agent loops that don't terminate: the model keeps calling itself, spawning subagents, or refusing to exit
- Structured output parsing failures: the model returned valid text but invalid JSON when you expected structured data
- Token/context limit errors: the prompt or conversation overflowed, often silently
- Content policy violations: the provider blocked the response, and your code needs to handle it gracefully

You want all of these surfaced. If you already have error tracking set up for the rest of your app, make sure LLM errors are flowing into it.
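As a rough illustration of "surfacing" two of these failure modes, the sketch below turns a bad structured output and a runaway agent loop into named error records that could be forwarded to whatever error tracker you use. The `LLMError` shape, the `kind` strings, and both helper functions are hypothetical names made up for this example, not part of any particular SDK.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class LLMError:
    kind: str      # a stable label you can group and alert on
    detail: str    # human-readable context for debugging
    trace_id: str  # links the error back to the trace that produced it


def check_structured_output(raw: str, required_keys: set, trace_id: str):
    """Return (data, None) on success or (None, LLMError) on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, LLMError("structured_output_parse_failure", str(exc), trace_id)
    if not isinstance(data, dict):
        return None, LLMError(
            "structured_output_parse_failure", "expected a JSON object", trace_id
        )
    missing = required_keys - set(data)
    if missing:
        return None, LLMError(
            "structured_output_missing_keys", f"missing: {sorted(missing)}", trace_id
        )
    return data, None


def check_loop_budget(step: int, max_steps: int, trace_id: str) -> Optional[LLMError]:
    """Flag an agent loop that has used up its step budget without exiting."""
    if step >= max_steps:
        return LLMError(
            "agent_loop_not_terminating", f"hit step limit {max_steps}", trace_id
        )
    return None
```

The point of the `trace_id` field is the pairing described in this section: the error says something broke, and the trace it points to shows what the agent was doing at the time.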
If you don't have error tracking yet, add it (/blog/best-error-tracking-tools-for-developers). The combination of traces + errors + logs is how you debug AI features without losing your mind. Errors tell you something broke; traces tell you what the agent was doing when it broke; logs (tool calls, intermediate state, decisions) tell you why.

What to add later

The following are where most MVP-stage teams over-invest. These are all legitimate practices; they just might not be the right thing to spend your time on when you have fewer than a thousand users and one AI feature.

Complex eval suites

You don't need 15 evaluation criteria across five different dimensions of quality, or an ensemble of LLM judges cross-checking each other. You probably don't even need automated evals yet.

Evals are essential eventually (/blog/stop-ai-slop), but at the MVP stage they tend to be premature. Why? Because:

- You don't have enough traffic to make automated scoring statistically meaningful
- You don't know what "good" looks like for your product yet, so any rubric you write could be wrong
- You can do better than evals at this scale by reading your outputs. Spend 30 minutes a week looking at random traces. You'll catch more than any rubric will.

Add automated evals (/docs/ai-evals/evaluations) once you have enough generations that you can't read them all (usually a few hundred per day). When you're ready, our beginner's guide to testing AI agents (/blog/testing-ai-agents) walks through what a minimal eval suite actually looks like: a small dataset from real user queries and recent bugs, one or two cheap code-based evaluators, one LLM-as-a-judge for a subjective criterion, and a regular trace review ritual.

A/B testing prompt variants

Prompt experimentation (/docs/prompt-management/prompt-experiments) is great, and we just shipped it in beta: you can A/B test up to 10 prompt versions side by side, with cost, latency, and eval pass rate as built-in metrics.
There's even an agent prompt you can paste into Cursor or Claude Code that wires the whole experiment up for you; we're already running our own experiments on it.

That said, it's a feature to grow into, not a feature to start with. Prompt experimentation is pointless before you have enough traffic to detect a difference between variants. If you have 50 users a day, a 10% lift in quality won't reach statistical significance for months. Until you have meaningful traffic, just ship the better-feeling prompt and move on. Track outputs so you can revert if something breaks. Come back to it once you're seeing a few hundred generations a day and you have a real hypothesis worth testing.

Heavy prompt management infrastructure

If only one person on the team writes prompts, and you ship prompt changes through your normal code deploys, you don't need a separate prompt management system yet. A prompts.py file in your repo is fine.

Prompt management (/docs/prompt-management/prompts) becomes valuable once (a) multiple people are editing prompts, (b) you want non-engineers to iterate on prompts without a deploy, or (c) you want to version and roll back prompts independently of code. None of those usually apply on day one.

Custom-built observability infra

Don't do it. You'll spend a month building something half as good as what already exists, then spend the rest of your valuable free time maintaining it instead of shipping your product. You can always migrate later, but realistically you won't need to (also, migrations normally suck, so save yourself the headache).

There are free tiers everywhere: PostHog gives you 100K LLM events/mo free (/ai-observability), Langfuse is free to self-host, and Datadog (/blog/best-datadog-alternatives) includes 40K LLM spans/mo on its free tier. Use one of them instead. For a deeper dive into free and self-hostable options, see our comparison of the best open source LLM observability tools.
Drift detection, governance dashboards, AI red-teaming pipelines

All cool features. All things you do not need at MVP stage. File these mentally under "stuff to revisit when you have a real AI ops function."

A stage-by-stage cheat sheet

As your product matures, the bar for observability rises. Here's the rough order to layer things in.

Stage 1: You have meaningful traffic (a few hundred generations a day)

Add automated evals. Now you can't read every trace, and you need a way to spot regressions. Start with deterministic, code-based evals (/blog/testing-ai-agents#deterministic-evaluators) where you can: simple checks like "the agent called the right tool," "no forbidden keyword appeared in the output," "no error was raised," or Levenshtein distance (/blog/testing-ai-agents#deterministic-evaluators) against an expected response. They're cheap, fast, and don't add an LLM dependency to your test pipeline. Only reach for an LLM-as-a-judge (/blog/stop-ai-slop) when the criterion is genuinely subjective enough that code can't capture it (tone, helpfulness, hallucination detection). Run both on a sample of production traces.

Add user feedback collection. Even a simple thumbs up/down on AI outputs is gold for debugging. Pair that signal with the trace and you can quickly find patterns in what users hate. PostHog surveys (/surveys) handle this in-app and tie responses back to the same user.

Add session replay. Watching how users actually react to AI outputs reveals things no metric will. Did they retry? Edit? Copy-paste? Close the tab? See our guide on the best session replay tools (/blog/best-session-replay-tools) if you don't already have this.

Stage 2: You have multiple AI features and a small team

Move evals online. Your offline evals only cover inputs you defined. Online evals (/blog/testing-ai-agents#embracing-the-infinite-possibilities-of-your-agent) run against a sample of live production traces (5-10% is plenty) so you catch things your test set never imagined.
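For a feel of how little code this takes, here is a sketch of two deterministic evaluators plus a way to pick a stable 5% sample of traces. The function names and the hash-based sampling scheme are our own illustration, not how any specific tool implements it.

```python
import hashlib


def in_sample(trace_id: str, rate: float = 0.05) -> bool:
    """Deterministically pick roughly `rate` of traces by hashing the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1]
    return bucket < rate


def no_forbidden_keywords(output: str, forbidden: list) -> bool:
    """Pass when none of the forbidden words appear (case-insensitive)."""
    lowered = output.lower()
    return not any(word.lower() in lowered for word in forbidden)


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(output: str, expected: str) -> float:
    """Edit distance normalized to 0..1, where 1.0 means identical strings."""
    longest = max(len(output), len(expected))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(output, expected) / longest
```

Hashing the trace ID instead of calling a random number generator means the same trace is always in or out of the sample, which makes reruns and debugging less confusing. As a sanity check, `levenshtein("kitten", "sitting")` is 3.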
Use cheap code-based evaluators for everything you can, and cheaper LLM judges for the subjective stuff.

Start a trace review ritual. The PostHog teams working on AI features run a weekly Traces Hour (/blog/testing-ai-agents#building-evaluations-by-manually-reviewing-traces) where they look at flagged traces (low feedback scores, error spikes, support tickets, outlier clusters) and turn the patterns into new evals. This is where the next generation of evaluators comes from.

Connect LLM data to product analytics. "Users who hit our AI feature retain at 2x" is the kind of insight that justifies the whole investment. To do this, your LLM observability needs to share a user model with your product analytics (/product-analytics), which is one reason bundled tools (/blog/best-analytics-stack-vibe-coded-apps) tend to win at this stage.

Consider prompt management if it fits. Once more than one person edits prompts, or you want non-engineers to iterate without a deploy, prompt management (/docs/prompt-management/prompts) starts earning its keep. Many teams never need this: a prompts.py file in your repo is fine if you're shipping prompt changes through code anyway.

Stage 3: You're scaling

Add drift monitoring. Models change (especially if you're on the latest version of an API). Inputs change. Your evals need to keep up. Track quality scores over time and alert on regressions.

Add cost optimization workflows. Things like routing simpler requests to cheaper models, caching responses where you can, and identifying users or prompts that are eating budget.

Add governance. If you're selling to enterprise, you'll need things like PII redaction, data residency, audit logs, and human-in-the-loop review queues for high-risk outputs. This is where dedicated tools start to earn their keep.

The right tool for "day one" should also support most of "later" without needing to migrate; otherwise you're going to do this all again in six months.
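On the cost side, both the day-one alert rule (daily spend above 2x the 7-day rolling average) and the "who is eating budget" question come down to a few lines of arithmetic. The sketch below assumes you can export daily totals and per-call records with a `cost_usd` field; the function names and record shape are hypothetical, and most tools give you this as built-in alerts and breakdowns.

```python
from collections import defaultdict
from statistics import mean


def spend_alert(daily_spend_usd: list, today_usd: float, multiplier: float = 2.0) -> bool:
    """True when today's spend exceeds multiplier x the average of the last 7 days."""
    window = daily_spend_usd[-7:]
    if not window:
        return False  # no baseline yet, so nothing to compare against
    return today_usd > multiplier * mean(window)


def spend_by(calls: list, key: str) -> dict:
    """Total cost grouped by a field such as "user_id" or "model", biggest first."""
    totals = defaultdict(float)
    for call in calls:
        totals[call[key]] += call["cost_usd"]
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


history = [10, 12, 11, 9, 10, 13, 12]  # last 7 days of spend; the average is 11
calls = [
    {"user_id": "a", "model": "large-model", "cost_usd": 0.5},
    {"user_id": "b", "model": "small-model", "cost_usd": 0.125},
    {"user_id": "a", "model": "small-model", "cost_usd": 0.25},
]
```

With this history, a day at $25 trips the alert (the threshold is $22) and a day at $20 does not. `spend_by(calls, "user_id")` puts user `a` first, which is the kind of list you would scan for abuse or for requests worth routing to a cheaper model.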
For a deeper comparison of tools available, see our roundup of the best open source LLM observability tools (/blog/best-open-source-llm-observability-tools). And if you're earlier in your product journey, our guide to the best analytics stack for vibe-coded apps (/blog/best-analytics-stack-vibe-coded-apps) covers the whole analytics stack, not just the AI parts.

Why PostHog works well for this

A quick pitch, since this is our blog after all: PostHog's AI observability (/ai-observability) covers all of the above in one place, with a free tier (100K events/mo) generous enough to get you well past MVP. You get:

- Tracing (/docs/ai-observability/traces) with a small SDK wrapper around OpenAI, Anthropic, and other providers
- Cost, latency, and token tracking by model, user, and feature
- Evaluations (/docs/ai-evals/evaluations) with LLM-as-a-judge and human review
- Error tracking (/error-tracking) alongside traces, so AI-specific failures (tool calls, parsing errors, loop termination) show up next to the calls that caused them
- Prompt management (/docs/prompt-management/prompts) with versioning, fetch-at-runtime, and rollback
- Connected product analytics (/product-analytics): the same user IDs flow through, so you can correlate AI usage with retention, conversion, and churn
- Session replay (/docs/ai-observability/link-session-replay) alongside traces, so you can see exactly what the user was doing when the model misbehaved
- An MCP server (/docs/model-context-protocol) that lets you query traces, costs, and evals directly from Claude Code or Cursor while you're iterating

If you want to take it for a spin, you can start free (/), no credit card needed. And if you'd rather skip the manual setup entirely, the PostHog Wizard (/wizard) handles instrumentation for you. Paste one command into Cursor or Claude Code and it detects your framework, installs the right SDK, wraps your client, and gets traces, cost data, and errors flowing in a few minutes.
Frequently asked questions

How is AI observability different from regular APM?

APM tools tell you whether your code is running and how fast. They can see that you made an API call to OpenAI and got a response back. They can't see what the model said, whether it was any good, or how much that specific call cost. AI observability adds the missing layer: the content of the calls, the quality of the outputs, and the economics of LLM usage.

Do I need evals on day one?

No. At MVP scale, manual review beats automated evals. Open your trace list once a day, read 10-20 random outputs, take notes. You'll learn more about what "good" means for your product in a week of doing this than you would from any rubric. Once you can't keep up with the volume, that's the signal to set up automated evals (/blog/stop-ai-slop).

How much should I expect to spend on observability as a solo founder?

For most MVPs: nothing. Even at a few thousand users, you can usually stay on a free or cheap plan. The thing that costs money is the LLM calls themselves, which is exactly why you want cost tracking from day one.

What if I'm just calling OpenAI or Anthropic directly without a framework?

That's actually the easiest case. Most observability tools have a wrapper around the OpenAI/Anthropic SDK that captures everything automatically. You don't need LangChain or any other framework to get full observability.

Do I need separate tools for AI observability and product analytics?

You can have separate tools, but you'll lose the most interesting insights, like whether your AI feature actually drives retention or conversion. Bundled platforms, PostHog (/) being the obvious example, share a user model across both, so you can ask questions like "do users who use the AI feature stick around longer?" without stitching data sources together.

When should I switch from a free tier to a paid plan?
When you hit one of three triggers:

- Your event volume exceeds the free tier
- You need a feature that's gated to paid plans (often SSO, EU residency, longer retention)
- You need support SLAs to ship to enterprise

Otherwise, the free tier might work fine well into post-PMF.

What's the difference between observability and evaluation?

Observability is about what happened: the traces, costs, latency, errors. It's descriptive. Evaluation is about whether what happened was good: scoring outputs against quality criteria. It's prescriptive. You need observability before evaluation makes sense: you can't score outputs you can't see.