AI Metrics Baseline: Prove Your Feature Works Before Scaling It

An engineer argues that AI features require a metrics baseline, a small set of before-and-after measurements, to determine if a workflow is improving, degrading, or just costing more. The baseline should track cost per successful task, quality via deterministic checks and rubrics, and step-level reliability for agentic workflows. Without such baselines, production decisions become opinion-driven rather than data-driven.

An AI feature can feel impressive and still be a bad product decision.

The demo is fast. The answer sounds useful. The team is excited. Then usage grows and nobody can answer the basic questions: Is it accurate enough? Is it saving time? Which customers trust it? Why did costs spike? Should we scale it, fix it, or kill it?

That is the trap an AI metrics baseline prevents.

A baseline is not a dashboard full of vanity charts. It is a small set of before-and-after measurements that tells you whether an AI workflow is getting better, getting worse, or merely getting more expensive.

Most software teams already track uptime, errors, and conversion. AI features need those too, but they also need new signals because model behavior is probabilistic.

A normal API either returns the expected response or throws an error. An AI workflow can return:

Without a baseline, every production discussion becomes opinion-driven: "The model seems better." "Users like it." "The new prompt reduced hallucinations." "The expensive model is worth it."

Maybe. Maybe not. The baseline turns those claims into measurable comparisons.

An AI metrics baseline is the starting measurement for the workflow before you optimize or scale it. It answers five questions:

You do not need 80 metrics on day one. You need a small set of metrics that match the feature's risk and purpose.
For example:

| Feature | Useful baseline |
|---|---|
| Support answer bot | resolution rate, citation quality, escalation rate, cost per resolved issue |
| Sales email assistant | acceptance rate, edit distance, reply rate, generation latency |
| Internal coding agent | task completion rate, test pass rate, review changes, cost per merged task |
| Document extraction | field accuracy, manual correction time, retry rate, confidence calibration |
| RAG search | answer groundedness, retrieval precision, no-answer accuracy, source freshness |

The goal is not measurement theatre. The goal is decision clarity.

Start with five categories. Pick one or two metrics from each.

AI cost is not just model tokens. It includes retries, tool calls, vector database reads, reranking, logging, human review, failed jobs, and premium model fallbacks.

Track at least:

A cheap request can still be expensive if it fails often. A costly request can be acceptable if it completes a high-value workflow.

Use this formula as a starting point:

```
cost_per_successful_task = total_ai_workflow_cost / successful_task_count
```

Then split the numerator:

```
total_ai_workflow_cost = model_cost + tool_cost + retrieval_cost + review_cost + retry_cost
```

This is where many teams get surprised. The model call may not be the biggest cost after you add retries, background enrichment, and review queues.

Quality depends on the feature. Do not use one generic "AI accuracy" score for everything.

For a RAG answer, measure:

For an agent, measure:

For extraction, measure:

A simple rubric helps. Here is one you can adapt:

```json
{
  "score": 4,
  "max_score": 5,
  "checks": {
    "answers_user_question": true,
    "uses_correct_sources": true,
    "avoids_unsupported_claims": true,
    "follows_format": true,
    "needs_human_fix": false
  },
  "notes": "Correct answer with good source support. Minor wording cleanup only."
}
```

Do not rely only on model-as-judge scoring.
Use deterministic checks where possible: schema validation, citation existence, database constraints, test pass/fail, and human review samples.

A feature that works 70% of the time is not production-ready just because the successful runs look magical.

Track:

For agentic workflows, step-level reliability matters more than overall success. If the agent performs retrieval, planning, tool execution, validation, and final response generation, log each step separately.

Example event shape:

```json
{
  "workflow_id": "wf_7x92",
  "tenant_id": "tenant_123",
  "step": "tool_execution",
  "tool": "create_invoice_draft",
  "status": "failed",
  "error_type": "invalid_tool_args",
  "duration_ms": 1840,
  "model": "gpt-5.5-mini",
  "attempt": 2
}
```

This lets you see whether the problem is the model, retrieval, tools, permissions, latency, or your own validation layer.

A technically strong feature can still fail because users do not trust it or do not need it.

Track:

For workflow tools, "accepted output" is often more useful than "generated output." If your AI writes a reply and the user rewrites 80% of it, the generation was not truly successful.

A practical metric:

```
useful_output_rate = accepted_outputs / total_outputs
```

A better metric:

```
trusted_output_rate = accepted_outputs_without_major_edit / total_outputs
```

This catches the difference between novelty usage and durable product value.

This is the layer many AI dashboards skip. Ask: what job is this feature supposed to improve?

Possible metrics:

Be careful. Do not attribute every change to AI. Use comparisons where possible:

The business metric prevents the team from optimizing for beautiful model scores that do not matter.

Prompt changes are easy. Measurement is harder. That is why teams often rewrite prompts first. Resist that urge.

Before changing the model, prompt, retrieval strategy, or tool chain, capture a baseline run. Even a small sample is better than nothing.
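As a rough illustration, the cost per successful task, useful output rate, and trusted output rate formulas above can be computed from a small sample of logged runs. The sketch below is only one possible shape: `BaselineRun`, `BaselineSummary`, `summarizeBaseline`, and every field name are hypothetical, and it assumes each run already carries its own cost breakdown and user-acceptance flags.

```typescript
// Hypothetical record for one workflow run in a baseline sample.
// All names here are illustrative assumptions, not part of any real library.
type BaselineRun = {
  succeeded: boolean; // did the workflow complete its task?
  accepted: boolean;  // did the user accept the output?
  majorEdit: boolean; // did the user heavily rewrite the output?
  modelCostUsd: number;
  toolCostUsd: number;
  retrievalCostUsd: number;
  reviewCostUsd: number;
  retryCostUsd: number;
};

type BaselineSummary = {
  runs: number;
  totalCostUsd: number;
  costPerSuccessfulTaskUsd: number | null; // null when no run succeeded
  usefulOutputRate: number;
  trustedOutputRate: number;
};

function summarizeBaseline(runs: BaselineRun[]): BaselineSummary {
  if (runs.length === 0) {
    throw new Error("Baseline sample is empty");
  }

  let totalCostUsd = 0;
  let successes = 0;
  let accepted = 0;
  let trusted = 0;

  for (const run of runs) {
    // Total workflow cost = model + tool + retrieval + review + retry cost.
    totalCostUsd +=
      run.modelCostUsd +
      run.toolCostUsd +
      run.retrievalCostUsd +
      run.reviewCostUsd +
      run.retryCostUsd;

    if (run.succeeded) successes += 1;
    if (run.accepted) {
      accepted += 1;
      // Trusted output: accepted without a major edit.
      if (!run.majorEdit) trusted += 1;
    }
  }

  return {
    runs: runs.length,
    totalCostUsd,
    // Failed runs still add cost, so they raise the cost per success.
    costPerSuccessfulTaskUsd: successes > 0 ? totalCostUsd / successes : null,
    usefulOutputRate: accepted / runs.length,
    trustedOutputRate: trusted / runs.length,
  };
}

// Usage with a made-up four-run sample (values are invented for illustration).
const sample: BaselineRun[] = [
  { succeeded: true, accepted: true, majorEdit: false, modelCostUsd: 0.5, toolCostUsd: 0.25, retrievalCostUsd: 0.25, reviewCostUsd: 0, retryCostUsd: 0 },
  { succeeded: true, accepted: true, majorEdit: true, modelCostUsd: 0.5, toolCostUsd: 0, retrievalCostUsd: 0, reviewCostUsd: 0, retryCostUsd: 0.5 },
  { succeeded: false, accepted: false, majorEdit: false, modelCostUsd: 0.5, toolCostUsd: 0, retrievalCostUsd: 0, reviewCostUsd: 0.5, retryCostUsd: 1 },
  { succeeded: true, accepted: false, majorEdit: false, modelCostUsd: 0.5, toolCostUsd: 0, retrievalCostUsd: 0, reviewCostUsd: 0, retryCostUsd: 0 },
];

console.log(summarizeBaseline(sample));
```

Dividing the full cost, including failed runs and retries, by successes only is deliberate here: it reflects the point that a cheap request can still be expensive if it fails often.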
Minimum baseline process:

Your baseline record can be simple:

```json
{
  "baseline_id": "support_answer_bot_v0",
  "workflow": "support_answer_generation",
  "date": "2026-07-01",
  "dataset": "support_questions_sample_120",
  "prompt_version": "support_prompt_14",
  "retrieval_version": "kb_rag_3",
  "model": "primary_model_name",
  "metrics": {
    "avg_cost_per_request_usd": 0.018,
    "p95_latency_ms": 7200,
    "grounded_answer_rate": 0.81,
    "citation_error_rate": 0.09,
    "human_fix_required_rate": 0.22,
    "workflow_success_rate": 0.93
  }
}
```

Now every improvement has something to beat.

A common mistake is logging only the final prompt and response. That is not enough. AI product quality is shaped by the full workflow:

You need trace IDs across those steps. A simple TypeScript example:

type AiMetricEvent = { traceId: string; tenantId: string; workflow: string; step: string; status: "ok" | "failed" | "skipped"; durationMs: number; costUsd?: number; model?: string; promptVersion?: string; outputVersion?: string; errorType?: string; metadata?: Record