I reviewed six "operator-ready" checklists for AI agents. None of them define the problem correctly.

A developer reviewed six 'operator-ready' checklists for AI agents and found that none define the problem correctly. The frameworks, including Anthropic's 'Building Effective Agents' and Hamel Husain's eval guide, treat reliability as a static pass-rate on a test set, ignoring distribution shift and operator-specific data. The developer argues that operator-readiness requires ongoing monitoring and disaggregation of failure modes, not just pre-deployment eval scores.

The industry has converged on a definition of "operator-ready" that is measurable, deployable, and wrong. The most cited frameworks, Anthropic's "Building Effective Agents" (December 2024), Hamel Husain's "Your AI product needs evals" (2024), the LangChain eval documentation, NIST AI RMF (2023), Google's responsible AI practices, OpenAI's model specification (May 2024), share a common structure. They define reliability as pass-rate on a test set. They define readiness as a threshold on that pass-rate. This is a reasonable definition for production-readiness. It is not a correct definition for operator-readiness. The distinction is not semantic. It has direct consequences for how you test, what you ship, and what breaks after handoff.

Hamel Husain's framework is the most practically useful of the six. His argument that "you cannot improve what you cannot measure" is correct, and his guidance on building eval sets that are representative, diverse, and graded with real human judgment is solid. The Anthropic guide's emphasis on minimal footprint and clear failure modes is well-reasoned for the agent design phase. The NIST AI RMF is the most complete risk taxonomy. Its four functions (Govern, Map, Measure, Manage) are a useful organizational structure for compliance-conscious deployments. These are good frameworks. The problem is not that they're wrong. The problem is that they're answering a different question than the one that breaks things in production.

Every framework I reviewed defines reliability as a static property: "the agent achieves X% on the eval set." That's a snapshot metric, not a deployment metric. Operators change things. They add new document types. They expand use cases. They bring their own data. Their users find inputs that your test set never anticipated. The real question is not "does the agent achieve X% on the eval set." It's "does the agent maintain X% after six weeks of operator usage, on the operator's actual input distribution."
Those are different questions. The first is testable before deployment. The second requires a different kind of eval infrastructure.

Every literature source I checked treats distribution shift as an edge case to handle. It is not an edge case. It is the default condition of operator deployment. The operator's data is never exactly your eval data. The operator's users will find inputs you did not anticipate. The operator's business context will evolve. Distribution shift is not a risk you mitigate and move past. It's the ongoing condition of every production deployment. A framework that doesn't include ongoing distribution shift monitoring as a first-class readiness requirement is describing a static artifact, not a live system.

An eval pass-rate in the low nineties can coexist with an operator error rate several times higher on real data. I've seen this in practice across multiple deployments. The reason is that pass-rate is an aggregate measure that hides the variance of failure types. There are at least four different failure modes that look identical in a pass-rate number:

A team with 94% pass-rate and mostly formatting failures is in a very different position from a team with 94% pass-rate and mostly silent content errors. The number looks the same. Operator-readiness requires disaggregating the failure modes, not aggregating them into a single score.

The Anthropic guide correctly notes that agents should be evaluated with "realistic and diverse inputs." It does not address what happens when the operator's inputs are more diverse than your test set in ways you couldn't have anticipated. This is the eval-to-deployment gap, and it is structural. You build the eval set with the data you have. The operator deploys with the data they have. Those two sets overlap imperfectly. The only way to close this gap is to treat pre-deployment testing on the operator's own corpus as a mandatory step. Not as a quality assurance nicety. As a readiness gate.
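
Here is a minimal Python sketch of the two checks argued for so far: breaking failures down by mode instead of reporting one score, and comparing eval accuracy against a manually graded sample from the operator's corpus. Everything named here is an assumption for illustration: the `GradedResult` record shape, the helper names, the two failure-mode labels (borrowed from the 94% example above), and the synthetic numbers. The 5-point gap threshold is the figure used later in this post, not a universal constant.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class GradedResult:
    """One manually graded agent output (hypothetical record shape)."""
    doc_id: str
    passed: bool
    failure_mode: Optional[str] = None  # reviewer-assigned label; None when passed


def pass_rate(results: list[GradedResult]) -> float:
    """Aggregate pass-rate: the single number most checklists stop at."""
    return sum(r.passed for r in results) / len(results)


def failure_breakdown(results: list[GradedResult]) -> Counter:
    """Count failures per mode so different failure profiles stay visible."""
    return Counter(r.failure_mode or "unlabeled" for r in results if not r.passed)


def gap_points(eval_results: list[GradedResult],
               operator_results: list[GradedResult]) -> float:
    """Eval accuracy minus operator-sample accuracy, in percentage points."""
    return (pass_rate(eval_results) - pass_rate(operator_results)) * 100


def make_results(n_pass: int, failures: dict[str, int]) -> list[GradedResult]:
    """Build synthetic graded results for the illustration below."""
    results = [GradedResult(f"ok-{i}", True) for i in range(n_pass)]
    for mode, count in failures.items():
        results += [GradedResult(f"{mode}-{i}", False, mode) for i in range(count)]
    return results


# Two made-up teams with the same pass-rate and opposite failure profiles.
team_a = make_results(94, {"formatting": 5, "silent_content_error": 1})
team_b = make_results(94, {"formatting": 1, "silent_content_error": 5})

# A made-up 50-document operator sample: 43 correct, 7 wrong.
operator_sample = make_results(43, {"silent_content_error": 7})

MAX_GAP_POINTS = 5.0  # the 5-point figure from this post; tune to the stakes

for name, team in [("team_a", team_a), ("team_b", team_b)]:
    print(name, f"{pass_rate(team):.0%}", dict(failure_breakdown(team)))

gap = gap_points(team_a, operator_sample)
print(f"gap: {gap:.1f} points, needs characterization: {gap > MAX_GAP_POINTS}")
```

The aggregate number is identical for both teams; only the breakdown shows that one of them is mostly shipping silent content errors. In a real deployment the labels would come from whatever failure taxonomy your reviewers actually use.
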
Fifty documents from the operator's actual corpus, reviewed manually, compared to the eval pass-rate on the same task. If the accuracy on those fifty documents is materially lower than the eval accuracy, the deployment is not ready regardless of what the aggregate pass-rate says.

An agent is operator-ready when:

This is more work than "run the eval suite, check the score." It is also the actual test for whether the agent will maintain its quality guarantees six weeks after handoff.

What's the fastest way to close the eval-to-deployment gap?

Before handoff, run the agent on 50 documents from the operator's actual corpus. Not synthetic data, not your training set. Documents the operator will actually send. Review those outputs manually. Compare the accuracy to your eval accuracy on the same task. If the gap is more than 5 percentage points, you have a distribution shift problem to characterize before deploying.

Is there a practical threshold for operator-readiness?

There isn't a universal threshold. The right threshold depends on the stakes of failure. For a low-stakes use case (content categorization), 85% might be acceptable. For a high-stakes use case (contract extraction with legal consequences), 85% probably isn't. What matters more than the threshold is that you're measuring operator accuracy on the operator's data rather than eval accuracy on your data. Those are different denominators.

Does ongoing monitoring replace pre-deployment testing?

No. Monitoring catches degradation after deployment. Pre-deployment testing on the operator's corpus catches distribution shift before deployment. Both are necessary. Neither substitutes for the other.

What about fine-tuning on the operator's data?

Fine-tuning is the right long-term answer for persistent distribution shift. It's not the short-term answer for a deployment that needs to go live in two weeks.
The short-term answer is: characterize the gap, document the failure modes, set up monitoring, and be transparent with the operator about the limitations on their data distribution.

The hardest problem I haven't seen solved well: how do you define "operator-ready" for an agent that will serve multiple operators with fundamentally different data distributions? A financial services operator and a healthcare operator running the same document-extraction agent have different input distributions, different failure modes, and different acceptable error rates. Per-operator eval sets are the correct answer in theory. They're expensive in practice. Is there a reasonable way to stratify a single eval set across operator types without running a full per-operator measurement pass? I've seen teams try domain-stratified sampling (one slice per major input type), but the slice sizes are never large enough to give statistically stable estimates for the rare input categories that are most likely to cause problems. If you've solved this problem, I'd be interested in how.
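
To make the slice-size problem concrete, here is a small Python sketch that puts a Wilson score interval (a standard confidence interval for a binomial proportion) around the accuracy measured on a slice. It does not solve the stratification problem; it only shows how wide the uncertainty is when a slice is small. The 85% accuracy and the slice sizes are illustrative numbers I picked, not measurements, and `z = 1.96` assumes an approximately 95% interval.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass-rate of `successes` out of `n` trials."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Same observed accuracy (85%), three hypothetical slice sizes.
for n in (20, 200, 2000):
    low, high = wilson_interval(n * 85 // 100, n)
    print(f"n={n:5d}  interval=({low:.3f}, {high:.3f})  width={high - low:.3f}")
```

At 20 documents, the interval around an observed 85% spans roughly thirty percentage points, so a rare-category slice of that size cannot distinguish an acceptable agent from an unacceptable one.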