The AI Implementation Process I Use With Every Client

An engineer outlines a five-phase AI implementation process used with clients: scoping, proof of concept, integration, evaluation, and operations. Each phase has an exit criterion that must be met before proceeding, with the goal of avoiding common project failures. The process emphasizes adversarial testing, idempotency keys, approval queues, and token budget alerts.

Most AI projects do not fail at the model. They fail in the six weeks before anyone writes a prompt, and in the six weeks after the demo lands in a Slack channel and nobody knows who owns it. I have run enough of these now, from one-off automations to multi-agent content systems running unattended, that the process has converged into something stable. This is the version I actually use.

It has five phases: scoping, proof of concept (POC), integration, evaluation, and operations. Each phase has an exit criterion. If we cannot meet the exit criterion, we do not move forward. That single rule has saved more projects than any clever architecture choice.

Scoping ends with a written document that names the workflow being automated, the system of record it touches, the success metric in hours or dollars, the data we have access to, and the smallest possible first slice. No model is chosen yet. No code is written. If we cannot produce that document, the engagement stops here and the client keeps the document.

The hardest part of scoping is resisting the urge to solve the interesting problem. Clients almost always describe the AI-shaped fantasy ("an agent that handles all support tickets") when the real opportunity is narrower and uglier ("triage tier-1 tickets that mention billing, route to the right queue, draft a reply for human approval"). The narrower version ships. The fantasy does not.

I run scoping as three sessions:

Exit criterion: a one-page scope with a single first slice, a measurable success metric, and a named human owner on the client side. No owner, no project.
The POC has one job: kill the project cheaply if it cannot work. I treat the POC as adversarial. I am trying to find the reason this will not ship before we spend integration money on it.

Concretely, a POC for me looks like this:

The POC answers four questions in order:

| Question | What "no" means |
|---|---|
| Does the model produce the right shape of output reliably? | Schema issues, structured-output failures. Fixable. |
| Does it produce the right content on easy cases? | Capability gap. Sometimes fixable with retrieval or examples. |
| Does it handle the long tail without catastrophic failures? | The real risk. Often the project killer. |
| Can we detect when it is wrong? | If no, the project cannot ship to production. Full stop. |

That last question is the one most people skip. An AI system you cannot evaluate is an AI system you cannot trust, and an AI system you cannot trust is a demo, not a product. I have walked away from POCs that worked 90% of the time because there was no signal to catch the 10%.

Exit criterion: measurable performance on the eval set that the client agrees is good enough to justify integration cost, plus a documented failure mode list.

Integration is where most of the actual work lives, and where most of my time goes. The model is usually the easy part by now. The integration is what makes it real.

My default stack for production AI work:

Three integration details I now treat as non-negotiable:

Any external action (send email, create ticket, post to CRM) gets an idempotency key derived from the input. Retries are inevitable; duplicate side effects are not.

```python
def idempotency_key(workflow_id: str, input_hash: str, step: str) -> str:
    # Same workflow, step, and input always yield the same key, so a retry can be deduplicated.
    return f"{workflow_id}:{step}:{input_hash}"
```

I always build the approval queue before I build the auto-send. Even if the client wants full automation eventually, shipping with human review for the first 2 to 4 weeks catches the failure modes the eval set missed. Turning approval off later is one config change.
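As a rough sketch of how the idempotency key and the approval-first default can fit together: the names `Dispatcher`, `hash_input`, and `require_approval` below are hypothetical, and the in-memory set stands in for whatever durable store a real deployment would use. It is an illustration under those assumptions, not a production implementation.

```python
from dataclasses import dataclass, field
import hashlib
import json


def idempotency_key(workflow_id: str, input_hash: str, step: str) -> str:
    # Same key shape as the function shown earlier in this post.
    return f"{workflow_id}:{step}:{input_hash}"


def hash_input(payload: dict) -> str:
    # Hypothetical helper: canonical JSON so that key order does not change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


@dataclass
class Dispatcher:
    # Assumption: a single flag controls whether actions wait for human review.
    require_approval: bool = True
    # Assumption: in-memory state; a real system would persist these.
    seen_keys: set = field(default_factory=set)
    approval_queue: list = field(default_factory=list)
    sent: list = field(default_factory=list)

    def dispatch(self, workflow_id: str, step: str, payload: dict) -> str:
        key = idempotency_key(workflow_id, hash_input(payload), step)
        if key in self.seen_keys:
            # A retry of the same action: skip it instead of repeating the side effect.
            return "duplicate"
        self.seen_keys.add(key)
        if self.require_approval:
            # Park the action for a human reviewer instead of executing it.
            self.approval_queue.append((key, payload))
            return "queued"
        # Stand-in for the real external call (email, ticket, CRM post).
        self.sent.append((key, payload))
        return "sent"


dispatcher = Dispatcher()
reply = {"ticket_id": 42, "reply": "Refund issued."}
print(dispatcher.dispatch("billing-triage", "send_reply", reply))  # queued
print(dispatcher.dispatch("billing-triage", "send_reply", reply))  # duplicate
```

Constructing the dispatcher with `require_approval=False` is the single config change in this sketch; the duplicate check stays in place either way.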
Token budgets per execution, hard cutoffs, and alerts at 50%, 80%, and 100% of the monthly budget. I have seen a single retry loop burn $400 in an hour. Never again.

Exit criterion: the system runs end to end on real production data, with logging, retries, idempotency, and a kill switch. Not perfect outputs yet, but the pipes are sound.

Evaluation is not a phase you finish. It is a system you build once and keep running forever. But there is a discrete block of work to set it up, and that is what this phase is.

I build three layers of evaluation:

The trap here is treating eval as a one-time gate. Models change. Prompts drift. Data shifts. The eval set has to be re-run on every change, and the production telemetry has to feed back into growing the eval set. If a real production failure happens, it goes into the eval set the same day.

Exit criterion: the client can answer "is the system still working correctly?" without calling me.

Operations is the phase that separates a project that survives from one that dies six months in, when something breaks and nobody knows where to look.

What I deliver in operations:

A few opinions, after running this loop enough times:

The shape of this process is not unique to my work. What is mine is the calibration: which phases I now know to invest in, which exit criteria I refuse to skip, and which mistakes I have made enough times to write them down. That last category is the actual deliverable when you hire someone like me, more than the code.

If you are scoping an AI implementation and want a second pair of eyes on it before you commit budget, I am happy to look at it. Reach out at https://lazar-milicevic.com/contact, or browse the rest of the blog for more on evaluation, RAG, and getting agents into production.