I watched a GPT-4 function call that worked perfectly in the playground silently fail in production. The output was valid JSON. The structure was correct. But the content was a hallucination. It took me two hours to notice the problem, and another day to build a guardrail that caught it.
That moment taught me something I now tell every founder I work with. The first 80% of an AI feature takes 20% of the time. The last 20% takes 80%. Not because the code is hard. The code is the easy part. The hard part is everything around it.
I've built production AI pipelines at scale. A job board platform that scores 10,000+ listings daily with LLM function calling. An AI resume tailor that generates dozens of tailored resumes in a single session. A meeting assistant with real-time transcription. Every single project followed the same pattern. The demo worked in an hour. The production system took weeks.
Here's what that last 20% actually costs.
Your first prompt will work on your test cases. It will fail on real data.
On the job board platform, I wrote a GPT-4 prompt to extract structured fields from raw job descriptions. Company name, title, location, skills. The playground gave me perfect results. I deployed it.
Within an hour, the pipeline started producing garbage. One listing had a title that was 400 characters long because the ATS source concatenated the job title with the department. Another had a location field that said "Remote - Must be willing to travel to Chicago quarterly." My prompt assumed location was a single city name.
The fix wasn't a better prompt. It was a layered approach. I added input validation before the LLM call, truncating fields that exceeded reasonable lengths. I added output validation after, checking that extracted fields matched expected patterns. I built a retry mechanism that flagged low-confidence outputs for manual review.
Here's the pattern I now use for every LLM extraction task:
    const extractionSchema = {
      type: "object",
      properties: {
        title: { type: "string", maxLength: 200 },
        company: { type: "string", maxLength: 100 },
        location: { type: "string", maxLength: 150 },
        skills: {
          type: "array",
          items: { type: "string", maxLength: 50 },
          maxItems: 20
        }
      },
      required: ["title", "company"]
    };

    const result = await callLLMWithValidation(prompt, extractionSchema);
    if (!result.valid) {
      // Fall back to a simpler extraction or flag for review
      queueForManualReview(result.raw);
    }
The schema enforces structure. The validation catches surprises. This single pattern eliminated 90% of the extraction errors. But it took three iterations to get right.
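The helper callLLMWithValidation is not shown here, so the following is only a minimal sketch of what the two validation layers around it might look like. It assumes a hand-rolled checker rather than a JSON Schema library, and the names FieldRule, truncateInput, and validateExtraction are hypothetical, chosen for illustration.

```typescript
// Hypothetical sketch: these names are illustrative, not the pipeline's real helpers.
type FieldRule = { maxLength: number; required: boolean };

// Mirrors the limits in the schema above.
const fieldRules: Record<string, FieldRule> = {
  title: { maxLength: 200, required: true },
  company: { maxLength: 100, required: true },
  location: { maxLength: 150, required: false },
};
const MAX_SKILLS = 20;
const MAX_SKILL_LENGTH = 50;

// Input-side guard: cap the raw text before it reaches the model.
function truncateInput(raw: string, maxChars: number): string {
  return raw.length > maxChars ? raw.slice(0, maxChars) : raw;
}

// Output-side guard: returns a list of problems; an empty list means the output passed.
// Assumption: a null field is treated the same as an absent one.
function validateExtraction(output: unknown): string[] {
  if (typeof output !== "object" || output === null || Array.isArray(output)) {
    return ["output is not an object"];
  }
  const record = output as Record<string, unknown>;
  const errors: string[] = [];

  for (const [field, rule] of Object.entries(fieldRules)) {
    const value = record[field];
    if (value === undefined || value === null) {
      if (rule.required) errors.push(`${field} is missing`);
      continue;
    }
    if (typeof value !== "string") {
      errors.push(`${field} is not a string`);
    } else if (value.length > rule.maxLength) {
      errors.push(`${field} exceeds ${rule.maxLength} characters`);
    }
  }

  const skills = record["skills"];
  if (skills !== undefined && skills !== null) {
    if (!Array.isArray(skills)) {
      errors.push("skills is not an array");
    } else {
      if (skills.length > MAX_SKILLS) {
        errors.push(`skills has more than ${MAX_SKILLS} items`);
      }
      const hasBadItem = skills.some(
        (skill) => typeof skill !== "string" || skill.length > MAX_SKILL_LENGTH,
      );
      if (hasBadItem) errors.push("skills contains an invalid item");
    }
  }
  return errors;
}
```

A 400-character title like the one from the ATS source would come back as a "title exceeds 200 characters" error, which is the signal to queue the listing for review instead of storing it.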
Your AI feature will cost more than you expect. Not because the API is expensive per call. Because you will make more calls than you planned.
I built an AI-powered job description rewrite pipeline to improve SEO for 1M+ listings. The prototype cost pennies. The production cost was eye-watering. The pipeline was shut down because the LLM costs at scale were unsustainable. Every rewrite consumed tokens. At 1M listings, the math didn't work.
That experience taught me to think about cost from day one. Not after the feature ships.
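A back-of-the-envelope estimate is one way to do that before anything ships. The sketch below is illustrative only: the CostInputs shape and estimateCostUsd are hypothetical names, and the token counts and prices in the usage example are placeholders, not real rates from any provider.

```typescript
// Hypothetical cost model. Every number fed in is an assumption supplied by you.
interface CostInputs {
  items: number;                 // how many listings (or documents) will be processed
  inputTokensPerItem: number;    // average prompt size per call
  outputTokensPerItem: number;   // average completion size per call
  inputPricePerMillion: number;  // USD per 1M input tokens (placeholder)
  outputPricePerMillion: number; // USD per 1M output tokens (placeholder)
  retryRate: number;             // fraction of calls that get repeated, e.g. 0.05
}

function estimateCostUsd(c: CostInputs): number {
  const callsPerItem = 1 + c.retryRate;
  const inputCost =
    (c.items * c.inputTokensPerItem * callsPerItem * c.inputPricePerMillion) / 1e6;
  const outputCost =
    (c.items * c.outputTokensPerItem * callsPerItem * c.outputPricePerMillion) / 1e6;
  return inputCost + outputCost;
}

// Placeholder numbers, chosen only to show how the same call scales.
const perCall = {
  inputTokensPerItem: 1000,
  outputTokensPerItem: 500,
  inputPricePerMillion: 10,
  outputPricePerMillion: 30,
  retryRate: 0,
};
const prototypeCost = estimateCostUsd({ ...perCall, items: 100 });        // 2.5
const productionCost = estimateCostUsd({ ...perCall, items: 1_000_000 }); // 25000
```

With these made-up inputs, a 100-item prototype comes to $2.50 and the same call at 1M items comes to $25,000. The per-call price never changes. The volume does.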
One solution I've seen work: match the model to the task.
On the resume tailor project, I used GPT-4o-mini for the bulk generation pipeline. The quality was indistinguishable from GPT-4 for that specific task. The cost was a fraction. The trick is knowing which tasks need the expensive model and which don't.
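One way to make that call less of a guess is to run each candidate model over the same sample and pick the cheapest one whose quality is close enough to the best. The sketch below assumes you already have a pass rate per model from your own evaluation. The ModelTrial shape, pickModel, and the numbers in the usage example are hypothetical.

```typescript
// Hypothetical selection rule: cheapest model within `tolerance` of the best pass rate.
interface ModelTrial {
  model: string;
  passRate: number;       // fraction of sample outputs judged acceptable, 0..1
  costPerCallUsd: number; // measured or estimated average cost per call
}

function pickModel(trials: ModelTrial[], tolerance: number): ModelTrial {
  if (trials.length === 0) throw new Error("no trials supplied");
  const bestPassRate = Math.max(...trials.map((t) => t.passRate));
  const goodEnough = trials.filter((t) => t.passRate >= bestPassRate - tolerance);
  // goodEnough always contains at least the best-scoring model.
  return goodEnough.reduce((cheapest, t) =>
    t.costPerCallUsd < cheapest.costPerCallUsd ? t : cheapest,
  );
}

// Made-up trial results for illustration only.
const trials: ModelTrial[] = [
  { model: "gpt-4", passRate: 0.97, costPerCallUsd: 0.03 },
  { model: "gpt-4o-mini", passRate: 0.96, costPerCallUsd: 0.001 },
];
const lenientChoice = pickModel(trials, 0.02).model; // "gpt-4o-mini"
const strictChoice = pickModel(trials, 0.005).model; // "gpt-4"
```

The tolerance is the real decision. It encodes how much quality you are willing to trade for cost on that specific task.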
LLMs are non-deterministic. They time out. They hit rate limits. They return empty responses. They hallucinate. Your code needs to handle all of these.
On the job board platform, the scoring pipeline runs 10,000 jobs daily. OpenAI's API occasionally returns a 429 rate limit error. Occasionally a request times out. Occasionally the model returns a response that doesn't match the function schema.
I built a retry layer with exponential backoff. I added a circuit breaker that pauses the pipeline if error rates spike. I wrote a fallback that uses a simpler model (GPT-4o-mini) when the primary model fails. The system now handles failures gracefully without human intervention.
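    // callLLM and LLMResponse are defined elsewhere in the pipeline.
    const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

    async function callWithRetry(prompt: string, maxRetries = 3): Promise<LLMResponse> {
      for (let attempt = 1; attempt <= maxRetries; attempt++) {
        try {
          return await callLLM(prompt);
        } catch (error) {
          // Out of attempts: surface the last error to the caller.
          if (attempt === maxRetries) throw error;
          // Exponential backoff: 2s, 4s, 8s, ...
          const delay = Math.pow(2, attempt) * 1000;
          await sleep(delay);
        }
      }
      // Only reachable when maxRetries < 1; keeps the declared return type honest.
      throw new Error("callWithRetry: maxRetries must be at least 1");
    }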
This code is simple. But the decision of when to retry and when to fail is not. Retry on rate limits. Fail fast on invalid input. The distinction matters.
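A minimal sketch of that distinction, assuming thrown errors carry a numeric status field the way many HTTP clients expose it. The helpers isRetryable and retrySelective are hypothetical names, and treating 429 and 5xx as retryable is a common convention, not something specific to one SDK.

```typescript
// Hypothetical classification: retry on rate limits and server errors, fail fast otherwise.
// Assumption: errors expose an HTTP-like numeric `status`; timeouts are named "TimeoutError".
function isRetryable(error: unknown): boolean {
  if (typeof error !== "object" || error === null) return false;
  const candidate = error as { status?: unknown; name?: unknown };
  if (candidate.name === "TimeoutError") return true;
  if (typeof candidate.status !== "number") return false;
  return candidate.status === 429 || candidate.status >= 500;
}

type Sleep = (ms: number) => Promise<void>;
const realSleep: Sleep = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retries only retryable failures, with the same 2s, 4s, 8s backoff as above.
// `sleep` is injectable so tests do not have to wait.
async function retrySelective<T>(
  call: () => Promise<T>,
  maxRetries = 3,
  sleep: Sleep = realSleep,
): Promise<T> {
  let lastError: unknown = new Error("retrySelective: maxRetries must be at least 1");
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await call();
    } catch (error) {
      lastError = error;
      // Invalid input (for example a 400) will not get better on retry.
      if (!isRetryable(error) || attempt === maxRetries) break;
      await sleep(Math.pow(2, attempt) * 1000);
    }
  }
  throw lastError;
}
```

Under this rule a 400 fails after one attempt, while a 429 is retried until it succeeds or the attempts run out.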
How do you know your AI feature is working correctly in production?
The answer is not "check a few examples". The answer is automated evaluation.
On the resume tailor project, I built an anti-hallucination schema using conditional presence flags. The LLM outputs a has_* guard for every field. If has_company is false, the company field must be null. This prevents the model from fabricating information. Every output is validated against this schema before it reaches the user.
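The presence-flag rule can be checked mechanically. The function below is a hypothetical sketch, not the project's actual validator. It enforces the rule described above (flag false means the field is null) and adds one assumption of its own: a true flag with a null field is also reported.

```typescript
// Hypothetical checker for has_* presence flags on a flat LLM output object.
function checkPresenceFlags(output: Record<string, unknown>): string[] {
  const violations: string[] = [];
  for (const [key, flag] of Object.entries(output)) {
    if (!key.startsWith("has_")) continue;
    const field = key.slice("has_".length);
    const value = output[field] ?? null; // a missing field counts as null

    if (typeof flag !== "boolean") {
      violations.push(`${key} must be a boolean`);
      continue;
    }
    if (!flag && value !== null) {
      // The model said the field is absent but filled it in anyway.
      violations.push(`${field} must be null when ${key} is false`);
    }
    if (flag && value === null) {
      // Assumption beyond the rule above: a true flag needs a value.
      violations.push(`${field} is missing although ${key} is true`);
    }
  }
  return violations;
}

const fabricated = checkPresenceFlags({ has_company: false, company: "Acme Corp" });
// fabricated: ["company must be null when has_company is false"]
```

An empty result means the output is consistent with its own flags. Anything else is rejected before it reaches the user.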
For the job board scoring pipeline, I built a quality gate that samples 1% of scored listings daily and compares them against manual reviews. If the error rate exceeds a threshold, an alert fires. This catches drift before it affects users.
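A quality gate of this kind could be sketched as follows. How the model's score is compared with the manual review is not specified above, so this version assumes both are numeric and counts a sample as wrong when they differ by more than a set margin. All names here are hypothetical.

```typescript
// Hypothetical quality gate: sample, compare against manual review, alert on drift.
interface ReviewedSample {
  id: string;
  modelScore: number;
  humanScore: number;
}

// Random sampling at the given rate; the random source is injectable for tests.
function sampleAtRate<T>(items: T[], rate: number, random: () => number = Math.random): T[] {
  return items.filter(() => random() < rate);
}

// Fraction of reviewed samples where model and reviewer disagree by more than maxDiff.
function errorRate(samples: ReviewedSample[], maxDiff: number): number {
  if (samples.length === 0) return 0;
  const wrong = samples.filter((s) => Math.abs(s.modelScore - s.humanScore) > maxDiff).length;
  return wrong / samples.length;
}

function shouldAlert(samples: ReviewedSample[], maxDiff: number, threshold: number): boolean {
  return errorRate(samples, maxDiff) > threshold;
}

// Made-up review data for illustration only.
const reviewed: ReviewedSample[] = [
  { id: "a", modelScore: 80, humanScore: 82 },
  { id: "b", modelScore: 40, humanScore: 45 },
  { id: "c", modelScore: 90, humanScore: 50 }, // clear disagreement
  { id: "d", modelScore: 60, humanScore: 60 },
];
const observedRate = errorRate(reviewed, 10); // 0.25
const alert = shouldAlert(reviewed, 10, 0.2); // true
```

Calling sampleAtRate(listings, 0.01) gives the daily 1% sample. The threshold and the margin are the two numbers worth tuning against your own review data.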
Evaluation is the part of the last 20% that most teams skip. They ship the feature, test it manually, and call it done. Then three weeks later, users start complaining that the AI is getting worse. The model updated. The data changed. The prompt degraded. Without evaluation, you won't know until it's too late.
The last 20% of your AI feature is not a bug fix. It is a system.
You need input validation, output validation, retry logic, cost monitoring, model selection, caching, evaluation, and drift detection. You need to handle edge cases you haven't imagined yet. You need to know when your AI is wrong.
I've built these systems across multiple projects. The job board platform processes 10,000 listings daily without manual oversight. The resume tailor generates dozens of tailored resumes in parallel with zero hallucinated data. The meeting assistant streams real-time transcription and analysis. Every one of these required that last 20%.
If your team is wrestling with AI features that work in a demo but fail in production, that's the kind of thing I help with. Happy to compare notes.
Written by Abdul Rehman, full-stack AI engineer building production SaaS, MVPs, and AI automation. More at PrimeStrides.