The Silent Cost of AI Agents: Why Your Next.js SaaS Is Burning Money on LLM Calls

A developer describes how runaway LLM costs forced them to shut down a perfectly functioning AI pipeline that rewrote job descriptions at scale. The post identifies three common cost pitfalls in AI agent architecture: redundant context resends, lack of caching, and using expensive models for simple tasks. The developer provides code examples for prompt caching and a two-tier model routing approach to control costs.

I had to kill a pipeline that was doing exactly what it was supposed to do. It was rewriting job descriptions at scale, improving SEO, and running without errors. The client asked me to shut it down anyway.

The problem wasn't quality. It was cost. Running GPT-class models across a million listings added up faster than anyone expected. The pipeline worked perfectly, and that was the problem. Every perfect run cost money. Over time, the bill became the feature that mattered most.

That moment changed how I think about AI agent architecture. Most teams building AI features into their Next.js SaaS focus on accuracy, latency, and user experience. They forget the fourth dimension: cost per action. And that's the one that kills projects.

I've seen the same cost patterns emerge across multiple projects. Here are the three that hurt most.

Redundant context resends. Every time your agent calls the LLM, it sends the system prompt, the conversation history, and the user input. If you have 10 agents running in parallel for different users, you're sending the same system prompt 10 times. At scale, that adds up to an enormous volume of redundant input tokens every hour.

No caching strategy. Most teams treat every LLM call as unique. But many calls are identical or nearly identical: same user query, same context, same expected output. Without caching, you pay full price for every duplicate.

Expensive models for everything. GPT-4 is great for complex reasoning. It's overkill for simple classification, extraction, or rewriting. But most teams use one model for everything because it's easier to build that way. Easy to build, expensive to run.

These three patterns are the reason so many AI features don't survive their first billing cycle. The pipeline I had to shut down suffered from all of them.

The first fix is always prompt caching. If your system prompt is 2,000 tokens and you send it 100 times, that's 200,000 input tokens, and all but the first 2,000 are repeats. Cache it.
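To make that arithmetic concrete, here is a minimal back-of-the-envelope sketch. The helper name `estimateResendCost` and its parameters are hypothetical, and the $2.50 rate is only the per-million input price quoted for `gpt-4o` in this post, so treat the result as an illustration rather than a billing figure.

```typescript
// Hypothetical helper: estimates what resending a fixed system prompt costs.
interface ResendEstimate {
  totalTokens: number;     // system-prompt tokens sent across all calls
  redundantTokens: number; // every send after the first one
  totalCostUsd: number;    // input cost of totalTokens at the given rate
}

function estimateResendCost(
  promptTokens: number,
  calls: number,
  usdPerMillionInputTokens: number,
): ResendEstimate {
  const totalTokens = promptTokens * calls;
  // The first send is unavoidable; only the repeats count as redundant.
  const redundantTokens = promptTokens * Math.max(calls - 1, 0);
  const totalCostUsd = (totalTokens / 1e6) * usdPerMillionInputTokens;
  return { totalTokens, redundantTokens, totalCostUsd };
}

// The example from the text: a 2,000-token system prompt sent 100 times,
// priced at an assumed $2.50 per million input tokens.
const resendEstimate = estimateResendCost(2000, 100, 2.5);
// resendEstimate.totalTokens is 200000; resendEstimate.redundantTokens is 198000
```

At 100 calls the waste is pocket change. Under the same assumed rate, a million calls with that prompt is 2 billion input tokens, or roughly $5,000, before any user content is counted.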
Provider-side prompt caching, which OpenAI and Anthropic both support and newer providers are adding, discounts a repeated prompt prefix such as a long system prompt. It pairs well with an application-level response cache, which works with any LLM provider and skips the API call entirely when the same prompt and input come back. Here's a general pattern for that response cache.

    import { createHash } from 'node:crypto';
    import Redis from 'ioredis';
    import OpenAI from 'openai';

    const redis = new Redis();   // assumes a local Redis instance
    const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

    // Cache key based on prompt content and model, not user identity
    function buildCacheKey(model: string, systemPrompt: string, userInput: string): string {
      const hash = createHash('sha256')
        // JSON-encode the parts so their boundaries cannot blur together
        .update(JSON.stringify([model, systemPrompt, userInput]))
        .digest('hex');
      return `llm:${hash}`;
    }

    // Check cache before making the API call
    async function getCompletion(
      systemPrompt: string,
      userInput: string,
      options: { useCache?: boolean; model?: string } = {},
    ) {
      const model = options.model || 'gpt-4o-mini';
      const key = buildCacheKey(model, systemPrompt, userInput);

      if (options.useCache) {
        const cached = await redis.get(key);
        if (cached) return JSON.parse(cached);
      }

      const response = await openai.chat.completions.create({
        model,
        messages: [
          { role: 'system', content: systemPrompt },
          { role: 'user', content: userInput },
        ],
      });

      if (options.useCache) {
        await redis.setex(key, 3600, JSON.stringify(response)); // expire after one hour
      }
      return response;
    }

This isn't complicated. But most teams skip it because they don't think about caching until the bill arrives. By then, the damage is done.

Not every LLM call needs the same horsepower. Classifying a job listing as remote or on-site is a trivial task. Extracting structured data from a legal document is not.

A pattern that works well is a two-tier routing approach. Simple tasks go to a cheap model. Complex tasks go to an expensive one. The router itself is a cheap call that decides where to send the work.

    type TaskDifficulty = 'simple' | 'complex';

    async function routeTask(input: string): Promise<TaskDifficulty> {
      // A quick, cheap call to classify the task
      const classification = await openai.chat.completions.create({
        model: 'gpt-4o-mini',
        messages: [
          { role: 'system', content: 'Classify this task as simple or complex. Respond with one word.' },
          { role: 'user', content: input },
        ],
        max_tokens: 5,
      });
      const label = classification.choices[0]?.message.content?.toLowerCase() ?? '';
      // Anything that is not clearly "complex" stays on the cheap tier
      return label.includes('complex') ? 'complex' : 'simple';
    }

    async function processWithFallback(input: string) {
      const difficulty = await routeTask(input);
      const model = difficulty === 'simple'
        ? 'gpt-4o-mini' // $0.15 per million input tokens
        : 'gpt-4o';     // $2.50 per million input tokens
      // That's roughly a 17x price difference for the same task
      return openai.chat.completions.create({
        model,
        messages: [{ role: 'user', content: input }],
      });
    }

For the job description rewrite pipeline that got shut down, this pattern alone could have made a significant difference. I evaluated DeepSeek V4 Flash as a replacement at roughly 23x cheaper than GPT-4.1 with sufficient quality for the task. The pipeline could have stayed alive with better routing.

The most expensive LLM call is the one that gives you bad output and forces a retry. If your agent returns malformed JSON, you pay again to fix it. If it hallucinates a field, you pay to regenerate.

Function calling with strict JSON schemas prevents this. In strict mode the arguments the model returns conform to the schema, or the call fails outright. No malformed JSON, no parsing errors, no retry loops.

    const extractionSchema = {
      type: 'function' as const,
      function: {
        name: 'extract_job_details',
        description: 'Extract structured job information from raw text',
        strict: true, // enforce the schema on the model's output
        parameters: {
          type: 'object',
          properties: {
            title: { type: 'string' },
            company: { type: 'string' },
            // Strict mode requires every property to be listed as required,
            // so values that may be missing are typed as nullable instead
            salary_range: {
              type: 'object',
              properties: {
                min: { type: ['number', 'null'] },
                max: { type: ['number', 'null'] },
                currency: { type: ['string', 'null'] },
              },
              required: ['min', 'max', 'currency'],
              additionalProperties: false,
            },
            remote: { type: 'boolean' },
          },
          required: ['title', 'company', 'salary_range', 'remote'],
          additionalProperties: false,
        },
      },
    };

    const response = await openai.chat.completions.create({
      model: 'gpt-4o-mini',
      messages: [{ role: 'user', content: rawJobText }],
      tools: [extractionSchema],
      tool_choice: { type: 'function', function: { name: 'extract_job_details' } },
    });

This pattern eliminated retries in the AI Resume Tailor I built. The model either returns valid structured data or fails cleanly. No hallucinated fields, no broken downstream pipelines. Every retry you avoid is money you keep.
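Even with a schema enforced on the model side, it can be worth checking the parsed arguments before they reach downstream code, so a bad response fails cleanly instead of triggering a blind retry. The sketch below is one dependency-free way to do that. `JobDetails` and `parseJobDetails` are hypothetical names that mirror the required fields of the schema above; they are not part of any SDK.

```typescript
// Hypothetical shape mirroring the extraction schema's required fields.
interface JobDetails {
  title: string;
  company: string;
  remote: boolean;
}

// Parses the raw JSON arguments string returned by the model and checks the
// required fields. Returns null instead of throwing, so the caller decides
// whether a retry is worth paying for.
function parseJobDetails(rawArguments: string): JobDetails | null {
  let value: unknown;
  try {
    value = JSON.parse(rawArguments);
  } catch {
    return null; // malformed or truncated JSON
  }
  if (typeof value !== 'object' || value === null) return null;

  const { title, company, remote } = value as Record<string, unknown>;
  if (typeof title !== 'string') return null;
  if (typeof company !== 'string') return null;
  if (typeof remote !== 'boolean') return null;

  return { title, company, remote };
}

// A well-formed payload passes; a truncated one is rejected without throwing.
const validDetails = parseJobDetails('{"title":"Engineer","company":"Acme","remote":true}');
const rejectedDetails = parseJobDetails('{"title":"Engineer"');
```

Returning `null` rather than throwing keeps the retry decision explicit: the caller can log the failure, fall back to a default, or retry once with a budget, instead of looping by accident.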
The teams I see succeed with AI features treat cost as a first-class constraint, not an afterthought. They design their agent architecture knowing exactly how much each action costs. They cache aggressively. They route intelligently. They validate output before paying for retries.

The pipeline I had to shut down would have survived with better architecture from the start. Those three patterns (redundant context, no caching, expensive models for everything) are exactly what killed it.

If your team is building AI features into a Next.js SaaS and the costs are climbing faster than the value, that's the kind of problem I help with. Happy to compare notes on designing an agent architecture that doesn't burn your budget.

Written by Abdul Rehman, full-stack AI engineer building production SaaS, MVPs, and AI automation. More at PrimeStrides.