The Silent Cost of AI Agents: Why Your Next.js SaaS Is Burning Money on LLM Calls

A developer describes how runaway LLM costs forced them to shut down a perfectly functioning AI pipeline that rewrote job descriptions at scale. The post identifies three common cost pitfalls in AI agent architecture: redundant context resends, lack of caching, and using expensive models for simple tasks. The developer provides code examples for prompt caching and a two-tier model routing approach to control costs.

I had to kill a pipeline that was doing exactly what it was supposed to do. It was rewriting job descriptions at scale, improving SEO, and running without errors. The client asked me to shut it down anyway.

The problem wasn't quality. It was cost. Running GPT-class models across a million listings added up faster than anyone expected. The pipeline worked perfectly, and that was the problem. Every perfect run cost money. Over time, the bill became the feature that mattered most.

That moment changed how I think about AI agent architecture. Most teams building AI features into their Next.js SaaS focus on accuracy, latency, and user experience. They forget the fourth dimension: cost per action. And that's the one that kills projects.

I've seen these patterns emerge across multiple projects. Here are the three that hurt most.

Redundant context resends. Every time your agent calls the LLM, it sends the system prompt, the conversation history, and the user input. If you have 10 agents running in parallel for different users, you're sending the same system prompt 10 times. At scale, that can mean millions of redundant tokens every hour.

No caching strategy. Most teams treat every LLM call as unique. But many calls are identical or nearly identical: same user query, same context, same expected output. Without caching, you pay full price for every duplicate.

Expensive models for everything. GPT-4 is great for complex reasoning.
It's overkill for simple classification, extraction, or rewriting. But most teams use one model for everything because it's easier to build that way. Easy to build, expensive to run.

These three patterns are the reason so many AI features don't survive their first billing cycle. The pipeline I had to shut down suffered from all of them.

The first fix is always prompt caching. If your system prompt is 2,000 tokens and you send it 100 times, that's 200,000 tokens, nearly all of them redundant. Cache it.

Provider-side prompt caching, which OpenAI and Anthropic both support and newer providers are adding, discounts a repeated prompt prefix. The general pattern below works with any LLM provider and goes one step further: it caches whole responses in Redis, keyed on the prompt content, so an identical call never reaches the API at all.

```typescript
import * as crypto from 'node:crypto';
import Redis from 'ioredis';
import OpenAI from 'openai';

// Assumed clients: ioredis (its API matches the `setex` call below) and the official OpenAI SDK.
const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Cache key based on prompt content, not just user identity.
// The model is part of the key so one model's answer is never served for another.
function buildCacheKey(model: string, systemPrompt: string, userInput: string): string {
  const hash = crypto
    .createHash('sha256')
    .update(JSON.stringify([model, systemPrompt, userInput]))
    .digest('hex');
  return `llm:${hash}`;
}

// Check cache before making the API call
async function getCompletion(
  systemPrompt: string,
  userInput: string,
  options: { useCache?: boolean; model?: string } = {}
) {
  const model = options.model || 'gpt-4o-mini';
  const key = buildCacheKey(model, systemPrompt, userInput);

  if (options.useCache) {
    const cached = await redis.get(key);
    if (cached) return JSON.parse(cached);
  }

  const response = await openai.chat.completions.create({
    model,
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: userInput },
    ],
  });

  if (options.useCache) {
    // Keep the cached response for one hour
    await redis.setex(key, 3600, JSON.stringify(response));
  }
  return response;
}
```

This isn't complicated. But most teams skip it because they don't think about caching until the bill arrives. By then, the damage is done.

Not every LLM call needs the same horsepower. Classifying a job listing as remote or on-site is a trivial task. Extracting structured data from a legal document is not.

A pattern that works well is a two-tier routing approach. Simple tasks go to a cheap model. Complex tasks go to an expensive one.
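
To see why the split matters, here is a rough back-of-the-envelope sketch in TypeScript. The tier names, the prices, the 1,000 tokens per call, and the 80% share of simple tasks are made-up placeholders, not real provider pricing or figures from the pipeline described earlier; only the million-call scale echoes the listing count mentioned above.

```typescript
// Hypothetical price per million tokens in USD; placeholders, not real provider pricing.
interface Tier {
  name: string;
  usdPerMillionTokens: number;
}

const CHEAP: Tier = { name: 'cheap-model', usdPerMillionTokens: 0.5 };
const EXPENSIVE: Tier = { name: 'expensive-model', usdPerMillionTokens: 10 };

// Cost of sending `calls` requests of `tokensPerCall` tokens each to a single tier.
function tierCost(tier: Tier, calls: number, tokensPerCall: number): number {
  return (calls * tokensPerCall * tier.usdPerMillionTokens) / 1_000_000;
}

// Cost of a workload when a share of the calls (simpleShare, between 0 and 1)
// goes to the cheap tier and the rest goes to the expensive tier.
function routedCost(calls: number, tokensPerCall: number, simpleShare: number): number {
  const simpleCalls = Math.round(calls * simpleShare);
  const complexCalls = calls - simpleCalls;
  return (
    tierCost(CHEAP, simpleCalls, tokensPerCall) +
    tierCost(EXPENSIVE, complexCalls, tokensPerCall)
  );
}

const calls = 1_000_000; // one call per listing
const tokensPerCall = 1_000; // assumed average

const singleModel = tierCost(EXPENSIVE, calls, tokensPerCall);
const twoTier = routedCost(calls, tokensPerCall, 0.8);

console.log(singleModel); // 10000
console.log(twoTier); // 2400
```

Under these assumed numbers, routing 80% of the work to the cheap tier cuts the bill from 10,000 to 2,400. The estimate leaves out the cost of the routing call itself, so the real saving depends on keeping that call small.
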
The router itself is a cheap call that decides where to send the work.

type TaskDifficulty = 'simple' | 'complex';

async function routeTask(input: string): Promise