{"slug": "how-i-cut-my-ai-bill-by-caching-llm-responses-in-node-js", "title": "How I Cut My AI Bill by Caching LLM Responses in Node.js", "summary": "The author built llm-cacher, an open-source Node.js caching library for LLM API calls, after realizing that repeated or similar prompts during testing and production lead to unnecessary costs. The library offers zero-code-change integration, multiple storage backends, and semantic matching to cache responses for near-identical prompts, distinguishing itself from alternatives like LangChain's built-in cache, SaaS proxies (Helicone, Portkey), Python-based GPTCache, and Upstash's managed service.", "body_md": "# I built an LLM caching library to test what AI-assisted development actually looks like\n\nI've been spending my evenings on a personal side project, just learning by building. The latest experiment was wiring an AI agent into it.\n\nWhile testing, I caught myself sending almost the same prompts over and over. Same intent, slightly different wording. And every test run cost me real money.\n\nThen a thought hit me: if I'm doing this while testing, **real users in production absolutely will too**. The first 1000 users of any AI chatbot mostly ask the same handful of questions. The LLM provider charges you for every single one.\n\nI looked for a good caching solution and didn't find one that ticked all my boxes. So I built [llm-cacher](https://www.npmjs.com/package/llm-cacher) and used it as an excuse to try something I hadn't done before: work with an AI assistant as a real collaborator throughout the entire build. 
I'd drive, it would implement, and I'd review everything that came out.\n\nHere's what almost every LLM integration looks like:\n\n``` ts\nimport OpenAI from \"openai\";\n\nconst openai = new OpenAI();\n\nasync function summarize(text: string) {\n  const res = await openai.chat.completions.create({\n    model: \"gpt-4o\",\n    messages: [\n      { role: \"system\", content: \"Summarize the following text.\" },\n      { role: \"user\", content: text },\n    ],\n  });\n  return res.choices[0].message.content;\n}\n```\n\nIf `summarize()` gets called with the same `text` twice, you pay twice. Run an eval suite a hundred times? Pay a hundred times.\n\nYou could roll your own cache:\n\n``` ts\nconst cache = new Map();\n\nasync function summarize(text: string) {\n  if (cache.has(text)) return cache.get(text);\n  // ...\n}\n```\n\nBut now you're maintaining cache logic for every API call. And it only handles **exact** matches: \"Summarize this article\" and \"Summarize this article please\" become different cache keys, even though the model returns essentially the same answer.\n\nThat's the gap I wanted to close. Three things drove the design: zero changes to existing code, multiple storage backends to fit any stack, and semantic matching so near-identical prompts share the same cache entry.\n\nWhat started as \"just cache the response\" turned out to be more involved than I expected: streaming, semantic search, distributed storage, and index management each brought their own surprises.\n\n## Who is this for\n\nThere are a few other caching options in the Node.js ecosystem worth knowing about:\n\n**LangChain.js** has built-in caching, but only if you write your entire integration against the LangChain abstraction layer. If you're already using it, great: use theirs. If you're not, adopting LangChain just for caching is a lot to take on.\n\n**Helicone** and **Portkey** are SaaS proxies that include caching as part of a broader observability platform. 
If you need cost tracking, rate limiting, and request logging alongside caching, they're worth looking at. The trade-off is that your requests go through their servers.\n\n**GPTCache** is the closest open-source equivalent with semantic caching, but it's Python-first and runs as a Docker sidecar, not a direct npm install.\n\n**Upstash Semantic Cache** is a JavaScript SDK with semantic caching, but it's tied to Upstash's managed service.\n\n**Anthropic's built-in prompt caching** is worth mentioning separately because it's easy to confuse with what llm-cacher does. Anthropic's feature caches the model's internal state for long system prompts, reducing the cost of re-processing repeated prefixes. `llm-cacher` caches the full response. They're complementary, so you can use both.\n\n`llm-cacher` is for when you want self-hosted caching, you're using the OpenAI or Anthropic SDK directly, and you don't want to adopt a framework or sign up for a service to get there.\n\n## Quick start\n\n```\nnpm install llm-cacher\n```\n\n``` js\nimport OpenAI from \"openai\";\nimport { createCachedClient } from \"llm-cacher\";\n\nconst openai = createCachedClient(new OpenAI(), {\n  ttl: \"24h\",\n  storage: \"memory\",\n});\n\n// First call hits the API\nconst res1 = await openai.chat.completions.create({\n  model: \"gpt-4o\",\n  messages: [{ role: \"user\", content: \"What is 2+2?\" }],\n});\n\n// Second identical call is served from cache instantly\nconst res2 = await openai.chat.completions.create({\n  model: \"gpt-4o\",\n  messages: [{ role: \"user\", content: \"What is 2+2?\" }],\n});\n```\n\n`createCachedClient` returns a Proxy with the same TypeScript type as the original client. The rest of your code stays identical.\n\n## How it works under the hood\n\nThe cache key is a **SHA-256 hash** of the request parameters: `model`, `messages`, `temperature`, `top_p`, and so on. 
The `stream` flag is excluded from the key, so streaming and non-streaming calls for the same request share the same cache key.\n\nWhen a streaming request is cached, the chunks are accumulated and stored as a list. On a cache hit, they're replayed as an `AsyncGenerator`, so your `for await` loop never knows the difference:\n\n``` js\nconst stream = await openai.chat.completions.create({\n  model: 'gpt-4o',\n  messages: [...],\n  stream: true,\n})\n\nfor await (const chunk of stream) {\n  process.stdout.write(chunk.choices[0]?.delta?.content ?? '')\n}\n// Works whether the response came from the API or from cache\n```\n\n## Storage backends\n\n``` js\n// Memory — default, zero deps\ncreateCachedClient(client, { storage: \"memory\", maxSize: 500 });\n\n// File — useful for CI and local dev\ncreateCachedClient(client, { storage: \"file\", storagePath: \"./cache.json\" });\n\n// SQLite — great for single-process apps\nimport { SQLiteStorage } from \"llm-cacher\";\ncreateCachedClient(client, {\n  storage: new SQLiteStorage({ path: \"./cache.db\" }),\n});\n\n// Redis — for multi-instance production\nimport { RedisStorage } from \"llm-cacher\";\nimport Redis from \"ioredis\";\ncreateCachedClient(client, {\n  storage: new RedisStorage({ client: new Redis() }),\n});\n\n// DynamoDB — for serverless\nimport { DynamoDBStorage } from \"llm-cacher\";\ncreateCachedClient(client, {\n  storage: new DynamoDBStorage({ tableName: \"llm-cache\", region: \"us-east-1\" }),\n});\n```\n\nThe backends aren't one-size-fits-all: each fits a specific environment. Memory for tests, file for CI and local dev, SQLite when you need persistence without a server, Redis for multi-instance production, DynamoDB when you're serverless and want expiry handled at the infrastructure level. All backends are optional peer dependencies, so you only install what you actually use.\n\n## Semantic caching\n\nExact-match caching misses a lot of real-world hits. 
Consider:\n\n```\n\"Summarize this article.\"\n\"Summarize the article above.\"\n\"Can you summarize this article please?\"\n```\n\nTo a hash function, these are three completely different requests. To the model, the outputs are nearly identical.\n\n`llm-cacher` solves this by computing **embeddings** for each prompt and comparing them with cosine similarity. If the similarity is above your threshold, it's a cache hit.\n\n``` js\nimport OpenAI from \"openai\";\nimport { createCachedClient, LocalEmbedder } from \"llm-cacher\";\n\nconst openai = createCachedClient(new OpenAI(), {\n  storage: \"sqlite\",\n  semantic: {\n    embedder: new LocalEmbedder(), // ~25MB model, runs locally, no API key\n    threshold: 0.92, // higher = stricter matching\n  },\n});\n```\n\n`LocalEmbedder` uses `all-MiniLM-L6-v2` via `@huggingface/transformers`. No API key, no extra cost. For higher accuracy, you can switch to OpenAI embeddings:\n\n``` js\nimport OpenAI from 'openai'\nimport { createCachedClient, OpenAIEmbedder } from 'llm-cacher'\n\nconst openai = createCachedClient(new OpenAI(), {\n  storage: 'sqlite',\n  semantic: {\n    embedder: new OpenAIEmbedder({ client: new OpenAI() }),\n    threshold: 0.95,\n    indexType: 'hnsw', // O(log n) lookup for large caches\n  },\n})\n```\n\nBy default, similarity search does a linear scan across all cached embeddings, which is fine for most use cases. If your cache grows into the tens of thousands of unique entries, `indexType: 'hnsw'` switches to an HNSW graph index and keeps lookups fast.\n\n## Framework integrations\n\nEach framework has its own way of sharing state across requests: Express augments `req`, Hono uses typed context variables, NestJS uses dependency injection. 
The integrations follow those conventions so `withCache` feels native to whatever stack you're in.\n\n### NestJS\n\n``` ts\n// app.module.ts\n@Module({\n  imports: [\n    LlmCacheModule.forRoot({\n      ttl: \"24h\",\n      storage: new RedisStorage({ client: new Redis() }),\n    }),\n  ],\n})\nexport class AppModule {}\n\n// chat.service.ts\n@Injectable()\nexport class ChatService {\n  private readonly openai: OpenAI;\n\n  constructor(@InjectLlmCache() private readonly llmCache: LlmCacheService) {\n    this.openai = this.llmCache.wrap(new OpenAI());\n  }\n}\n```\n\n### Express\n\n``` js\napp.use(llmCacheMiddleware({ ttl: \"24h\", storage: \"memory\" }));\n\napp.post(\"/chat\", async (req, res) => {\n  const openai = req.withCache(new OpenAI());\n  // ...\n});\n```\n\n### Hono\n\n``` js\napp.use(llmCacheMiddleware({ ttl: \"24h\", storage: \"sqlite\" }));\n\napp.post(\"/chat\", async (c) => {\n  const openai = c.get(\"withCache\")(new OpenAI());\n  // ...\n});\n```\n\n## Things I learned along the way\n\n**Streaming is harder to cache than it looks.** You can't just intercept the response: you have to yield each chunk to the caller in real time while simultaneously collecting them into an array. And if storage fails after the stream has fully delivered, you can't throw: the caller already received their data. That's why the `set()` call after a stream ends is wrapped in `.catch(() => undefined)`. It's not lazy error handling, it's deliberate. A storage failure at that point is not the caller's problem.\n\n**The similarity index needs active cleanup.** When a cache entry expires in storage, its embedding stays in the in-memory index indefinitely if you don't do anything about it. Left unchecked, the index keeps growing and starts returning keys that no longer exist in storage. 
The fix is to remove the key from the index whenever a `get()` returns null, whether that's on a direct lookup or after a semantic match comes back empty.\n\n**HNSW doesn't delete, it marks.** `hnswlib` doesn't support removing a vector from the index outright. Instead it uses `markDelete()`, which flags the entry but leaves it in memory. To reclaim those slots, you track a `deletedCount` and pass `replaceDeleted: true` on the next `addPoint()`, which lets the library reuse a marked slot instead of allocating a new one. It's not obvious from the docs and easy to get wrong.\n\n**Proxy over a class wrapper.** The obvious approach to wrapping an SDK is subclassing or a decorator class. The problem: you'd have to declare every method statically, and the return type would diverge from the original. A `Proxy` intercepts only `chat.completions.create` and passes everything else through to the real client untouched, so the TypeScript type stays identical to the original `OpenAI` instance. No re-declarations, no type casting.\n\n## On using AI to build this\n\nOne of my goals going into this was to test how far an AI assistant could get without much hand-holding: give it a direction, see what it produces.\n\nThe honest answer: it produced a lot of code quickly, and a lot of that code had bugs. Not obvious crashes, but subtle logic errors: a TTL check that was off by up to a second, a mock in a test that never actually exercised the code it was supposed to test, edge cases in the similarity index that only showed up when I read the implementation carefully. Each one required me to sit down, understand what the code was doing, and explain back to the AI where it went wrong.\n\nUsing an AI assistant genuinely speeds up development. I wouldn't have built this as fast on my own. But the speed only pays off if you understand what it's generating. If you accept the output without reading it, the bugs ship with the code. 
The AI is confident whether it's right or wrong, and it's on you to tell the difference.\n\nI'd use it again. But I'd go in knowing that \"review everything that comes out\" isn't optional. I'd also suggest running the assistant in a virtual machine with restricted access rights.\n\n## What's next\n\n- Cost tracking: show how much you've saved compared to always hitting the API\n- A dashboard for inspecting cache contents and hit rates\n- Gemini and Mistral adapters\n\nIf any of this sounds useful, or you want something completely different, open an issue. I'm genuinely open to feedback on direction.\n\n## Try it out\n\n```\nnpm install llm-cacher\n```\n\nIf you hit a weird edge case or want to plug in a new storage backend, PRs are welcome.", "url": "https://wpnews.pro/news/how-i-cut-my-ai-bill-by-caching-llm-responses-in-node-js", "canonical_source": "https://dev.to/yaroslav-solo/how-i-cut-my-ai-bill-by-caching-llm-responses-in-nodejs-3430", "published_at": "2026-05-20 13:47:56+00:00", "updated_at": "2026-05-20 14:06:43.006084+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "developer-tools", "open-source"], "entities": ["OpenAI", "GPT-4o", "llm-cacher", "Node.js"], "alternates": {"html": "https://wpnews.pro/news/how-i-cut-my-ai-bill-by-caching-llm-responses-in-node-js", "markdown": "https://wpnews.pro/news/how-i-cut-my-ai-bill-by-caching-llm-responses-in-node-js.md", "text": "https://wpnews.pro/news/how-i-cut-my-ai-bill-by-caching-llm-responses-in-node-js.txt", "jsonld": "https://wpnews.pro/news/how-i-cut-my-ai-bill-by-caching-llm-responses-in-node-js.jsonld"}}