{"slug": "stop-wasting-llm-budgets-high-performance-semantic-caching-with-spring-ai-and", "title": "Stop Wasting LLM Budgets: High-Performance Semantic Caching with Spring AI and pgvector", "summary": "A developer built a high-performance semantic caching system for LLM calls using Spring AI and pgvector, intercepting prompts with a CallAroundAdvisor and a local embedding model to generate query embeddings in under 5ms. The system uses pgvector with an HNSW index and a similarity threshold of 0.96 to serve cached responses, reducing duplicate API calls and saving costs.", "body_md": "Your enterprise is likely bleeding thousands of dollars on duplicate LLM API calls because your exact-match Redis cache misses when a user asks \"How do I reset my password?\" instead of \"Password reset steps.\" In 2026, relying on exact-string matching for LLM caching is a rookie mistake that kills both your latency and your budget.\n\n`pgvector` performs native, hardware-accelerated cosine distance queries.\n\nIntercept LLM calls at the framework level using Spring AI Advisors paired with a local embedding model and a pgvector-backed similarity search:\n\n- Use a `CallAroundAdvisor` to transparently intercept prompts before they hit the external LLM provider.\n- Run a local embedding model (e.g., `all-MiniLM-L6-v2`) inside your JVM process to generate query embeddings in under 5ms, avoiding external network hops.\n- Query `pgvector` with an HNSW index, filtering results with a strict similarity threshold (e.g., `> 0.96`).\n\nHere is how to implement a high-performance, reusable semantic cache advisor using Spring AI. The sketch assumes the 1.0.0-M4 milestone API, which the `CallAroundAdvisor` and `SearchRequest.query(...)` calls most closely match; accessor names differ in other releases:\n\n```\n// Sketch against the Spring AI 1.0.0-M4 API (an assumed version; names differ in other releases).\nimport java.util.List;\nimport java.util.Map;\nimport org.springframework.ai.chat.client.advisor.api.*;\nimport org.springframework.ai.chat.messages.AssistantMessage;\nimport org.springframework.ai.chat.model.*;\nimport org.springframework.ai.document.Document;\nimport org.springframework.ai.vectorstore.*;\n\npublic class SemanticCacheAdvisor implements CallAroundAdvisor {\n    private final PgVectorStore vectorStore;\n    private final double similarityThreshold = 0.96;\n\n    public SemanticCacheAdvisor(PgVectorStore vectorStore) { this.vectorStore = vectorStore; }\n\n    @Override public String getName() { return \"SemanticCacheAdvisor\"; }\n    @Override public int getOrder() { return 0; }\n\n    @Override\n    public AdvisedResponse aroundCall(AdvisedRequest request, CallAroundAdvisorChain chain) {\n        String query = request.userText();\n        var matches = vectorStore.similaritySearch(\n            SearchRequest.query(query).withSimilarityThreshold(similarityThreshold).withTopK(1)\n        );\n        if (!matches.isEmpty()) {\n            // Cache hit: rebuild a response from the stored answer without calling the LLM.\n            String cached = matches.get(0).getMetadata().get(\"cached_response\").toString();\n            var generation = new Generation(new AssistantMessage(cached));\n            return new AdvisedResponse(new ChatResponse(List.of(generation)), request.adviseContext());\n        }\n        // Cache miss: call the LLM, then store the answer text alongside the query.\n        AdvisedResponse response = chain.nextAroundCall(request);\n        String answer = response.response().getResult().getOutput().getContent();\n        vectorStore.add(List.of(new Document(query, Map.of(\"cached_response\", answer))));\n        return response;\n    }\n}\n```\n\n- Use the Spring AI `Advisor` chain to handle semantic caching transparently without polluting your services.\n- Index your `pgvector` columns to maintain sub-10ms query times as your cache grows to millions of rows.\n\nI built [javalld.com] while prepping for senior roles: complete LLD problems with execution traces, not just theory.", "url": "https://wpnews.pro/news/stop-wasting-llm-budgets-high-performance-semantic-caching-with-spring-ai-and", "canonical_source": "https://dev.to/machinecodingmaster/stop-wasting-llm-budgets-high-performance-semantic-caching-with-spring-ai-and-pgvector-2n1o", "published_at": "2026-06-21 07:17:35+00:00", "updated_at": "2026-06-21 08:06:41.748019+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "ai-infrastructure", "machine-learning", "natural-language-processing"], "entities": ["Spring AI", "pgvector", "all-MiniLM-L6-v2", "HNSW", "javalld.com"], "alternates": {"html": "https://wpnews.pro/news/stop-wasting-llm-budgets-high-performance-semantic-caching-with-spring-ai-and", "markdown": "https://wpnews.pro/news/stop-wasting-llm-budgets-high-performance-semantic-caching-with-spring-ai-and.md", "text": "https://wpnews.pro/news/stop-wasting-llm-budgets-high-performance-semantic-caching-with-spring-ai-and.txt", "jsonld": "https://wpnews.pro/news/stop-wasting-llm-budgets-high-performance-semantic-caching-with-spring-ai-and.jsonld"}}