Stop Wasting LLM Budgets: High-Performance Semantic Caching with Spring AI and pgvector

A developer built a high-performance semantic caching system for LLM calls using Spring AI and pgvector, intercepting prompts with a CallAroundAdvisor and a local embedding model to generate query embeddings in under 5ms. The system uses pgvector with an HNSW index and a similarity threshold of 0.96 to serve cached responses, reducing duplicate API calls and saving costs.

Your enterprise is likely bleeding thousands of dollars on duplicate LLM API calls because your Redis cache fails when a user asks "How do I reset my password?" instead of "Password reset steps." In 2026, relying on exact-string matching for LLM caching is a rookie mistake that kills both your latency and your budget. pgvector performs native, hardware-accelerated cosine distance queries directly in PostgreSQL.

The fix: intercept LLM calls at the framework level using Spring AI Advisors paired with a local embedding model and a pgvector-backed similarity search.

- Use a CallAroundAdvisor to transparently intercept prompts before they hit the external LLM provider.
- Run all-MiniLM-L6-v2 inside your JVM process to generate query embeddings in under 5ms, avoiding external network hops.
- Query pgvector with an HNSW index, filtering results with a strict similarity threshold (e.g., 0.96).

Here is how to implement a high-performance, reusable semantic cache advisor using Spring AI:

    public class SemanticCacheAdvisor implements CallAroundAdvisor {

        private final PgVectorStore vectorStore;
        private final double similarityThreshold = 0.96;

        public SemanticCacheAdvisor(PgVectorStore vectorStore) {
            this.vectorStore = vectorStore;
        }

        @Override
        public AdvisedResponse aroundCall(AdvisedRequest request, CallAroundAdvisorChain chain) {
            String query = request.getPrompt().getInstructions().get(0).getContent();

            // Look up the single closest cached prompt that clears the threshold.
            var matches = vectorStore.similaritySearch(
                    SearchRequest.query(query)
                            .withSimilarityThreshold(similarityThreshold)
                            .withTopK(1));

            // Cache hit: return the stored response without calling the LLM.
            if (!matches.isEmpty()) {
                return AdvisedResponse.from(
                        matches.get(0).getMetadata().get("cached response").toString());
            }

            // Cache miss: call the LLM, then store its response keyed by the prompt.
            AdvisedResponse response = chain.nextAroundCall(request);
            var cachedDoc = new Document(query, Map.of("cached response", response.getMessage()));
            vectorStore.add(List.of(cachedDoc));
            return response;
        }
    }

- Use the Advisor chain to handle semantic caching transparently without polluting your services.
- Index your pgvector columns to maintain sub-10ms query times as your cache grows to millions of rows.

I built javalld.com while prepping for senior roles: complete LLD problems with execution traces, not just theory.
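
To see the lookup-then-store flow of the advisor in isolation, here is a minimal plain-Java sketch with no Spring or PostgreSQL dependency. It rests on a few assumptions: a brute-force scan over an in-memory list stands in for pgvector's HNSW index, a hypothetical embedder function stands in for the local all-MiniLM-L6-v2 model, and the class name SemanticCacheSketch and its methods are illustrative rather than part of Spring AI.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Illustrative sketch only: an in-memory stand-in for the pgvector-backed cache.
final class SemanticCacheSketch {

    // One cached entry: the embedding of the original prompt plus the stored answer.
    record Entry(float[] embedding, String response) {}

    private final List<Entry> entries = new ArrayList<>();
    private final Function<String, float[]> embedder; // hypothetical local embedding model
    private final double similarityThreshold;

    SemanticCacheSketch(Function<String, float[]> embedder, double similarityThreshold) {
        this.embedder = embedder;
        this.similarityThreshold = similarityThreshold;
    }

    // Cosine similarity of two vectors; returns 0.0 if either vector has zero length.
    static double cosineSimilarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Top-1 search: the closest cached entry, kept only if it clears the threshold.
    private Optional<String> findClosest(float[] query) {
        Entry best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Entry entry : entries) {
            double score = cosineSimilarity(query, entry.embedding());
            if (score > bestScore) {
                bestScore = score;
                best = entry;
            }
        }
        if (best != null && bestScore >= similarityThreshold) {
            return Optional.of(best.response());
        }
        return Optional.empty();
    }

    // Serve a cache hit; on a miss, call the model once and store its answer.
    String answer(String prompt, Function<String, String> llm) {
        float[] query = embedder.apply(prompt);
        Optional<String> hit = findClosest(query);
        if (hit.isPresent()) {
            return hit.get();
        }
        String response = llm.apply(prompt);
        entries.add(new Entry(query, response));
        return response;
    }
}
```

With a threshold of 0.96, a prompt is served from the cache only when its embedding has a cosine similarity of at least 0.96 to a stored prompt. pgvector's <=> operator reports cosine distance, which is 1 minus that similarity, so the equivalent distance cutoff is 0.04.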