Your enterprise is likely bleeding thousands of dollars on duplicate LLM API calls because your exact-match Redis cache misses when a user asks "How do I reset my password?" instead of "Password reset steps." In 2026, relying on exact-string matching for LLM caching is a rookie mistake that hurts both your latency and your budget.
With `pgvector`, PostgreSQL can perform native, hardware-accelerated cosine distance queries.
Intercept LLM calls at the framework level using Spring AI Advisors paired with a local embedding model and a pgvector-backed similarity search.
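To make the cosine comparison concrete, here is a minimal plain-Java sketch of what a similarity check computes. It is an illustration only: the class and method names are hypothetical, the three-dimensional vectors are toy values rather than real embeddings, and in a real setup pgvector would do this work inside the database. The 0.96 threshold is the value used later in this post.

```java
// Illustrative only: this class is hypothetical and not part of Spring AI or pgvector.
final class CosineThreshold {

    /** Cosine similarity of two equal-length, non-zero vectors: dot(a, b) / (|a| * |b|). */
    static double cosineSimilarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Cosine distance is defined as 1 - cosine similarity. */
    static double cosineDistance(double[] a, double[] b) {
        return 1.0 - cosineSimilarity(a, b);
    }

    /** A cached entry counts as a hit only when similarity is strictly above the threshold. */
    static boolean isCacheHit(double[] query, double[] cached, double threshold) {
        return cosineSimilarity(query, cached) > threshold;
    }

    public static void main(String[] args) {
        double[] cached = {0.20, 0.70, 0.10};      // toy stand-in for a cached query embedding
        double[] paraphrase = {0.21, 0.69, 0.12};  // points in nearly the same direction
        double[] unrelated = {0.90, 0.05, 0.40};   // points somewhere else entirely

        System.out.println(isCacheHit(paraphrase, cached, 0.96)); // prints true
        System.out.println(isCacheHit(unrelated, cached, 0.96));  // prints false
    }
}
```

Only the direction of the vectors matters here, not their length, which is why two differently worded questions can still land above the threshold.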
Use a `CallAroundAdvisor` to transparently intercept prompts before they hit the external LLM provider.
Run a local embedding model (e.g., `all-MiniLM-L6-v2`) inside your JVM process to generate query embeddings in under 5ms, avoiding external network hops.
Query `pgvector` with an HNSW index, filtering results with a strict similarity threshold (e.g., `> 0.96`).
Here is how to implement a high-performance, reusable semantic cache advisor using Spring AI:
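import java.util.List;
import java.util.Map;
import org.springframework.ai.chat.client.advisor.api.*;
import org.springframework.ai.chat.messages.AssistantMessage;
import org.springframework.ai.chat.model.ChatResponse;
import org.springframework.ai.chat.model.Generation;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;
// Assumes a pre-1.0 Spring AI milestone that still ships CallAroundAdvisor; accessor names vary between milestones.
public class SemanticCacheAdvisor implements CallAroundAdvisor {
    private final VectorStore vectorStore; // e.g. a PgVectorStore bean
    private final double similarityThreshold = 0.96;
    public SemanticCacheAdvisor(VectorStore vectorStore) { this.vectorStore = vectorStore; }
    @Override public String getName() { return "SemanticCacheAdvisor"; }
    @Override public int getOrder() { return 0; }
    @Override
    public AdvisedResponse aroundCall(AdvisedRequest request, CallAroundAdvisorChain chain) {
        String query = request.userText();
        var matches = vectorStore.similaritySearch(
            SearchRequest.query(query).withSimilarityThreshold(similarityThreshold).withTopK(1)
        );
        if (!matches.isEmpty()) {
            // Cache hit: wrap the stored answer in a ChatResponse and skip the LLM call.
            String cached = matches.get(0).getMetadata().get("cached_response").toString();
            var chatResponse = new ChatResponse(List.of(new Generation(new AssistantMessage(cached))));
            return new AdvisedResponse(chatResponse, request.adviseContext());
        }
        AdvisedResponse response = chain.nextAroundCall(request);
        // Cache miss: store the answer text alongside the query that produced it.
        String answer = response.response().getResult().getOutput().getContent();
        vectorStore.add(List.of(new Document(query, Map.of("cached_response", answer))));
        return response;
    }
}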
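At the database level, the lookup that the advisor delegates to the vector store boils down to a nearest-neighbour query over an HNSW index. The sketch below is an illustration under stated assumptions, not what Spring AI generates: the table and column names (`semantic_cache`, `embedding`, `cached_response`) and the helper class are hypothetical, and only `hnsw`, `vector_cosine_ops`, and the `<=>` cosine distance operator are pgvector syntax. Because `<=>` returns a distance, a similarity threshold of 0.96 corresponds to a maximum distance of 0.04.

```java
import java.util.StringJoiner;

// Illustrative sketch: table, column, and class names are hypothetical.
final class PgVectorCacheSql {

    // HNSW index on the embedding column, built for cosine distance.
    static final String CREATE_INDEX = """
            CREATE INDEX IF NOT EXISTS semantic_cache_embedding_hnsw
            ON semantic_cache USING hnsw (embedding vector_cosine_ops)
            """;

    // Nearest cached entry by cosine distance. Both JDBC-style placeholders are
    // bound to the same query vector in its text form. ORDER BY ... LIMIT is the
    // query shape that lets PostgreSQL use the HNSW index.
    static final String LOOKUP = """
            SELECT cached_response, embedding <=> ?::vector AS distance
            FROM semantic_cache
            ORDER BY embedding <=> ?::vector
            LIMIT 1
            """;

    /** Converts a minimum cosine similarity into the equivalent maximum cosine distance. */
    static double maxDistanceFor(double minSimilarity) {
        return 1.0 - minSimilarity;
    }

    /** Formats an embedding in pgvector's text form, e.g. [0.1,0.2,0.3]. */
    static String toVectorLiteral(float[] embedding) {
        StringJoiner joiner = new StringJoiner(",", "[", "]");
        for (float value : embedding) {
            joiner.add(Float.toString(value));
        }
        return joiner.toString();
    }

    /** True when the nearest neighbour is close enough to count as a cache hit. */
    static boolean isHit(double distance, double minSimilarity) {
        return distance < maxDistanceFor(minSimilarity);
    }
}
```

In this shape the query always returns the single nearest row, and the application decides whether it is close enough by passing the returned `distance` to `isHit`. Keeping the threshold check outside the SQL is a design choice for the sketch; it keeps the indexed query simple and makes the threshold easy to tune.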
Use the Spring AI `Advisor` chain to handle semantic caching transparently without polluting your services.
Put an HNSW index on your `pgvector` columns to maintain sub-10ms query times as your cache grows to millions of rows.
I built javalld.com while prepping for senior roles: complete LLD problems with execution traces, not just theory.