# Stop Wasting LLM Budgets: High-Performance Semantic Caching with Spring AI and pgvector

> Source: <https://dev.to/machinecodingmaster/stop-wasting-llm-budgets-high-performance-semantic-caching-with-spring-ai-and-pgvector-2n1o>
> Published: 2026-06-21 07:17:35+00:00

Your enterprise is likely bleeding thousands of dollars on duplicate LLM API calls because your Redis cache fails when a user asks "How do I reset my password?" instead of "Password reset steps." In 2026, relying on exact-string matching for LLM caching is a rookie mistake that kills both your latency and your budget.
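To make the failure mode concrete, here is a minimal sketch of why an exact-match key misses a paraphrase while a vector comparison catches it. The three-dimensional vectors are hand-made stand-ins for real embeddings (hypothetical values, not output from any model), and the class name `ExactVsSemantic` is only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class ExactVsSemantic {
    // Cosine similarity: dot(a, b) / (|a| * |b|).
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Exact-match lookup, the way a string-keyed cache behaves.
    static boolean exactHit(Map<String, String> cache, String prompt) {
        return cache.containsKey(prompt);
    }

    public static void main(String[] args) {
        Map<String, String> exactCache = new HashMap<>();
        exactCache.put("Password reset steps", "Go to Settings, then Security, then Reset password.");

        // Same intent, different wording: the string key does not match.
        String prompt = "How do I reset my password?";
        System.out.println("exact hit: " + exactHit(exactCache, prompt));

        // Toy vectors standing in for embeddings (hypothetical values).
        double[] cachedVec = {0.90, 0.40, 0.10};    // "Password reset steps"
        double[] promptVec = {0.88, 0.45, 0.12};    // "How do I reset my password?"
        double[] unrelatedVec = {0.10, 0.20, 0.95}; // an unrelated question

        double threshold = 0.96;
        System.out.println("paraphrase above threshold: " + (cosine(cachedVec, promptVec) > threshold));
        System.out.println("unrelated above threshold: " + (cosine(cachedVec, unrelatedVec) > threshold));
    }
}
```

With real embeddings the vectors have hundreds of dimensions, but the comparison is the same: a paraphrase lands close to the cached prompt, an unrelated question does not.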

PostgreSQL with `pgvector` can perform native, hardware-accelerated cosine distance queries. Intercept LLM calls at the framework level using Spring AI Advisors paired with a local embedding model and a pgvector-backed similarity search.

- Use a `CallAroundAdvisor` to transparently intercept prompts before they hit the external LLM provider.
- Run a local embedding model (e.g., `all-MiniLM-L6-v2`) inside your JVM process to generate query embeddings in under 5ms, avoiding external network hops.
- Query `pgvector` with an HNSW index, filtering results with a strict similarity threshold (e.g., `> 0.96`).

Here is how to implement a high-performance, reusable semantic cache advisor using Spring AI:

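```java
import java.util.List;
import java.util.Map;
import org.springframework.ai.chat.client.advisor.api.*;
import org.springframework.ai.chat.messages.AssistantMessage;
import org.springframework.ai.chat.model.ChatResponse;
import org.springframework.ai.chat.model.Generation;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;

// Targets the Spring AI 1.0.0 milestone advisor API; type and accessor names differ in later releases.
public class SemanticCacheAdvisor implements CallAroundAdvisor {
    private final VectorStore vectorStore; // a PgVectorStore bean in this setup
    private final double similarityThreshold = 0.96;

    public SemanticCacheAdvisor(VectorStore vectorStore) { this.vectorStore = vectorStore; }
    @Override public String getName() { return "SemanticCacheAdvisor"; }
    @Override public int getOrder() { return 0; } // lower values run earlier in the chain

    @Override
    public AdvisedResponse aroundCall(AdvisedRequest request, CallAroundAdvisorChain chain) {
        String query = request.userText();
        var matches = vectorStore.similaritySearch(
            SearchRequest.query(query).withSimilarityThreshold(similarityThreshold).withTopK(1)
        );
        if (!matches.isEmpty()) {
            // Cache hit: wrap the stored answer in a ChatResponse without calling the LLM.
            String cached = matches.get(0).getMetadata().get("cached_response").toString();
            var hit = new ChatResponse(List.of(new Generation(new AssistantMessage(cached))));
            return new AdvisedResponse(hit, request.adviseContext());
        }
        // Cache miss: call the model, then store the answer under the query's embedding.
        AdvisedResponse response = chain.nextAroundCall(request);
        String answer = response.response().getResult().getOutput().getContent();
        vectorStore.add(List.of(new Document(query, Map.of("cached_response", answer))));
        return response;
    }
}
```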
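For readers who want to see what the similarity search amounts to at the database level, here is a rough sketch of the pgvector statements involved, held as Java string constants. The table name `semantic_cache`, its columns, and the class name are assumptions made for illustration; Spring AI's `PgVectorStore` manages its own table, so this is not the schema it creates. The sketch relies on two facts: `all-MiniLM-L6-v2` produces 384-dimensional embeddings, and pgvector's `<=>` operator returns cosine distance, so a similarity threshold of `0.96` corresponds to a distance of at most `0.04`.

```java
public class PgVectorCacheSql {
    // all-MiniLM-L6-v2 produces 384-dimensional embeddings.
    static final int DIMENSIONS = 384;

    static final String CREATE_EXTENSION = "CREATE EXTENSION IF NOT EXISTS vector";

    // Hypothetical table for illustration; not the schema Spring AI creates.
    static final String CREATE_TABLE = """
            CREATE TABLE IF NOT EXISTS semantic_cache (
                id              BIGSERIAL PRIMARY KEY,
                prompt          TEXT NOT NULL,
                cached_response TEXT NOT NULL,
                embedding       vector(%d) NOT NULL
            )""".formatted(DIMENSIONS);

    // HNSW index with the cosine operator class, so ORDER BY <=> can use it.
    static final String CREATE_INDEX = """
            CREATE INDEX IF NOT EXISTS semantic_cache_embedding_hnsw
            ON semantic_cache USING hnsw (embedding vector_cosine_ops)""";

    // <=> is pgvector's cosine distance operator; similarity = 1 - distance.
    static final String NEAREST = """
            SELECT cached_response, 1 - (embedding <=> ?::vector) AS similarity
            FROM semantic_cache
            ORDER BY embedding <=> ?::vector
            LIMIT 1""";

    // Converts a similarity threshold into the equivalent maximum cosine distance.
    static double maxCosineDistance(double similarityThreshold) {
        return 1.0 - similarityThreshold;
    }

    // Applied in application code to the single row returned by NEAREST.
    static boolean isCacheHit(double similarity, double similarityThreshold) {
        return similarity > similarityThreshold;
    }

    public static void main(String[] args) {
        System.out.println(CREATE_EXTENSION + ";");
        System.out.println(CREATE_TABLE + ";");
        System.out.println(CREATE_INDEX + ";");
        System.out.println(NEAREST + ";");
        System.out.printf("max cosine distance for 0.96: %.2f%n", maxCosineDistance(0.96));
    }
}
```

Ordering by `embedding <=> ?` with `LIMIT 1` is the query shape an HNSW index can serve; the threshold check is done afterwards on the one returned row.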

- Use the Spring AI `Advisor` chain to handle semantic caching transparently without polluting your services.
- Put an HNSW index on your `pgvector` columns to maintain sub-10ms query times as your cache grows to millions of rows.

I built [javalld.com] while prepping for senior roles: complete LLD problems with execution traces, not just theory.
