Source: dev.to

Stop Wasting LLM Budgets: High-Performance Semantic Caching with Spring AI and pgvector

A developer built a high-performance semantic caching system for LLM calls using Spring AI and pgvector, intercepting prompts with a CallAroundAdvisor and a local embedding model to generate query embeddings in under 5ms. The system uses pgvector with an HNSW index and a similarity threshold of 0.96 to serve cached responses, reducing duplicate API calls and saving costs.

Published Jun 21, 2026

Your enterprise is likely bleeding thousands of dollars on duplicate LLM API calls because your Redis cache fails when a user asks "How do I reset my password?" instead of "Password reset steps." In 2026, relying on exact-string matching for LLM caching is a rookie mistake that kills both your latency and your budget.
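To make the failure mode concrete, here is a minimal sketch of an exact-match cache in plain Java. The class name `ExactMatchCacheDemo` and the placeholder response are assumptions for illustration only; a `HashMap` stands in for Redis, since both look entries up by the literal prompt string.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: an exact-match cache keyed on the raw prompt string.
// A HashMap stands in for a key-value store such as Redis.
class ExactMatchCacheDemo {
    private final Map<String, String> cache = new HashMap<>();

    void put(String prompt, String response) {
        cache.put(prompt, response);
    }

    // Returns the cached response, or null when the exact string was never stored.
    String get(String prompt) {
        return cache.get(prompt);
    }

    public static void main(String[] args) {
        ExactMatchCacheDemo demo = new ExactMatchCacheDemo();
        demo.put("Password reset steps", "cached answer"); // placeholder response

        // Same wording: the lookup finds the entry.
        System.out.println("same wording hit: " + (demo.get("Password reset steps") != null));
        // Same intent, different wording: the lookup misses and the LLM is called again.
        System.out.println("paraphrase hit: " + (demo.get("How do I reset my password?") != null));
    }
}
```

The second lookup misses even though both prompts ask for the same thing, which is the gap a similarity-based lookup is meant to close.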

pgvector can perform native, hardware-accelerated cosine distance queries.

Intercept LLM calls at the framework level using Spring AI Advisors paired with a local embedding model and a pgvector-backed similarity search.

Implement a CallAroundAdvisor to transparently intercept prompts before they hit the external LLM provider.
Run a local embedding model (e.g., all-MiniLM-L6-v2) inside your JVM process to generate query embeddings in under 5ms, avoiding external network hops.
Query pgvector with an HNSW index, filtering results with a strict similarity threshold (e.g., > 0.96).

Here is how to implement a high-performance, reusable semantic cache advisor using Spring AI:

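// Assumes the Spring AI 1.0.0 milestone API (around M4); later releases rename several of these types and methods.
import java.util.List;
import java.util.Map;
import org.springframework.ai.chat.client.advisor.api.*;
import org.springframework.ai.chat.messages.AssistantMessage;
import org.springframework.ai.chat.model.ChatResponse;
import org.springframework.ai.chat.model.Generation;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.SearchRequest;
import org.springframework.ai.vectorstore.VectorStore;

public class SemanticCacheAdvisor implements CallAroundAdvisor {
    private final VectorStore vectorStore; // e.g. a PgVectorStore bean
    private final double similarityThreshold = 0.96;

    public SemanticCacheAdvisor(VectorStore vectorStore) { this.vectorStore = vectorStore; }

    @Override public String getName() { return "SemanticCacheAdvisor"; }
    @Override public int getOrder() { return 0; } // position in the advisor chain; adjust as needed

    @Override
    public AdvisedResponse aroundCall(AdvisedRequest request, CallAroundAdvisorChain chain) {
        String query = request.userText();
        var matches = vectorStore.similaritySearch(
            SearchRequest.query(query).withSimilarityThreshold(similarityThreshold).withTopK(1)
        );
        if (!matches.isEmpty()) {
            // Cache hit: rebuild a ChatResponse from the stored text and skip the LLM call.
            String cached = matches.get(0).getMetadata().get("cached_response").toString();
            var chatResponse = new ChatResponse(List.of(new Generation(new AssistantMessage(cached))));
            return new AdvisedResponse(chatResponse, request.adviseContext());
        }
        AdvisedResponse response = chain.nextAroundCall(request);
        // Cache miss: store the answer text as metadata on a document embedded from the query.
        String answer = response.response().getResult().getOutput().getContent();
        vectorStore.add(List.of(new Document(query, Map.of("cached_response", answer))));
        return response;
    }
}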
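The threshold check that the vector store performs on the advisor's behalf can be sketched without any Spring or database dependency. The following is an in-memory illustration only, not how pgvector is implemented: the class `InMemorySemanticCache`, the two-dimensional toy vectors, and the linear scan are all assumptions made for the example (a real cache would use 384-dimensional all-MiniLM-L6-v2 embeddings and an index).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Illustrative in-memory stand-in for a similarity-threshold cache lookup.
class InMemorySemanticCache {
    private final List<float[]> embeddings = new ArrayList<>();
    private final List<String> responses = new ArrayList<>();
    private final double threshold;

    InMemorySemanticCache(double threshold) {
        this.threshold = threshold;
    }

    // Cosine similarity: dot(a, b) / (|a| * |b|). Returns NaN for a zero vector.
    static double cosineSimilarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    void put(float[] embedding, String response) {
        embeddings.add(embedding.clone());
        responses.add(response);
    }

    // Returns the best match only if its similarity reaches the threshold (top-1 search).
    Optional<String> lookup(float[] queryEmbedding) {
        int best = -1;
        double bestSim = -1.0;
        for (int i = 0; i < embeddings.size(); i++) {
            double sim = cosineSimilarity(queryEmbedding, embeddings.get(i));
            if (sim > bestSim) {
                bestSim = sim;
                best = i;
            }
        }
        return (best >= 0 && bestSim >= threshold)
                ? Optional.of(responses.get(best))
                : Optional.empty();
    }

    public static void main(String[] args) {
        InMemorySemanticCache cache = new InMemorySemanticCache(0.96);
        cache.put(new float[] {1.0f, 0.0f}, "cached answer");
        // Nearly parallel vector: similarity is close to 1, so this is a hit.
        System.out.println(cache.lookup(new float[] {0.99f, 0.05f}).isPresent());
        // Similarity is about 0.8, below the threshold, so this is a miss.
        System.out.println(cache.lookup(new float[] {0.8f, 0.6f}).isPresent());
    }
}
```

A strict threshold such as 0.96 trades hit rate for safety: fewer prompts are served from cache, but a cached answer is less likely to be returned for a question that only looks similar.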

Use the Spring AI Advisor chain to handle semantic caching transparently without polluting your services.
Put HNSW indexes on pgvector columns to maintain sub-10ms query times as your cache grows to millions of rows.

I built javalld.com while prepping for senior roles: complete LLD problems with execution traces, not just theory.
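For readers who manage the schema themselves rather than letting Spring AI create it, the HNSW-indexed pgvector lookup could look roughly like the sketch below. The table name `semantic_cache`, its columns, the index name, and the helper class `PgVectorCacheSql` are hypothetical; the SQL uses pgvector's `vector` type, `hnsw` index method, `vector_cosine_ops` operator class, and `<=>` cosine distance operator, with the extension's default `m` and `ef_construction` values spelled out. The dimension 384 matches the output size of all-MiniLM-L6-v2.

```java
// Hypothetical helper holding pgvector DDL and query text for a semantic cache table.
class PgVectorCacheSql {
    // vector(384): all-MiniLM-L6-v2 produces 384-dimensional embeddings.
    static final String CREATE_TABLE =
        "CREATE EXTENSION IF NOT EXISTS vector;\n"
        + "CREATE TABLE IF NOT EXISTS semantic_cache (\n"
        + "    id BIGSERIAL PRIMARY KEY,\n"
        + "    prompt TEXT NOT NULL,\n"
        + "    response TEXT NOT NULL,\n"
        + "    embedding vector(384) NOT NULL\n"
        + ");";

    // HNSW index using cosine distance; m and ef_construction are pgvector's defaults.
    static final String CREATE_HNSW_INDEX =
        "CREATE INDEX IF NOT EXISTS semantic_cache_embedding_hnsw\n"
        + "    ON semantic_cache USING hnsw (embedding vector_cosine_ops)\n"
        + "    WITH (m = 16, ef_construction = 64);";

    // <=> returns cosine distance, so similarity = 1 - distance.
    // Both parameters take the same vector literal, e.g. "[0.1,0.2,...]".
    static final String NEAREST_NEIGHBOUR =
        "SELECT response, 1 - (embedding <=> ?::vector) AS similarity\n"
        + "FROM semantic_cache\n"
        + "ORDER BY embedding <=> ?::vector\n"
        + "LIMIT 1;";

    // Converts a similarity threshold (e.g. 0.96) into the equivalent maximum cosine distance.
    static double maxCosineDistance(double similarityThreshold) {
        return 1.0 - similarityThreshold;
    }

    // Formats an embedding as a pgvector text literal: [v1,v2,...].
    static String toVectorLiteral(float[] embedding) {
        StringBuilder sb = new StringBuilder("[");
        for (int i = 0; i < embedding.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(embedding[i]);
        }
        return sb.append(']').toString();
    }

    public static void main(String[] args) {
        System.out.println(CREATE_HNSW_INDEX);
        System.out.println(toVectorLiteral(new float[] {0.5f, -1.0f, 2.0f}));
    }
}
```

In this sketch the application compares the returned `similarity` against the threshold after the query, so the `ORDER BY ... LIMIT 1` shape stays simple enough for the HNSW index to serve it.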
