Stop Wasting LLM Budgets: High-Performance Semantic Caching with Spring AI and pgvector


Source: dev.to · Published: 2026-06-21T07:17Z · Topic: large-language-models


A developer built a high-performance semantic caching system for LLM calls using Spring AI and pgvector, intercepting prompts with a CallAroundAdvisor and a local embedding model to generate query embeddings in under 5ms. The system uses pgvector with an HNSW index and a similarity threshold of 0.96 to serve cached responses, reducing duplicate API calls and saving costs.

1 min read · Published Jun 21, 2026

Your enterprise is likely bleeding thousands of dollars on duplicate LLM API calls because your Redis cache fails when a user asks "How do I reset my password?" instead of "Password reset steps." In 2026, relying on exact-string matching for LLM caching is a rookie mistake that kills both your latency and your budget.

pgvector can perform native, hardware-accelerated cosine distance queries.

Intercept LLM calls at the framework level using Spring AI Advisors paired with a local embedding model and a pgvector-backed similarity search.

- Use a CallAroundAdvisor to transparently intercept prompts before they hit the external LLM provider.
- Run a local embedding model (e.g., all-MiniLM-L6-v2) inside your JVM process to generate query embeddings in under 5ms, avoiding external network hops.
- Store the embeddings in pgvector with an HNSW index, filtering results with a strict similarity threshold (e.g., > 0.96).

Here is how to implement a high-performance, reusable semantic cache advisor using Spring AI:

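import java.util.List;
import java.util.Map;
import org.springframework.ai.chat.client.advisor.api.*;
import org.springframework.ai.chat.messages.AssistantMessage;
import org.springframework.ai.chat.model.*;
import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.*;

// Targets the Spring AI 1.0.0 milestone API; later releases renamed several of these types.
public class SemanticCacheAdvisor implements CallAroundAdvisor {
    private final VectorStore vectorStore; // e.g. a PgVectorStore bean
    private final double similarityThreshold = 0.96;

    public SemanticCacheAdvisor(VectorStore vectorStore) { this.vectorStore = vectorStore; }

    @Override public String getName() { return "SemanticCacheAdvisor"; }
    @Override public int getOrder() { return 0; }

    @Override
    public AdvisedResponse aroundCall(AdvisedRequest request, CallAroundAdvisorChain chain) {
        String query = request.userText();
        var matches = vectorStore.similaritySearch(
            SearchRequest.query(query).withSimilarityThreshold(similarityThreshold).withTopK(1)
        );
        if (!matches.isEmpty()) {
            // Cache hit: rebuild a response from the stored text and skip the LLM call.
            String cached = matches.get(0).getMetadata().get("cached_response").toString();
            var generation = new Generation(new AssistantMessage(cached));
            return new AdvisedResponse(new ChatResponse(List.of(generation)), request.adviseContext());
        }
        // Cache miss: call the LLM, then store its answer under the query's embedding.
        AdvisedResponse response = chain.nextAroundCall(request);
        String answer = response.response().getResult().getOutput().getContent();
        vectorStore.add(List.of(new Document(query, Map.of("cached_response", answer))));
        return response;
    }
}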
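
To make the similarity threshold concrete, here is a minimal plain-Java sketch of the lookup that the vector store performs on the advisor's behalf. It is not part of Spring AI or pgvector: the class name InMemorySemanticCache, its methods, and the linear scan are assumptions made for illustration, and a real deployment would delegate the nearest-neighbour search to the database index. The sketch treats a score at or above the threshold as a hit.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical in-memory stand-in for the vector store, for illustration only.
public class InMemorySemanticCache {
    // One cached entry: the embedding of the original query plus the stored answer.
    public record Entry(float[] embedding, String response) {}

    private final List<Entry> entries = new ArrayList<>();
    private final double threshold;

    public InMemorySemanticCache(double threshold) {
        this.threshold = threshold;
    }

    // Cosine similarity: dot product divided by the product of the vector norms.
    public static double cosineSimilarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public void put(float[] embedding, String response) {
        entries.add(new Entry(embedding, response));
    }

    // Returns the best-matching cached response only if it clears the threshold.
    public Optional<String> lookup(float[] queryEmbedding) {
        Entry best = null;
        double bestScore = -1;
        for (Entry e : entries) {
            double score = cosineSimilarity(queryEmbedding, e.embedding());
            if (score > bestScore) {
                bestScore = score;
                best = e;
            }
        }
        return (best != null && bestScore >= threshold)
                ? Optional.of(best.response())
                : Optional.empty();
    }
}
```

With a threshold of 0.96, only near-paraphrases of a stored query are served from the cache. Lowering the threshold raises the hit rate but also the risk of returning an answer to a different question.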

- Use the Spring AI Advisor chain to handle semantic caching transparently without polluting your services.
- Put an HNSW index on your pgvector columns to maintain sub-10ms query times as your cache grows to millions of rows.

I built [javalld.com] while prepping for senior roles: complete LLD problems with execution traces, not just theory.
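
The HNSW index and cosine-distance lookup mentioned in the takeaways can be sketched as the SQL a Java service might send to PostgreSQL. This is an illustrative sketch, not the schema Spring AI generates: the table name semantic_cache, the columns content, metadata, and embedding, and the class PgVectorCacheSql are all assumptions. The pgvector pieces it relies on are the hnsw index method, the vector_cosine_ops operator class, and the <=> cosine-distance operator.

```java
// Hypothetical helper holding the SQL for a pgvector-backed cache table.
public class PgVectorCacheSql {
    // Assumed table and column names; adjust to the schema your vector store uses.
    static final String TABLE = "semantic_cache";
    static final String COLUMN = "embedding";

    // HNSW index built with pgvector's cosine-distance operator class.
    public static String createIndexSql() {
        return "CREATE INDEX IF NOT EXISTS " + TABLE + "_hnsw_idx ON " + TABLE
                + " USING hnsw (" + COLUMN + " vector_cosine_ops)";
    }

    // <=> is pgvector's cosine-distance operator. Both parameters are bound to
    // the same query embedding, passed as a text literal such as "[0.1,0.2]".
    public static String nearestNeighborSql() {
        return "SELECT content, metadata, " + COLUMN + " <=> ?::vector AS distance"
                + " FROM " + TABLE
                + " ORDER BY " + COLUMN + " <=> ?::vector LIMIT 1";
    }

    // Cosine distance is 1 - cosine similarity, so a similarity threshold
    // of 0.96 corresponds to a maximum distance of about 0.04.
    public static double maxDistance(double similarityThreshold) {
        return 1.0 - similarityThreshold;
    }
}
```

The query returns only the single nearest row, and the caller compares its distance against maxDistance(0.96) in application code. Keeping the statement in ORDER BY ... LIMIT form is what lets the planner use the HNSW index.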


Read original on dev.to → dev.to/machinecodingmaster/stop-wasting-llm-budg…
