Show HN: Khazad – Transparent Semantic Cache for LLM Calls on Redis Vector Sets

Khazad is a transparent semantic cache for LLM API calls built on Redis Vector Sets. It intercepts HTTP traffic at the transport layer, so the code that makes the calls does not change. At an illustrative 0.50 hit rate it makes about 50% fewer API calls, and each cache hit returns about 96% faster than an upstream call. The tool supports model-aware and conversation-aware caching and streaming in both directions, and is designed for high-volume repetitive traffic such as FAQ bots and CI environments.
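The headline percentages follow from the three illustrative inputs quoted in the README text: a 0.50 hit rate, a 280ms cache replay, and a ~7900ms upstream call. The worked check below uses only those numbers. The function and key names are made up for illustration, and the blended figure assumes a miss costs the same as an uncached call.

```python
def cache_savings(hit_rate: float, hit_ms: float, upstream_ms: float) -> dict:
    """Derive headline figures from a hit rate and two latencies.

    Assumption: a miss costs the same as an uncached upstream call
    (lookup overhead on misses is ignored).
    """
    # Every hit is one upstream call that never happens.
    fewer_calls = hit_rate
    # Speed-up of a single hit relative to going upstream.
    faster_on_hit = 1 - hit_ms / upstream_ms
    # Mean latency over all requests, hits and misses together.
    mean_ms = hit_rate * hit_ms + (1 - hit_rate) * upstream_ms
    return {
        "fewer_calls": fewer_calls,
        "faster_on_hit": faster_on_hit,
        "mean_latency_ms": mean_ms,
        "mean_latency_reduction": 1 - mean_ms / upstream_ms,
    }


figures = cache_savings(hit_rate=0.50, hit_ms=280, upstream_ms=7900)
print({name: round(value, 3) for name, value in figures.items()})
```

The ~96% figure applies to individual hits. Averaged over all traffic at this hit rate, mean latency falls by a little under half.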

Transparent, transport-layer semantic cache for LLM API calls powered by Redis Vector Sets.

~50% fewer API calls · ~96% faster on hits · ~50% lower spend · 100% transparent

Illustrative figures at a 0.50 hit rate (280ms cache replay vs. ~7900ms upstream call). Your numbers depend on traffic.

Khazad intercepts LLM HTTP traffic at the transport layer and serves semantically equivalent requests from a Redis vector cache, with zero changes to your application code.

Key properties:

- **Model-aware**: each (provider, model) pair gets its own vector set, so a gpt-4o answer is never served to a gpt-4o-mini call, no matter how similar the prompt. Set `cache_scope="host"` to scope by provider host only, letting every model or deployment on the same provider share one cache (different providers stay isolated; see Configuration below).
- **Conversation-aware**: the whole message list (system, user, assistant) is embedded, not just the last user turn. Two different conversations ending with the same follow-up question ("What about its population?") never collide.
- **Streaming both ways**: cache hits replay as real SSE streams (sync and async clients); cache misses that stream are captured chunk-by-chunk with no added latency and reassembled into a canonical JSON response, so a streamed answer can later serve a non-streamed request and vice versa. Aborted streams are never cached.

Semantic caching trades exactness for cost and latency. Know the trade before turning it on.

Use it when you have:

- High-volume, repetitive traffic: FAQ bots, support assistants, RAG front-ends where many users ask near-identical questions
- Dev / test / CI environments: stop paying for the same prompt on every run
- Demos and load tests where deterministic, instant responses are a feature
- Cost ceilings on internal tools

Operational caveats:

- **Privacy**: prompts are embedded and responses are stored in clear text in Redis.
  If prompts may contain PII or secrets, set a `ttl`, enable Redis AUTH/TLS, and treat the Redis instance with the same care as your logs.
- **Process-wide patch**: Khazad wraps every `httpx.Client` / `AsyncClient` created after `init()`. Non-LLM httpx traffic passes through untouched, but the patch is process-global. Call `stop()` on shutdown. Use `hosts=[...]` to restrict interception to the endpoints you actually want cached.
- **httpx-only**: SDKs built on httpx are covered (OpenAI, Anthropic, Gemini via `google-genai`, Mistral, and most proxies). SDKs using `requests`, `aiohttp`, or `boto3` (AWS Bedrock) are not intercepted.
- **Single process**: the patch lives in the Python process that called `init()`. Multiple workers share the Redis cache but each needs its own `init()`.
- **False-positive control**: start at `threshold=0.90` and raise it if you see wrong hits. Watch `avg_hit_similarity` in `get_stats()`; if it sits near your threshold, your traffic may be too diverse to cache safely.

Requirements:

- Python >= 3.10
- Redis 8 (Vector Sets support required)

```bash
docker run -d --name redis8 -p 6379:6379 redis:8
```

From PyPI:

```bash
uv add khazad
```

For the OpenAI embedding backend (optional):

```bash
uv add "khazad[openai-embeddings]"
```

Local / development install:

```bash
git clone https://github.com/GuglielmoCerri/khazad.git
cd khazad
uv sync --group dev
```

`uv sync` reads `pyproject.toml`, creates `.venv` if it doesn't exist, and installs the project itself in editable mode, so no separate `pip install -e .` is needed.

To use the local checkout from another project:

```bash
uv add --editable /path/to/khazad
```

Use the `Khazad` class directly when you need explicit control over the instance, which is useful in long-running services, tests, or dependency injection:

```python
from khazad import Khazad

cache = Khazad(
    redis_url="redis://localhost:6379",
    threshold=0.90,
    ttl=3600,
    log_level="DEBUG",
)

print(cache.is_active())  # True
print(cache.get_stats())  # Stats(total_requests=0, ...)
```

Available functions: `init()`, `stop()`, `get_stats()`, `flush()`, `is_active()`. See the API reference below for details.
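The `threshold` passed to the constructor above is a cosine-similarity cutoff. The sketch below illustrates that decision in plain Python with toy 3-dimensional vectors. It is not Khazad's code: Khazad delegates the search to Redis Vector Sets and uses a real embedding model. `cosine_similarity`, `lookup`, and the sample vectors are hypothetical names and data chosen for illustration, and the `>=` comparison is an assumption, since the README does not say whether the cutoff is inclusive.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def lookup(query_vec, entries, threshold=0.90):
    """Return (response, similarity) for the closest entry, or None on a miss."""
    best = max(entries, key=lambda e: cosine_similarity(query_vec, e[0]), default=None)
    if best is None:
        return None  # empty cache
    score = cosine_similarity(query_vec, best[0])
    # Assumption: inclusive cutoff.
    return (best[1], score) if score >= threshold else None


# Toy cache of (embedding, cached response) pairs.
# Real embeddings have hundreds of dimensions.
entries = [
    ([1.0, 0.0, 0.0], "Rome"),
    ([0.0, 1.0, 0.0], "Madrid"),
]

near = lookup([0.95, 0.10, 0.0], entries)  # points almost the same way as "Rome"
far = lookup([0.60, 0.60, 0.5], entries)   # roughly equidistant from both entries

print(near)
print(far)
```

Raising `threshold` shrinks the set of queries that count as `near`, which is the lever the false-positive caveat refers to.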
Khazad activates once and intercepts every LLM SDK that uses httpx underneath, with no per-provider wiring needed. For further examples see the [examples folder](https://github.com/GuglielmoCerri/khazad/tree/main/examples). Pick the provider you use:

**OpenAI**: official SDK against `api.openai.com`

```python
import os
import time

from openai import OpenAI

from khazad import Khazad

cache = Khazad(redis_url="redis://localhost:6379", threshold=0.90, log_level="DEBUG")
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompt = "What is the capital of Italy?"

for i in range(2):
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = (time.perf_counter() - start) * 1000  # seconds to milliseconds
    print(f"[call {i + 1}] {elapsed:.1f}ms - {response.choices[0].message.content}")

print(cache.get_stats().to_dict())
cache.stop()
```

Matches `/chat/completions` and `/responses` paths. Streaming requests are also cached.

**Azure OpenAI**: Azure deployments with Entra ID auth via the `AzureOpenAI` SDK client

```python
import os
import time

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

from khazad import CacheScope, Khazad

cache = Khazad(
    redis_url="redis://localhost:6379",
    threshold=0.90,
    cache_scope=CacheScope.HOST,
    namespace="azure_openai_example",
)

endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
deployment = os.environ.get("AZURE_OPENAI_DEPLOYMENT", "gpt-4.1")

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)
api_version = os.environ.get("AZURE_OPENAI_API_VERSION", "2024-12-01-preview")

client = AzureOpenAI(
    api_version=api_version,
    azure_endpoint=endpoint,
    azure_ad_token_provider=token_provider,
)

prompt = "What is the capital of Spain?"

for i in range(2):
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = (time.perf_counter() - start) * 1000  # seconds to milliseconds
    print(f"[call {i + 1}] {elapsed:.1f}ms - {response.choices[0].message.content}")

print(cache.get_stats().to_dict())
cache.stop()
```

It authenticates with Microsoft Entra ID (`DefaultAzureCredential`), so no API key is needed, and uses `cache_scope=CacheScope.HOST` so every deployment on the same Azure resource shares one cache. API-key auth works too: Khazad matches the request path (`/chat/completions`), not the auth method or host.

**OpenAI-compatible proxies**: LiteLLM, vLLM, Ollama, …

```python
import time

from openai import OpenAI

from khazad import Khazad

cache = Khazad(redis_url="redis://localhost:6379", threshold=0.90, namespace="ollama_example")
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

model = "llama3"
prompt = "What is the capital of Spain?"

for i in range(2):
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = (time.perf_counter() - start) * 1000  # seconds to milliseconds
    print(f"[call {i + 1}] {elapsed:.1f}ms - {response.choices[0].message.content}")

print(cache.get_stats().to_dict())
cache.stop()
```

Any host whose URL path ends with `/chat/completions` or `/responses` is cached. Covers vLLM (`http://host:8000/v1/...`), Ollama (`http://localhost:11434/v1/...`), Mistral, etc.

**Anthropic**: Claude via the official SDK

```python
import os
import time

from anthropic import Anthropic

from khazad import Khazad

cache = Khazad(redis_url="redis://localhost:6379", threshold=0.90, namespace="anthropic_example")
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

model = "claude-haiku-4-5-20251001"
prompt = "What is the capital of France?"

for i in range(2):
    start = time.perf_counter()
    message = client.messages.create(
        model=model,
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = (time.perf_counter() - start) * 1000  # seconds to milliseconds
    print(f"[call {i + 1}] {elapsed:.1f}ms - {message.content[0].text}")

print(cache.get_stats().to_dict())
cache.stop()
```

Matches `api.anthropic.com/v1/messages`. Streaming responses are replayed from cache as SSE.

**Google Gemini**: `google-genai` SDK

```python
import os
import time

from google import genai

from khazad import Khazad

cache = Khazad(redis_url="redis://localhost:6379", threshold=0.90)
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

for i in range(2):
    start = time.perf_counter()
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents="What is the capital of Italy?",
    )
    elapsed = (time.perf_counter() - start) * 1000  # seconds to milliseconds
    print(f"[call {i + 1}] {elapsed:.1f}ms - {response.text}...")

print(cache.get_stats().to_dict())
cache.stop()
```

Matches `generativelanguage.googleapis.com/*/models/*:generateContent`. Gemini streaming (`:streamGenerateContent`) passes through uncached.

| Provider | URL pattern matched | Streaming |
|---|---|---|
| OpenAI Chat Completions | any host, path ending `/chat/completions` | cached + replayed |
| OpenAI Responses API | any host, path ending `/responses` | cached + replayed |
| Azure OpenAI | covered by the `/chat/completions` matcher | cached + replayed |
| OpenAI-compatible proxies | covered by the `/chat/completions` matcher | cached + replayed |
| Anthropic | `api.anthropic.com/v1/messages` | cached + replayed |
| Google Gemini | `generativelanguage.googleapis.com/*:generateContent` | pass-through |

The module exposes five functions as the singleton API. The `Khazad` class exposes the same surface as instance methods (no `init`; instantiation does that).

| Function | Description |
|---|---|
| `init(...)` | Activate the global singleton: builds the embedder, connects to Redis, installs the httpx transport patch. Required before any LLM traffic; calling twice without `stop()` is a no-op. See the parameters below. |
| `stop()` | Restore original httpx transports, close Redis, and clear the singleton. Idempotent. Cached data in Redis stays. |
| `get_stats()` | Returns a thread-safe snapshot of cache metrics (requests, hits, misses, hit rate, avg similarity); the examples above convert it to a dictionary with `.to_dict()`. |
| `flush()` | Clear all cached entries in the current namespace and reset stats counters. Destructive. |
| `is_active()` | Returns `True` if Khazad is currently running (initialized and not stopped). |

All parameters are the same whether you use `khazad.init(...)` or `Khazad(...)`:

```python
Khazad(
    redis_url="redis://localhost:6379",
    threshold=0.90,
    ttl=3600,
    namespace="khazad",
    embedder="huggingface",
    embedding_model="redis/langcache-embed-v2",
    log_level="INFO",
    hosts=None,
    cache_scope="model",
)
```

| Parameter | Default | Description |
|---|---|---|
| `redis_url` | `"redis://localhost:6379"` | Connection URL for the Redis 8 instance that stores vectors and cached responses. |
| `threshold` | `0.90` | Cosine similarity threshold (0.0-1.0) above which a request counts as a cache hit. |
| `ttl` | `3600` | Time-to-live in seconds for cached response bodies; `None` means no expiry. |
| `namespace` | `"khazad"` | Prefix for all Redis keys, isolating this cache from other data and other namespaces. |
| `embedder` | `"huggingface"` | Embedding backend: `"huggingface"` (free, local) or `"openai"` (paid API). |
| `embedding_model` | `"redis/langcache-embed-v2"` | Model used to embed prompts; must match the chosen `embedder`. |
| `log_level` | `"INFO"` | Logging verbosity: `DEBUG`, `INFO`, `WARNING`, or `ERROR`. |
| `hosts` | `None` | Opt-in host allowlist; `None` means all matching hosts (see below). |
| `cache_scope` | `"model"` | Cache partitioning: `"model"` (per host and model) or `"host"` (per provider; see below). |

**`hosts`: opt-in allowlist.** By default Khazad considers traffic to any host that matches a provider URL pattern. Pass an explicit allowlist to restrict interception to the endpoints you intend to cache; everything else passes through untouched. Supports exact hosts and `*.` wildcard subdomains:

```python
import khazad

khazad.init(hosts=["api.openai.com", "*.openai.azure.com"])
```

**`cache_scope`: share one cache across a provider's models.** Driven by the `CacheScope` enum (importable from `khazad`); the string values `"model"` and `"host"` are accepted too. By default (`CacheScope.MODEL`) each (host, model) pair gets its own vector set, so a gpt-4o answer never serves a gpt-4o-mini call. Set it to `CacheScope.HOST` to scope by host only; every model or deployment on the same provider then shares a single cache:

```python
import khazad
from khazad import CacheScope

khazad.init(cache_scope=CacheScope.HOST)  # or cache_scope="host"
```

The host always stays part of the scope, so different providers never mix (an Azure OpenAI response is never replayed to a Gemini client). Use it only for format-compatible pools, e.g. multiple Azure OpenAI deployments, or treating gpt-4o and gpt-4o-mini as interchangeable. The trade-off is semantic: a smaller model may serve an answer originally produced by a larger one.

Threshold guidance:

- `0.95+`: strict, near-identical prompts only
- `0.90`: recommended default
- `0.85`: aggressive, higher hit rate at a greater risk of wrong hits

TTL: the response body expires in Redis after `ttl` seconds. Khazad prunes the matching vector automatically the next time it is found without a body, so expired entries clean themselves up.

Embedding backends:

| Backend | Cost | Notes |
|---|---|---|
| `huggingface` (default) | Free | Downloads model on first use |
| `openai` | Paid | `uv add "khazad[openai-embeddings]"` |

Khazad emits a log line for every intercepted request, so you can watch cache behaviour in real time. A hit reports the cosine similarity that triggered it and the replay latency; a miss notes that the request was forwarded upstream. Set `log_level` to `DEBUG` for per-request detail, or keep it at `INFO` for just hits and misses.

```
Khazad CACHE HIT - Similarity: 0.94 - Latency: 4ms
Khazad CACHE MISS - Forwarding to API
```

For aggregate metrics, call `get_stats()` at any time.
It returns a thread-safe snapshot of total requests, hits, misses, the resulting hit rate, and the average similarity of served hits, which suits periodic logging or exposing as a Prometheus gauge. Watch `avg_hit_similarity`: if it hovers near your `threshold`, your traffic may be too diverse to cache safely and you should raise the threshold.

```python
cache.get_stats().to_dict()
# {'total_requests': 1000, 'cache_hits': 720, 'cache_misses': 280, 'hit_rate': 0.72, 'avg_hit_similarity': 0.943}
```

Contributions welcome: see [CONTRIBUTING.md](https://github.com/GuglielmoCerri/khazad/blob/main/CONTRIBUTING.md). Please read it before opening a pull request, as it covers the branching model, coding conventions, and the lint and test checks your changes are expected to pass. The unit and integration suites use fake embedders and mock transports, so the full test run needs neither a live Redis instance nor real API keys.
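Returning to the `get_stats()` snapshot: the advice to watch average hit similarity against the threshold can be turned into a small periodic check. This is a minimal sketch that works on a plain dict shaped like the example output. The underscore key names are an assumption about the original identifiers, and the function name `similarity_headroom_warning` and the `margin` of 0.02 are hypothetical choices for illustration, not part of Khazad.

```python
def similarity_headroom_warning(stats: dict, threshold: float, margin: float = 0.02):
    """Return a warning string when hits cluster just above the threshold, else None.

    `stats` is a dict shaped like the get_stats().to_dict() example output.
    `margin` is an arbitrary illustrative value, not a Khazad setting.
    """
    if stats["cache_hits"] == 0:
        return None  # no hits yet, nothing to judge
    headroom = stats["avg_hit_similarity"] - threshold
    if headroom < margin:
        return (
            f"avg hit similarity {stats['avg_hit_similarity']:.3f} is within "
            f"{margin:.2f} of threshold {threshold:.2f}; consider raising it"
        )
    return None


snapshot = {
    "total_requests": 1000,
    "cache_hits": 720,
    "cache_misses": 280,
    "hit_rate": 0.72,
    "avg_hit_similarity": 0.943,
}

print(similarity_headroom_warning(snapshot, threshold=0.90))  # headroom of about 0.043
print(similarity_headroom_warning(snapshot, threshold=0.93))  # headroom of about 0.013
```

In a service, a check like this could run on a timer next to whatever already logs the stats snapshot.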