Show HN: Khazad – Transparent Semantic Cache for LLM Calls on Redis Vector Sets


Transparent, transport-layer semantic cache for LLM API calls powered by Redis Vector Sets.

~50% fewer API calls · ~96% faster on hits · ~50% lower spend · 100% transparent

Illustrative figures at a 0.50 hit rate (280ms cache replay vs. ~7900ms upstream call). Your numbers depend on traffic.
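The headline figures follow from simple arithmetic on the hit rate and the two latencies. A minimal sketch using only the illustrative numbers quoted above; the function name `cache_impact` is made up for this example and is not part of Khazad:

```python
def cache_impact(hit_rate: float, hit_ms: float, upstream_ms: float) -> dict:
    """Derive the headline figures from a hit rate and two latencies."""
    # Every hit is one upstream call (and its cost) avoided.
    fewer_calls = hit_rate
    # Speed-up on a single hit, relative to calling the provider.
    faster_on_hit = 1 - hit_ms / upstream_ms
    # Average latency across hits and misses together.
    blended_ms = hit_rate * hit_ms + (1 - hit_rate) * upstream_ms
    return {
        "fewer_calls": fewer_calls,
        "faster_on_hit": faster_on_hit,
        "blended_ms": blended_ms,
    }


impact = cache_impact(hit_rate=0.50, hit_ms=280, upstream_ms=7900)
print(f"{impact['fewer_calls']:.0%} fewer calls")
print(f"{impact['faster_on_hit']:.0%} faster on hits")
print(f"{impact['blended_ms']:.0f} ms average latency")
```

Note that the ~96% figure applies to hits only. At a 0.50 hit rate the blended average is 4090 ms, roughly half the upstream latency.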

Khazad intercepts LLM HTTP traffic at the transport layer and serves semantically equivalent requests from a Redis vector cache, with zero changes to your application code.

Key properties:

Model-aware: each `(provider, model)` pair gets its own vector set, so a `gpt-4o` answer is never served to a `gpt-4o-mini` call, no matter how similar the prompt. Set `cache_scope="host"` to scope by provider host only, letting every model or deployment on the same provider share one cache (different providers stay isolated; see Configuration).
Conversation-aware: the whole message list (system, user, assistant) is embedded, not just the last user turn. Two different conversations ending with the same follow-up question ("What about its population?") never collide.
Streaming both ways: cache hits replay as real SSE streams (sync and async clients); cache misses that stream are captured chunk-by-chunk with no added latency and reassembled into a canonical JSON response, so a streamed answer can later serve a non-streamed request and vice versa. Aborted streams are never cached.
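To see why embedding the whole message list avoids collisions, consider the text that would be handed to the embedder. This is only a sketch: `conversation_text` and its `role: content` layout are assumptions for illustration, not the serialization Khazad actually uses, and message content is assumed to be a plain string.

```python
def conversation_text(messages: list) -> str:
    """Flatten a chat message list into one string to embed (hypothetical format)."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)


follow_up = {"role": "user", "content": "What about its population?"}

italy = [
    {"role": "user", "content": "What is the capital of Italy?"},
    {"role": "assistant", "content": "Rome."},
    follow_up,
]
spain = [
    {"role": "user", "content": "What is the capital of Spain?"},
    {"role": "assistant", "content": "Madrid."},
    follow_up,
]

# Same final turn, but the embedding input differs because the history differs.
print(conversation_text(italy) == conversation_text(spain))
# Embedding only the last turn would make both conversations identical.
print(italy[-1]["content"] == spain[-1]["content"])
```

The two conversations share their last user turn yet produce different embedding inputs, so they map to different points in the vector set.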

Semantic caching trades exactness for cost and latency. Know the trade before turning it on. Use it when you have:

High-volume, repetitive traffic: FAQ bots, support assistants, RAG front-ends where many users ask near-identical questions
Dev / test / CI environments — stop paying for the same prompt on every run
Demos and load tests where deterministic, instant responses are a feature
Cost ceilings on internal tools

Operational caveats:

Privacy: prompts are embedded and responses are stored **in clear text in Redis**. If prompts may contain PII or secrets, set a `ttl`, enable Redis AUTH/TLS, and treat the Redis instance with the same care as your logs.
Process-wide patch: Khazad wraps every `httpx.Client` / `AsyncClient` created after `init()`. Non-LLM httpx traffic passes through untouched, but the patch is process-global. Call `stop()` on shutdown. Use `hosts=[...]` to restrict interception to the endpoints you actually want cached.
httpx-only: SDKs built on `httpx` are covered (OpenAI, Anthropic, Gemini via `google-genai`, Mistral, and most proxies). SDKs using `requests`, `aiohttp`, or `boto3` (AWS Bedrock) are not intercepted.
Single process: the patch lives in the Python process that called `init()`. Multiple workers share the Redis cache but each needs its own `init()`.
False-positive control: start at `threshold=0.90` and raise it if you see wrong hits. Watch `avg_hit_similarity` in `get_stats()`: if it sits near your threshold, your traffic may be too diverse to cache safely.

Python >= 3.10
Redis 8 (Vector Sets support required)

docker run -d --name redis8 -p 6379:6379 redis:8

From PyPI:

uv add khazad

For the OpenAI embedding backend (optional):

uv add khazad[openai-embeddings]

Local / development install:

git clone https://github.com/GuglielmoCerri/khazad.git
cd khazad
uv sync --group dev

`uv sync` reads `pyproject.toml`, creates `.venv` if it doesn't exist, and installs the project itself in editable mode, so no separate `pip install -e .` is needed.

To use the local checkout from another project:

uv add --editable /path/to/khazad

Use the `Khazad` class directly when you need explicit control over the instance, which is useful in long-running services, tests, or dependency injection:

from khazad import Khazad

cache = Khazad(
    redis_url="redis://localhost:6379",
    threshold=0.90,
    ttl=3600,
    log_level="DEBUG",
)

print(cache.is_active())   # True
print(cache.get_stats())   # Stats(total_requests=0, ...)

Available functions: `init()`, `stop()`, `get_stats()`, `flush()`, `is_active()`. See API Reference for details.

Khazad activates once and intercepts every LLM SDK that uses `httpx` underneath, with no per-provider wiring needed. For further examples see the examples folder.

Pick the provider you use:

OpenAI — official SDK against api.openai.com

import os
import time

from openai import OpenAI

from khazad import Khazad

cache = Khazad(redis_url="redis://localhost:6379", threshold=0.90, log_level="DEBUG")

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

prompt = "What is the capital of Italy?"

for i in range(2):
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = (time.perf_counter() - start) * 1000
    print(f"[call {i + 1}] {elapsed:.1f}ms — {response.choices[0].message.content}")

print(cache.get_stats().to_dict())
cache.stop()

Matches `*/chat/completions` and `*/responses` paths. Streaming requests are also cached.
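Caching a streamed response means folding the chunks back into one non-streamed body. The sketch below does this for the OpenAI chat completions chunk format (`data: {...}` lines ending with `data: [DONE]`). It is an illustration of the idea, not Khazad's capture code; `reassemble_sse` is a hypothetical name and only the first choice is handled.

```python
import json


def reassemble_sse(raw: str) -> dict:
    """Fold OpenAI-style chat completion chunks into one non-streamed message."""
    parts = []
    finish_reason = None
    completed = False
    for line in raw.splitlines():
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            completed = True
            break
        choices = json.loads(payload).get("choices") or []
        if not choices:
            continue  # e.g. a trailing usage-only chunk
        choice = choices[0]
        parts.append(choice.get("delta", {}).get("content") or "")
        finish_reason = choice.get("finish_reason") or finish_reason
    if not completed:
        # Mirrors the rule that aborted streams are never cached.
        raise ValueError("stream ended before [DONE]")
    return {
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": "".join(parts)},
            "finish_reason": finish_reason,
        }]
    }


chunks = [
    {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": "Ro"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {"content": "me"}, "finish_reason": None}]},
    {"choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}]},
]
stream = "".join(f"data: {json.dumps(c)}\n\n" for c in chunks) + "data: [DONE]\n\n"

response = reassemble_sse(stream)
print(response["choices"][0]["message"]["content"])
```

Storing the reassembled form is what lets a streamed answer later serve a non-streamed request; replaying it as SSE is the reverse transformation.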

Azure OpenAI — Azure deployments with Entra ID auth via the `AzureOpenAI` SDK

import os
import time

from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

from khazad import CacheScope, Khazad

cache = Khazad(
    redis_url="redis://localhost:6379",
    threshold=0.90,
    cache_scope=CacheScope.HOST,
    namespace="azure_openai_example",
)

endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
deployment = os.environ.get("AZURE_OPENAI_DEPLOYMENT", "gpt-4.1")
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)
api_version = os.environ.get("AZURE_OPENAI_API_VERSION", "2024-12-01-preview")

client = AzureOpenAI(
    api_version=api_version,
    azure_endpoint=endpoint,
    azure_ad_token_provider=token_provider,
)

prompt = "What is the capital of Spain?"

for i in range(2):
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = (time.perf_counter() - start) * 1000
    print(f"[call {i + 1}] {elapsed:.1f}ms — {response.choices[0].message.content}")

print(cache.get_stats().to_dict())
cache.stop()

It authenticates with Microsoft Entra ID (`DefaultAzureCredential`), so no API key is needed, and uses `cache_scope=CacheScope.HOST` so every deployment on the same Azure resource shares one cache. API-key auth works too: Khazad matches the request path (`/chat/completions`), not the auth method or host.

OpenAI-compatible proxies — LiteLLM, vLLM, Ollama, …

import time

from openai import OpenAI

from khazad import Khazad

cache = Khazad(redis_url="redis://localhost:6379", threshold=0.90, namespace="ollama_example")

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
model = "llama3"

prompt = "What is the capital of Spain?"

for i in range(2):
    start = time.perf_counter()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = (time.perf_counter() - start) * 1000
    print(f"[call {i + 1}] {elapsed:.1f}ms — {response.choices[0].message.content}")

print(cache.get_stats().to_dict())
cache.stop()

Any host whose URL path ends with `/chat/completions` or `/responses` is cached. Covers vLLM (`http://host:8000/v1/...`), Ollama (`http://localhost:11434/v1/...`), Mistral, etc.

Anthropic — Claude via official SDK

import os
import time

from anthropic import Anthropic

from khazad import Khazad

cache = Khazad(redis_url="redis://localhost:6379", threshold=0.90, namespace="anthropic_example")

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
model = "claude-haiku-4-5-20251001"

prompt = "What is the capital of France?"

for i in range(2):
    start = time.perf_counter()
    message = client.messages.create(
        model=model,
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = (time.perf_counter() - start) * 1000
    print(f"[call {i + 1}] {elapsed:.1f}ms — {message.content[0].text}")

print(cache.get_stats().to_dict())
cache.stop()

Matches `api.anthropic.com/v1/messages`. Streaming responses are replayed from cache as SSE.

Google Gemini — `google-genai` SDK

import os
import time

from google import genai

from khazad import Khazad

cache = Khazad(redis_url="redis://localhost:6379", threshold=0.90)
client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

for i in range(2):
    start = time.perf_counter()
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents="What is the capital of Italy?",
    )
    elapsed = (time.perf_counter() - start) * 1000
    print(f"[call {i + 1}] {elapsed:.1f}ms — {response.text}...")

print(cache.get_stats().to_dict())
cache.stop()

Matches `generativelanguage.googleapis.com/*/models/*:generateContent`. Gemini streaming (`:streamGenerateContent`) passes through uncached.

Provider	URL pattern matched	Streaming
OpenAI Chat Completions	any host, path ending `/chat/completions`	cached + replayed
OpenAI Responses API	any host, path ending `/responses`	cached + replayed
Azure OpenAI	covered by chat/completions matcher	cached + replayed
OpenAI-compatible proxies	covered by chat/completions matcher	cached + replayed
Anthropic	`api.anthropic.com/v1/messages`	cached + replayed
Google Gemini	`generativelanguage.googleapis.com/*:generateContent`	pass-through
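The matching rules in the table can be approximated with a few string checks on the URL. This is a sketch of the rules as described, not Khazad's matcher; `matched_provider` and the labels it returns are hypothetical.

```python
from typing import Optional
from urllib.parse import urlsplit


def matched_provider(url: str) -> Optional[str]:
    """Return a provider label for cacheable URLs, else None (illustrative rules)."""
    parts = urlsplit(url)
    host, path = parts.hostname or "", parts.path
    if path.endswith("/chat/completions"):
        return "openai-chat"        # any host: OpenAI, Azure, proxies
    if path.endswith("/responses"):
        return "openai-responses"   # any host
    if host == "api.anthropic.com" and path == "/v1/messages":
        return "anthropic"
    if host == "generativelanguage.googleapis.com" and path.endswith(":generateContent"):
        return "gemini"
    return None  # everything else passes through untouched


gemini = "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash"
for url in [
    "https://api.openai.com/v1/chat/completions",
    "http://localhost:11434/v1/chat/completions",
    "https://api.anthropic.com/v1/messages",
    gemini + ":generateContent",
    gemini + ":streamGenerateContent",
    "https://example.com/health",
]:
    print(url, "->", matched_provider(url))
```

Because the OpenAI rules look only at the path, Azure deployments and local proxies are covered without extra configuration, while the Gemini streaming endpoint falls through to `None`.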

The module exposes five functions as the singleton API. The `Khazad` class exposes the same surface as instance methods (no `init`; instantiation does that).

Function	Description
`init(...)`	Activate the global singleton: builds the embedder, connects to Redis, installs the `httpx` transport patch. Required before any LLM traffic; calling twice without `stop()` is a no-op. See Configuration.
`stop()`	Restore original `httpx` transports, close Redis, and clear the singleton. Idempotent. Cached data in Redis stays.
`get_stats()`	Returns a thread-safe snapshot of cache metrics (requests, hits, misses, hit rate, avg similarity) as a `Stats` object; call `.to_dict()` for a dictionary.
`flush()`	Clear all cached entries in the current namespace and reset stats counters. Destructive.
`is_active()`	Returns `True` if Khazad is currently running (initialized and not stopped).

All parameters are the same whether you use `khazad.init()` or `Khazad(...)`:

Khazad(
    redis_url="redis://localhost:6379",  
    threshold=0.90,                     
    ttl=3600,                            
    namespace="khazad",                 
    embedder="huggingface",              
    embedding_model="redis/langcache-embed-v2",
    log_level="INFO",                    
    hosts=None,                          
    cache_scope="model"
)

Parameter	Default	Description
`redis_url`	`"redis://localhost:6379"`	Connection URL for the Redis 8 instance that stores vectors and cached responses.
`threshold`	`0.90`	Cosine similarity threshold (0.0-1.0) above which a request counts as a cache hit.
`ttl`	`3600`	Time-to-live in seconds for cached response bodies; `None` means no expiry.
`namespace`	`"khazad"`	Prefix for all Redis keys, isolating this cache from other data and other namespaces.
`embedder`	`"huggingface"`	Embedding backend: `"huggingface"` (free, local) or `"openai"` (paid API).
`embedding_model`	`"redis/langcache-embed-v2"`	Model used to embed prompts; must match the chosen `embedder`.
`log_level`	`"INFO"`	Logging verbosity: `DEBUG`, `INFO`, `WARNING`, or `ERROR`.
`hosts`	`None`	Opt-in host allowlist; `None` means all matching hosts (see below).
`cache_scope`	`"model"`	Cache partitioning: `"model"` (per `(host, model)`) or `"host"` (per provider, see below).

**`hosts`: opt-in allowlist.** By default Khazad considers traffic to any host that matches a provider URL pattern. Pass an explicit allowlist to restrict interception to the endpoints you intend to cache; everything else passes through untouched. Supports exact hosts and `*.` wildcard subdomains:

khazad.init(hosts=["api.openai.com", "*.openai.azure.com"])
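A minimal sketch of how exact and `*.` wildcard entries could be matched against a request host. `host_allowed` is a hypothetical helper, and it assumes a wildcard covers subdomains only, not the bare domain itself; Khazad's own rule may differ on that point.

```python
from typing import Optional


def host_allowed(host: str, allowlist: Optional[list]) -> bool:
    """Check a request host against an allowlist of exact and '*.' wildcard entries."""
    if allowlist is None:
        return True  # no allowlist: every matching host is eligible
    host = host.lower()
    for pattern in allowlist:
        pattern = pattern.lower()
        if pattern.startswith("*."):
            # "*.example.com" -> ".example.com"; the leading dot prevents
            # "evilexample.com" from slipping through.
            if host.endswith(pattern[1:]):
                return True
        elif host == pattern:
            return True
    return False


allowlist = ["api.openai.com", "*.openai.azure.com"]
print(host_allowed("api.openai.com", allowlist))
print(host_allowed("my-resource.openai.azure.com", allowlist))
print(host_allowed("api.anthropic.com", allowlist))
```

Keeping the leading dot in the suffix comparison is the important detail: it makes the wildcard match on a label boundary instead of on any string ending.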

**`cache_scope`: share one cache across a provider's models.** Driven by the `CacheScope` enum (importable from `khazad`); the string values `"model"` and `"host"` are accepted too. By default (`CacheScope.MODEL`) each `(host, model)` pair gets its own vector set, so a `gpt-4o` answer never serves a `gpt-4o-mini` call. Set it to `CacheScope.HOST` to scope by host only, so that every model or deployment on the same provider shares a single cache:

import khazad
from khazad import CacheScope

khazad.init(cache_scope=CacheScope.HOST)   # or cache_scope="host"

The host always stays part of the scope, so different providers never mix (an Azure OpenAI response is never replayed to a Gemini client). Use it only for format-compatible pools, e.g. multiple Azure OpenAI deployments, or treating `gpt-4o` and `gpt-4o-mini` as interchangeable. The trade-off is semantic: a smaller model may serve an answer originally produced by a larger one.
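One way to picture the two scopes is as the key that selects a vector set. The sketch below uses a stand-in `Scope` enum and a made-up `namespace:host:model` layout; neither is Khazad's real key format, only the partitioning behaviour is taken from the description above.

```python
from enum import Enum


class Scope(str, Enum):
    """Stand-in for khazad.CacheScope, using the documented string values."""
    MODEL = "model"
    HOST = "host"


def vector_set_key(namespace: str, host: str, model: str, scope: Scope) -> str:
    """Build a partition key (hypothetical layout, not Khazad's real key format)."""
    if scope is Scope.HOST:
        return f"{namespace}:{host}"          # all models on the host share one set
    return f"{namespace}:{host}:{model}"      # one set per (host, model) pair


host = "api.openai.com"
print(vector_set_key("khazad", host, "gpt-4o", Scope.MODEL))
print(vector_set_key("khazad", host, "gpt-4o-mini", Scope.MODEL))
print(vector_set_key("khazad", host, "gpt-4o", Scope.HOST))
print(vector_set_key("khazad", host, "gpt-4o-mini", Scope.HOST))
```

Under the model scope the two models land in different sets; under the host scope they share one, while a different host always yields a different key.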

Threshold guidance:

`0.95+`: strict, near-identical prompts only
`0.90`: recommended default
`0.85`: aggressive, higher hit rate
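The threshold is compared against the cosine similarity between the incoming request's embedding and the nearest cached one. A toy sketch with 2-dimensional vectors (real embeddings have hundreds of dimensions); treating a score exactly equal to the threshold as a hit is an assumption made here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def is_hit(query_vec, cached_vec, threshold=0.90):
    """True when the cached entry is similar enough to be served."""
    return cosine_similarity(query_vec, cached_vec) >= threshold


cached = [1.0, 0.0]
near = [0.92, 0.392]     # similarity of about 0.92
looser = [0.87, 0.493]   # similarity of about 0.87

for threshold in (0.95, 0.90, 0.85):
    print(threshold, is_hit(near, cached, threshold), is_hit(looser, cached, threshold))
```

The `near` vector is served at 0.90 and 0.85 but not at 0.95, and `looser` only at 0.85, which is the hit-rate versus false-positive trade-off the guidance describes.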

TTL: the response body expires in Redis after `ttl` seconds. Khazad prunes the matching vector automatically the next time it is found without a body, so expired entries clean themselves up.
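The lazy clean-up can be pictured with an in-memory stand-in: bodies carry an expiry, vectors do not, and a lookup that finds a vector without a live body removes the orphan. This is a sketch of the pattern only. It looks entries up by id instead of by similarity, uses no Redis, and `LazyPruningCache` is a hypothetical class.

```python
import time


class LazyPruningCache:
    """In-memory stand-in: bodies expire on their own, vectors are pruned on lookup."""

    def __init__(self, ttl, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.vectors = {}   # entry id -> embedding
        self.bodies = {}    # entry id -> (expires_at, body)

    def put(self, entry_id, vector, body):
        self.vectors[entry_id] = vector
        self.bodies[entry_id] = (self.clock() + self.ttl, body)

    def _body(self, entry_id):
        # Emulates key expiry: an expired body simply is not there.
        expires_at, body = self.bodies.get(entry_id, (0.0, None))
        return body if self.clock() < expires_at else None

    def get(self, entry_id):
        body = self._body(entry_id)
        if body is None and entry_id in self.vectors:
            # Vector found without a body: drop the orphan so it cannot match again.
            del self.vectors[entry_id]
            self.bodies.pop(entry_id, None)
        return body


now = [0.0]  # fake clock so the example runs instantly
cache = LazyPruningCache(ttl=3600, clock=lambda: now[0])
cache.put("q1", [0.1, 0.9], '{"answer": "Rome"}')

print(cache.get("q1"))                          # body is still live
now[0] = 3601                                   # jump past the TTL
print(cache.get("q1"), "q1" in cache.vectors)   # miss, and the vector is gone
```

No background sweeper is needed with this approach: the cost of cleaning up is paid by the first lookup that touches an expired entry.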

Backend	Cost	Notes
`huggingface` (default)	Free	Downloads model on first use
`openai`	Paid	`uv add khazad[openai-embeddings]`

Khazad emits a log line for every intercepted request, so you can watch cache behaviour in real time. A hit reports the cosine similarity that triggered it and the replay latency; a miss notes that the request was forwarded upstream. Set `log_level` to `DEBUG` for per-request detail, or keep it at `INFO` for just hits and misses.

[Khazad] CACHE HIT - Similarity: 0.94 - Latency: 4ms
[Khazad] CACHE MISS - Forwarding to API

For aggregate metrics, call `get_stats()` at any time. It returns a thread-safe snapshot of total requests, hits, misses, the resulting hit rate, and the average similarity of served hits, which suits periodic logging or exposing as a Prometheus gauge. Watch `avg_hit_similarity`: if it hovers near your `threshold`, your traffic may be too diverse to cache safely and you should raise the threshold.

cache.get_stats().to_dict()
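The "hovers near your threshold" check can be automated once the stats are in dictionary form. The sketch below works on a hand-written dictionary. It assumes `to_dict()` exposes an `avg_hit_similarity` key, and the 0.02 margin is an arbitrary illustrative cut-off, not a value recommended by Khazad.

```python
def similarity_margin(stats: dict, threshold: float) -> float:
    """How far the average served hit sits above the threshold."""
    return stats["avg_hit_similarity"] - threshold


def too_diverse(stats: dict, threshold: float, min_margin: float = 0.02) -> bool:
    """Flag caches whose hits only barely clear the threshold (margin is arbitrary)."""
    return similarity_margin(stats, threshold) < min_margin


# Hand-written samples standing in for cache.get_stats().to_dict()
comfortable = {"avg_hit_similarity": 0.97}
borderline = {"avg_hit_similarity": 0.91}

print(too_diverse(comfortable, threshold=0.90))
print(too_diverse(borderline, threshold=0.90))
```

A check like this fits a periodic health log: a borderline result suggests raising the threshold before wrong answers start being served.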

Contributions welcome — see CONTRIBUTING.md. Please read it before opening a pull request, as it covers the branching model, coding conventions, and the lint and test checks your changes are expected to pass.

The unit and integration suites use fake embedders and mock transports, so the full test run needs neither a live Redis instance nor real API keys.

source & further reading

github.com — original article
