{"slug": "show-hn-khazad-transparent-semantic-cache-for-llm-calls-on-redis-vector-sets", "title": "Show HN: Khazad – Transparent Semantic Cache for LLM Calls on Redis Vector Sets", "summary": "Khazad, a transparent semantic cache for LLM API calls built on Redis Vector Sets, reduces API calls by ~50% and latency by ~96% by intercepting HTTP traffic at the transport layer without code changes. The tool supports model-aware and conversation-aware caching, streaming both ways, and is designed for high-volume repetitive traffic like FAQ bots and CI environments.", "body_md": "*Transparent, transport-layer semantic cache for LLM API calls powered by Redis Vector Sets.*\n\n**~50% fewer API calls · ~96% faster on hits · ~50% lower spend · 100% transparent**\n\nIllustrative figures at a 0.50 hit rate (280ms cache replay vs. ~7900ms upstream call). Your numbers depend on traffic.\n\nKhazad intercepts LLM HTTP traffic at the **transport layer** and serves semantically equivalent requests from a Redis vector cache, **with zero changes to your application code**.\n\nKey properties:\n\n- **Model-aware**: each `(provider, model)` pair gets its own vector set, so a `gpt-4o` answer is never served to a `gpt-4o-mini` call, no matter how similar the prompt. Set `cache_scope=\"host\"` to scope by **provider host only**, letting every model or deployment on the same provider share one cache (different providers stay isolated; see [Configuration](#configuration)).\n- **Conversation-aware**: the whole message list (system, user, assistant) is embedded, not just the last user turn. Two different conversations ending with the same follow-up question (\"What about its population?\") never collide.\n- **Streaming both ways**: cache hits replay as real SSE streams (sync and async clients); cache misses that stream are captured chunk-by-chunk with no added latency and reassembled into a canonical JSON response, so a streamed answer can later serve a non-streamed request and vice versa. 
Aborted streams are never cached.\n\nSemantic caching trades exactness for cost and latency. Know the trade before turning it on. Use it when you have:\n\n- High-volume, repetitive traffic: FAQ bots, support assistants, RAG front-ends where many users ask near-identical questions\n- Dev / test / CI environments: stop paying for the same prompt on every run\n- Demos and load tests where deterministic, instant responses are a feature\n- Cost ceilings on internal tools\n\n**Operational caveats:**\n\n- **Privacy**: prompts are embedded and responses are stored **in clear text in Redis**. If prompts may contain PII or secrets, set a `ttl`, enable Redis AUTH/TLS, and treat the Redis instance with the same care as your logs.\n- **Process-wide patch**: Khazad wraps *every* `httpx.Client` / `AsyncClient` created after `init()`. Non-LLM httpx traffic passes through untouched, but the patch is process-global. Call `stop()` on shutdown. Use `hosts=[...]` to restrict interception to the endpoints you actually want cached.\n- **httpx-only**: SDKs built on `httpx` are covered (OpenAI, Anthropic, Gemini via `google-genai`, Mistral, and most proxies). SDKs using `requests`, `aiohttp`, or `boto3` (AWS Bedrock) are not intercepted.\n- **Single process**: the patch lives in the Python process that called `init()`. Multiple workers share the Redis cache but each needs its own `init()`.\n- **False-positive control**: start at `threshold=0.90` and *raise* it if you see wrong hits. 
Watch `avg_hit_similarity` in `get_stats()`: if it sits near your threshold, your traffic may be too diverse to cache safely.\n\n- Python >= 3.10\n- Redis 8 (Vector Sets support required)\n\n```\ndocker run -d --name redis8 -p 6379:6379 redis:8\n```\n\n**From PyPI**:\n\n```\nuv add khazad\n```\n\nFor the OpenAI embedding backend (optional):\n\n```\nuv add \"khazad[openai-embeddings]\"\n```\n\n**Local / development install:**\n\n```\ngit clone https://github.com/GuglielmoCerri/khazad.git\ncd khazad\nuv sync --group dev\n```\n\n`uv sync` reads `pyproject.toml`, creates `.venv` if it doesn't exist, and installs the project itself in editable mode, so no separate `pip install -e .` is needed.\n\nTo use the local checkout from another project:\n\n```\nuv add --editable /path/to/khazad\n```\n\nUse the `Khazad` class directly when you need explicit control over the instance, which is useful in long-running services, tests, or dependency injection:\n\n``` python\nfrom khazad import Khazad\n\ncache = Khazad(\n    redis_url=\"redis://localhost:6379\",\n    threshold=0.90,\n    ttl=3600,\n    log_level=\"DEBUG\",\n)\n\nprint(cache.is_active())   # True\nprint(cache.get_stats())   # Stats(total_requests=0, ...)\n```\n\nAvailable functions: `init()`, `stop()`, `get_stats()`, `flush()`, `is_active()`. See [API Reference](#api-reference) for details.\n\nKhazad activates once and intercepts **every** LLM SDK that uses `httpx` underneath, with no per-provider wiring needed. 
For further examples see the [examples folder](https://github.com/GuglielmoCerri/khazad/tree/main/examples).\n\nPick the provider you use:\n\n**OpenAI**: official SDK against `api.openai.com`\n\n``` python\nimport os\nimport time\n\nfrom openai import OpenAI\n\nfrom khazad import Khazad\n\ncache = Khazad(redis_url=\"redis://localhost:6379\", threshold=0.90, log_level=\"DEBUG\")\n\nclient = OpenAI(api_key=os.environ[\"OPENAI_API_KEY\"])\n\nprompt = \"What is the capital of Italy?\"\n\nfor i in range(2):\n    start = time.perf_counter()\n    response = client.chat.completions.create(\n        model=\"gpt-4o\",\n        messages=[{\"role\": \"user\", \"content\": prompt}],\n    )\n    elapsed = (time.perf_counter() - start) * 1000\n    print(f\"[call {i + 1}] {elapsed:.1f}ms — {response.choices[0].message.content}\")\n\nprint(cache.get_stats().to_dict())\ncache.stop()\n```\n\nMatches `*/chat/completions` and `*/responses` paths. Streaming requests are also cached.\n\n**Azure OpenAI**: Azure deployments with Entra ID auth via the `AzureOpenAI` client\n\n``` python\nimport os\nimport time\n\nfrom azure.identity import DefaultAzureCredential, get_bearer_token_provider\nfrom openai import AzureOpenAI\n\nfrom khazad import CacheScope, Khazad\n\ncache = Khazad(\n    redis_url=\"redis://localhost:6379\",\n    threshold=0.90,\n    cache_scope=CacheScope.HOST,\n    namespace=\"azure_openai_example\",\n)\n\nendpoint = os.environ[\"AZURE_OPENAI_ENDPOINT\"]\ndeployment = os.environ.get(\"AZURE_OPENAI_DEPLOYMENT\", \"gpt-4.1\")\ntoken_provider = get_bearer_token_provider(\n    DefaultAzureCredential(), \"https://cognitiveservices.azure.com/.default\"\n)\napi_version = os.environ.get(\"AZURE_OPENAI_API_VERSION\", \"2024-12-01-preview\")\n\nclient = AzureOpenAI(\n    api_version=api_version,\n    azure_endpoint=endpoint,\n    azure_ad_token_provider=token_provider,\n)\n\nprompt = \"What is the capital of Spain?\"\n\nfor i in range(2):\n    start = time.perf_counter()\n    response = 
client.chat.completions.create(\n        model=deployment,\n        messages=[{\"role\": \"user\", \"content\": prompt}],\n    )\n    elapsed = (time.perf_counter() - start) * 1000\n    print(f\"[call {i + 1}] {elapsed:.1f}ms — {response.choices[0].message.content}\")\n\nprint(cache.get_stats().to_dict())\ncache.stop()\n```\n\nThis example authenticates with Microsoft Entra ID (`DefaultAzureCredential`), so no API key is needed, and uses `cache_scope=CacheScope.HOST` so every deployment on the same Azure resource shares one cache. API-key auth works too: Khazad matches the request path (`/chat/completions`), not the auth method or host.\n\n**OpenAI-compatible proxies**: LiteLLM, vLLM, Ollama, …\n\n``` python\nimport time\n\nfrom openai import OpenAI\n\nfrom khazad import Khazad\n\ncache = Khazad(redis_url=\"redis://localhost:6379\", threshold=0.90, namespace=\"ollama_example\")\n\nclient = OpenAI(base_url=\"http://localhost:11434/v1\", api_key=\"ollama\")\nmodel = \"llama3\"\n\nprompt = \"What is the capital of Spain?\"\n\nfor i in range(2):\n    start = time.perf_counter()\n    response = client.chat.completions.create(\n        model=model,\n        messages=[{\"role\": \"user\", \"content\": prompt}],\n    )\n    elapsed = (time.perf_counter() - start) * 1000\n    print(f\"[call {i + 1}] {elapsed:.1f}ms — {response.choices[0].message.content}\")\n\nprint(cache.get_stats().to_dict())\ncache.stop()\n```\n\nAny host whose URL path ends with `/chat/completions` or `/responses` is cached. 
This covers vLLM (`http://host:8000/v1/...`), Ollama (`http://localhost:11434/v1/...`), Mistral, etc.\n\n**Anthropic**: Claude via the official SDK\n\n``` python\nimport os\nimport time\n\nfrom anthropic import Anthropic\n\nfrom khazad import Khazad\n\ncache = Khazad(redis_url=\"redis://localhost:6379\", threshold=0.90, namespace=\"anthropic_example\")\n\nclient = Anthropic(api_key=os.environ[\"ANTHROPIC_API_KEY\"])\nmodel = \"claude-haiku-4-5-20251001\"\n\nprompt = \"What is the capital of France?\"\n\nfor i in range(2):\n    start = time.perf_counter()\n    message = client.messages.create(\n        model=model,\n        max_tokens=256,\n        messages=[{\"role\": \"user\", \"content\": prompt}],\n    )\n    elapsed = (time.perf_counter() - start) * 1000\n    print(f\"[call {i + 1}] {elapsed:.1f}ms — {message.content[0].text}\")\n\nprint(cache.get_stats().to_dict())\ncache.stop()\n```\n\nMatches `api.anthropic.com/v1/messages`. Streaming responses are replayed from cache as SSE.\n\n**Google Gemini**: `google-genai` SDK\n\n``` python\nimport os\nimport time\n\nfrom google import genai\n\nfrom khazad import Khazad\n\ncache = Khazad(redis_url=\"redis://localhost:6379\", threshold=0.90)\nclient = genai.Client(api_key=os.environ[\"GEMINI_API_KEY\"])\n\nfor i in range(2):\n    start = time.perf_counter()\n    response = client.models.generate_content(\n        model=\"gemini-2.5-flash\",\n        contents=\"What is the capital of Italy?\",\n    )\n    elapsed = (time.perf_counter() - start) * 1000\n    print(f\"[call {i + 1}] {elapsed:.1f}ms — {response.text}...\")\n\nprint(cache.get_stats().to_dict())\ncache.stop()\n```\n\nMatches `generativelanguage.googleapis.com/*/models/*:generateContent`. 
Gemini streaming (`:streamGenerateContent`) passes through uncached.\n\n| Provider | URL pattern matched | Streaming |\n|---|---|---|\n| OpenAI Chat Completions | any host, path ending `/chat/completions` | cached + replayed |\n| OpenAI Responses API | any host, path ending `/responses` | cached + replayed |\n| Azure OpenAI | covered by chat/completions matcher | cached + replayed |\n| OpenAI-compatible proxies | covered by chat/completions matcher | cached + replayed |\n| Anthropic | `api.anthropic.com/v1/messages` | cached + replayed |\n| Google Gemini | `generativelanguage.googleapis.com/*:generateContent` | pass-through |\n\nThe module exposes five functions as the singleton API. The `Khazad` class exposes the same surface as instance methods (no `init`, instantiation does that).\n\n| Function | Description |\n|---|---|\n| `init(...)` | Activate the global singleton: builds the embedder, connects to Redis, installs the `httpx` transport patch. Required before any LLM traffic; calling twice without `stop()` is a no-op. See |\n| `stop()` | Restore original `httpx` transports, close Redis, and clear the singleton. Idempotent. Cached data in Redis stays. |\n| `get_stats()` | Returns a thread-safe snapshot of cache metrics (requests, hits, misses, hit rate, avg similarity) as a `Stats` object; call `.to_dict()` on it for a plain dictionary. |\n| `flush()` | Clear all cached entries in the current namespace and reset stats counters. Destructive. |\n| `is_active()` | Returns `True` if Khazad is currently running (initialized and not stopped). 
|\n\nAll parameters are the same whether you use `khazad.init()` or `Khazad(...)`:\n\n```\nKhazad(\n    redis_url=\"redis://localhost:6379\",\n    threshold=0.90,\n    ttl=3600,\n    namespace=\"khazad\",\n    embedder=\"huggingface\",\n    embedding_model=\"redis/langcache-embed-v2\",\n    log_level=\"INFO\",\n    hosts=None,\n    cache_scope=\"model\",\n)\n```\n\n| Parameter | Default | Description |\n|---|---|---|\n| `redis_url` | `\"redis://localhost:6379\"` | Connection URL for the Redis 8 instance that stores vectors and cached responses. |\n| `threshold` | `0.90` | Cosine similarity threshold (0.0-1.0) above which a request counts as a cache hit. |\n| `ttl` | `3600` | Time-to-live in seconds for cached response bodies; `None` means no expiry. |\n| `namespace` | `\"khazad\"` | Prefix for all Redis keys, isolating this cache from other data and other namespaces. |\n| `embedder` | `\"huggingface\"` | Embedding backend: `\"huggingface\"` (free, local) or `\"openai\"` (paid API). |\n| `embedding_model` | `\"redis/langcache-embed-v2\"` | Model used to embed prompts; must match the chosen `embedder`. |\n| `log_level` | `\"INFO\"` | Logging verbosity: `DEBUG`, `INFO`, `WARNING`, or `ERROR`. |\n| `hosts` | `None` | Opt-in host allowlist; `None` means all matching hosts (see below). |\n| `cache_scope` | `\"model\"` | Cache partitioning: `\"model\"` (per `(host, model)`) or `\"host\"` (per provider, see below). |\n\n**`hosts`: opt-in allowlist.** By default Khazad considers traffic to any host that matches a provider URL pattern. Pass an explicit allowlist to restrict interception to the endpoints you intend to cache; everything else passes through untouched. 
Supports exact hosts and `*.` wildcard subdomains:\n\n```\nkhazad.init(hosts=[\"api.openai.com\", \"*.openai.azure.com\"])\n```\n\n**`cache_scope`: share one cache across a provider's models.** Driven by the `CacheScope` enum (importable from `khazad`); the string values `\"model\"` and `\"host\"` are accepted too. By default (`CacheScope.MODEL`) each `(host, model)` pair gets its own vector set, so a `gpt-4o` answer never serves a `gpt-4o-mini` call. Set it to `CacheScope.HOST` to scope by **host only**, so that every model or deployment on the same provider shares a single cache:\n\n``` python\nimport khazad\nfrom khazad import CacheScope\n\nkhazad.init(cache_scope=CacheScope.HOST)   # or cache_scope=\"host\"\n```\n\nThe host always stays part of the scope, so different providers never mix (an Azure OpenAI response is never replayed to a Gemini client). Use it only for format-compatible pools, e.g. multiple Azure OpenAI deployments, or treating `gpt-4o` and `gpt-4o-mini` as interchangeable. The trade-off is semantic: a request to a smaller model may be served an answer originally produced by a larger one.\n\n**Threshold guidance:**\n\n- `0.95+`: strict, near-identical prompts only\n- `0.90`: recommended default\n- `0.85`: aggressive, higher hit rate\n\n**TTL:** the response body expires in Redis after `ttl` seconds. Khazad prunes the matching vector automatically the next time it is found without a body, so expired entries clean themselves up.\n\n| Backend | Cost | Notes |\n|---|---|---|\n| `huggingface` (default) | Free | Downloads model on first use |\n| `openai` | Paid | `uv add \"khazad[openai-embeddings]\"` |\n\nKhazad emits a log line for every intercepted request, so you can watch cache behaviour in real time. A hit reports the cosine similarity that triggered it and the replay latency; a miss notes that the request was forwarded upstream. 
Set `log_level` to `DEBUG` for per-request detail, or keep it at `INFO` for just hits and misses.\n\n```\n[Khazad] CACHE HIT - Similarity: 0.94 - Latency: 4ms\n[Khazad] CACHE MISS - Forwarding to API\n```\n\nFor aggregate metrics, call `get_stats()` at any time. It returns a thread-safe snapshot of total requests, hits, misses, the resulting hit rate, and the average similarity of served hits, which suits periodic logging or exposing as a Prometheus gauge. Watch `avg_hit_similarity`: if it hovers near your `threshold`, your traffic may be too diverse to cache safely and you should raise the threshold.\n\n```\ncache.get_stats().to_dict()\n# {'total_requests': 1000, 'cache_hits': 720, 'cache_misses': 280,\n#  'hit_rate': 0.72, 'avg_hit_similarity': 0.943}\n```\n\nContributions are welcome; see [CONTRIBUTING.md](https://github.com/GuglielmoCerri/khazad/blob/main/CONTRIBUTING.md). Please read it before opening a pull request, as it covers the branching model, coding conventions, and the lint and test checks your changes are expected to pass.\n\nThe unit and integration suites use fake embedders and mock transports, so the full test run needs neither a live Redis instance nor real API keys.", "url": "https://wpnews.pro/news/show-hn-khazad-transparent-semantic-cache-for-llm-calls-on-redis-vector-sets", "canonical_source": "https://github.com/GuglielmoCerri/khazad", "published_at": "2026-06-29 21:01:04+00:00", "updated_at": "2026-06-29 21:20:03.718352+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "ai-infrastructure"], "entities": ["Khazad", "Redis", "OpenAI", "Anthropic", "Gemini", "Mistral", "AWS Bedrock", "PyPI"], "alternates": {"html": "https://wpnews.pro/news/show-hn-khazad-transparent-semantic-cache-for-llm-calls-on-redis-vector-sets", "markdown": "https://wpnews.pro/news/show-hn-khazad-transparent-semantic-cache-for-llm-calls-on-redis-vector-sets.md", "text": 
"https://wpnews.pro/news/show-hn-khazad-transparent-semantic-cache-for-llm-calls-on-redis-vector-sets.txt", "jsonld": "https://wpnews.pro/news/show-hn-khazad-transparent-semantic-cache-for-llm-calls-on-redis-vector-sets.jsonld"}}