PrismLib – semantic LLM cache and cluster mesh that cuts token spend

PrismLib, a new open-source Python package, offers a semantic LLM cache, distributed database driver, and cluster intelligence mesh that can reduce token spend by 60-80% and cut DB read latency by 98.6%. The tool runs entirely in-process without external dependencies like Redis or Kubernetes, providing multi-tenant isolation and automatic failover for AI applications.

Tensor-native LLM cache, distributed DB driver, and cluster intelligence in one package.

PrismLib has three layers. Use any combination:

| Layer | What it solves | Key number | Install |
|---|---|---|---|
| PrismCache | LLM API cost: semantic cache catches repeated and paraphrased queries in-process | 91–96% hit rate | `pip install "prismlib[cache]"` |
| PrismDriver | DB read latency: WAL-streamed local index replaces network round-trips | 98.6% latency reduction (143 ms → 2 ms) | `pip install "prismlib[fabric]"` |
| PrismLib Micro | Cluster token cost + HA: shares answers across containers, auto-failover, health mesh | 76% fewer tokens cluster-wide | included in `prismlib[fabric]` |

All three run entirely in-process. No Redis. No Pinecone. No Prometheus. No Kubernetes operator.

PrismCache wraps any LLM call. Paraphrased queries return the cached answer without touching the API. Multi-tenant math: a JL projection seeded by SHA-256(tenant_id) gives each tenant a mathematically isolated address space. It is a projection matrix, not a query filter.

PrismDriver has two components on two machines:

- Server Wrapper (DB node): intercepts WAL/binlog, vectorizes rows, streams encrypted float32 frames via CHORUS Fabric
- DLL Driver (app node): subscribes to the stream, keeps a local PrismResonance index warm; reads never leave the process

PrismLib Micro is built into `prismlib[fabric]`, zero extra install:

- ClusterCache: once any node answers a query, every peer caches it via CHORUS TOKEN_SYNC frames. BLUE and ORANGE nodes are billed 0 tokens on warm queries.
- AlertManager: 12 default health rules; fires a SIGNAL frame + admin email in <1s when CPU/RAM/disk thresholds are crossed. No scrape interval. No Datadog agent.
- Blue/Green/Orange failover: GREEN is the active master, BLUE is a warm standby (auto-promotes in ~3s if GREEN goes silent), ORANGE is a syncing reserve.
- ContextCompressor: cosine-sim top-K chunk selection before every LLM call. 58–64% context token reduction, zero extra cost.
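The Blue/Green/Orange promotion rule described above can be pictured as a heartbeat timeout. The following is only a minimal sketch under stated assumptions: `FailoverMonitor`, `on_heartbeat`, `tick`, and the fixed 3-second constant are hypothetical names chosen for illustration, not PrismLib's API, and the real mesh may detect silence differently.

```python
from dataclasses import dataclass

# Assumption: mirrors the "~3s" promotion figure quoted in this README.
PROMOTE_AFTER_S = 3.0


@dataclass
class FailoverMonitor:
    """Hypothetical monitor running on the BLUE standby node."""

    role: str = "BLUE"
    last_heartbeat: float = 0.0

    def on_heartbeat(self, now: float) -> None:
        # Record the time of the most recent heartbeat seen from GREEN.
        self.last_heartbeat = now

    def tick(self, now: float) -> str:
        # Promote BLUE to GREEN once GREEN has been silent long enough.
        silent_for = now - self.last_heartbeat
        if self.role == "BLUE" and silent_for >= PROMOTE_AFTER_S:
            self.role = "GREEN"
        return self.role


monitor = FailoverMonitor()
monitor.on_heartbeat(now=100.0)
print(monitor.tick(now=101.0))  # GREEN was heard 1s ago, so BLUE stays standby
print(monitor.tick(now=103.5))  # 3.5s of silence, so BLUE promotes itself
```

Passing the clock in as an argument keeps the rule deterministic and easy to test; a real node would feed it `time.monotonic()`.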
Built on two open-source InsightIts libraries:

- [PrismResonance](https://github.com/insightitsGit/prismresonance): wave-memory similarity engine powering every cache lookup and local vector index
- [CHORUS Fabric](https://github.com/insightitsGit/chorus_fabric): encrypted gRPC binary streaming protocol carrying float32 tensor frames between nodes

```bash
# Semantic LLM cache only
pip install "prismlib[cache]"

# With OpenAI embeddings
pip install "prismlib[cache,cache-openai]"

# With Anthropic/Voyage embeddings
pip install "prismlib[cache,cache-anthropic]"

# With Ollama local models
pip install "prismlib[cache,cache-ollama]"

# DB driver (app node)
pip install "prismlib[fabric]"

# Server Wrapper daemon (DB node, Linux/macOS)
pip install "prismlib[wrapper]"
prism-wrapper --config /etc/prism/wrapper.toml

# Everything
pip install "prismlib[all]"
```

Save 60-80% of LLM API calls by serving semantically identical queries from cache. Paraphrases hit the cache: "How do I reset my password?" and "I forgot my password, help" return the same answer without a second LLM call.

```python
from openai import OpenAI
from prism.cache import PrismCache

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
cache = PrismCache.build(tenant_id="my-app", llm_model="gpt-4o")

def ask(question: str) -> str:
    # On a cache miss, call_fn runs and its result is stored for later hits
    return cache.get_or_call(
        query=question,
        call_fn=lambda: openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": question}],
        ).choices[0].message.content,
    )
```

Each tenant gets a mathematically isolated cache space (JL projection seeded by tenant ID). One customer's cached answers never bleed into another's.

```python
from prism.cache import PrismCache

def get_cache(tenant_id: str) -> PrismCache:
    return PrismCache.build(tenant_id=tenant_id, llm_model="gpt-4o-mini")

# Tenant A and tenant B share no cache state
cache_a = get_cache("acme-corp")
cache_b = get_cache("globex-inc")

answer = cache_a.get_or_call(query="What is my plan limit?", call_fn=llm_call)
```

Wrap your existing LLM endpoint without changing any business logic.
```python
# FastAPI
from fastapi import FastAPI, Request
from prism.cache import PrismCache

app = FastAPI()
cache = PrismCache.build(tenant_id="api", llm_model="gpt-4o")

@app.post("/chat")
async def chat(request: Request):
    body = await request.json()
    question = body["message"]
    answer = await cache.aget_or_call(
        query=question,
        call_fn=lambda: llm_client.ask(question),
    )
    return {"answer": answer}

# Django: add to MIDDLEWARE in settings.py
# prism/middleware.py
from prism.cache import PrismCache

cache = PrismCache.build(tenant_id="django-app", llm_model="gpt-4o")

class PrismCacheMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        return self.get_response(request)

    def process_llm_query(self, question: str, call_fn) -> str:
        return cache.get_or_call(query=question, call_fn=call_fn)
```

```python
import asyncio
from prism.cache import PrismCache

cache = PrismCache.build(tenant_id="batch", llm_model="gpt-4o-mini")

async def process_batch(questions: list[str]) -> list[str]:
    tasks = [
        cache.aget_or_call(query=q, call_fn=lambda q=q: llm_call(q))
        for q in questions
    ]
    return await asyncio.gather(*tasks)
```

```python
from prism.cache import PrismCache

cache = PrismCache.build(tenant_id="finance", llm_model="gpt-4o")

# After processing queries...
metrics = cache.metrics()
print(f"Hit rate: {metrics.hit_rate:.0%}")
print(f"Tokens saved: {metrics.tokens_saved:,}")
print(f"Cost saved today: ${metrics.cost_saved_usd:.2f}")
print(f"Projected monthly: ${metrics.cost_saved_usd * 30:.0f}")
```

PrismDriver has two components that work together. Install each on the right machine.

On the DB node: Server Wrapper

The Server Wrapper is an OS daemon that sits next to your database. It reads WAL/binlog changes, vectorizes rows using RowVectorizer, encrypts them with TensorCipher (via CHORUS Fabric), and streams float32 frames to every connected DLL Driver.
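The encrypt-and-watermark step can be illustrated with a toy version of the formula this README gives for TensorCipher, V_enc = V @ K, followed by an HMAC-SHA256 tag over the packed float32 frame. This is a sketch under loud assumptions: how PrismLib actually derives K is not documented here, so `make_key` uses a signed permutation matrix purely because its inverse is its transpose, which is not a secure cipher. `make_key`, `seal`, and `open_frame` are hypothetical names.

```python
import hashlib
import hmac
import random
import struct


def make_key(secret: bytes, dim: int) -> list[list[float]]:
    """Hypothetical key K: a signed permutation matrix, so K @ K^T = I."""
    rng = random.Random(hashlib.sha256(secret).digest())
    perm = list(range(dim))
    rng.shuffle(perm)
    key = [[0.0] * dim for _ in range(dim)]
    for row, col in enumerate(perm):
        key[row][col] = rng.choice((-1.0, 1.0))
    return key


def row_times_matrix(v: list[float], m: list[list[float]]) -> list[float]:
    """Compute the row-vector product v @ m."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def seal(v: list[float], key: list[list[float]], mac_key: bytes) -> tuple[bytes, bytes]:
    """Encrypt as V_enc = V @ K, pack as little-endian float32, add an HMAC tag."""
    enc = row_times_matrix(v, key)
    frame = struct.pack(f"<{len(enc)}f", *enc)
    tag = hmac.new(mac_key, frame, hashlib.sha256).digest()
    return frame, tag


def open_frame(frame: bytes, tag: bytes, key: list[list[float]], mac_key: bytes) -> list[float]:
    """Verify the tag, then decrypt with the transpose of K."""
    expected = hmac.new(mac_key, frame, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("watermark mismatch: frame was altered")
    enc = list(struct.unpack(f"<{len(frame) // 4}f", frame))
    key_t = [list(col) for col in zip(*key)]
    return row_times_matrix(enc, key_t)


row_vector = [0.5, -1.25, 2.0, 0.0]
key = make_key(b"demo-secret", dim=4)
frame, tag = seal(row_vector, key, mac_key=b"demo-mac-key")
print(open_frame(frame, tag, key, mac_key=b"demo-mac-key") == row_vector)
```

Verifying the tag before decrypting means a corrupted or forged frame is rejected without ever reaching the local index.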
```bash
# Install on the DB node (Linux or macOS)
pip install "prismlib[wrapper]"

# Configure and start
prism-wrapper --config /etc/prism/wrapper.toml
```

```toml
# /etc/prism/wrapper.toml
[database]
flavor = "postgresql"
dsn = "postgresql://user:pass@localhost/mydb"

[chorus]
listen_port = 50051
tenant_id = "products-service"
```

Supported databases: PostgreSQL (WAL / wal2json), MySQL (binlog), CockroachDB (EXPERIMENTAL CHANGEFEED), TiDB (push model).

On the app node: DLL Driver

The DLL Driver is an in-process library that replaces your DB connection string. On startup it connects to the Server Wrapper, subscribes to the CHORUS Fabric stream, and keeps a local PrismResonance index warm. All reads hit the in-process index: no network round-trip, sub-millisecond latency.

```bash
# Install on the app node
pip install "prismlib[fabric]"
```

```python
# Before
import psycopg2
conn = psycopg2.connect("postgresql://user:secret@db-host:5432/mydb")

# After: no password, no hostname in app config
from prism.ffi import PrismDriver, DriverConfig

async with PrismDriver(DriverConfig(wrapper_host="db-proxy-1")) as driver:
    results = await driver.query(
        embedding=my_embedding_vector,
        top_k=5,
        threshold=0.85,
    )
```

The driver keeps a local PrismResonance cache warm via a background WAL subscription. Reads never touch the DB; they hit the in-process float32 index.

```python
from prism.ffi import PrismDriver, DriverConfig
import numpy as np

config = DriverConfig(
    wrapper_host="10.0.1.50",
    wrapper_port=50051,
    tenant_id="products-service",
)

async with PrismDriver(config) as driver:
    # Typical hit: < 1ms, no network round-trip
    query_vec = np.array([...], dtype=np.float32)
    matches = await driver.query(embedding=query_vec, top_k=10)
    for m in matches:
        print(f"{m.row_id} score={m.score:.3f} {m.text_repr}")
```

```python
async with PrismDriver(config) as driver:
    ack = await driver.write(
        row_id="product-42",
        data={"name": "Widget Pro", "price": 29.99, "stock": 150},
    )
    print(f"Written: event_id={ack.event_id}")
```

```go
// Go
import prism "github.com/insightitsGit/prismlib/go"

driver, _ := prism.Connect("db-proxy-1:50051", "my-tenant")
defer driver.Close()
results, _ := driver.Query(embedding, prism.QueryOpts{TopK: 5, Threshold: 0.85})
```

```csharp
// C#
using InsightIts.Prism;

await using var driver = new PrismDriver("db-proxy-1:50051", tenantId: "my-tenant");
await driver.ConnectAsync();
var results = await driver.QueryAsync(embedding, topK: 5, threshold: 0.85f);
```

```php
// PHP 8.0+
$driver = new PrismDriver('db-proxy-1', 50051, 'my-tenant');
$driver->connect();
$results = $driver->query($embedding, topK: 5, threshold: 0.85);
```

```
┌─ DB Node ────────────────────────────────────────────────────┐
│  PostgreSQL / MySQL / CockroachDB / TiDB                     │
│       │  WAL / binlog / changefeed                           │
│  ┌────▼───────────────────────────────────────────────────┐  │
│  │  prism-wrapper  (pip install "prismlib[wrapper]")      │  │
│  │  RowVectorizer → TensorCipher (V_enc = V @ K)          │  │
│  │  → HMAC-SHA256 watermark → CHORUSPublisher             │  │
│  └────────────────────────┬───────────────────────────────┘  │
└───────────────────────────┼──────────────────────────────────┘
                            │  CHORUS Fabric (gRPC, encrypted float32)
┌─ App Node (GREEN) ────────┼──────────────────────────────────┐
│  ┌────────────────────────▼───────────────────────────────┐  │
│  │  PrismDriver DLL  (pip install "prismlib[fabric]")     │  │
│  │  Subscribe loop → decrypt → PrismResonance index       │  │
│  └────────────────────────┬───────────────────────────────┘  │
│                           │  sub-ms query                    │
│  ┌────────────────────────▼───────────────────────────────┐  │
│  │  Your Application                                      │  │
│  │  ┌─────────────────┐  ┌────────────────────────────┐   │  │
│  │  │  PrismCache     │  │  PrismDriver               │   │  │
│  │  │  LLM cache      │  │  local PrismResonance      │   │  │
│  │  │                 │  │  cache, no DB round-trip   │   │  │
│  │  └─────────────────┘  └────────────────────────────┘   │  │
│  │  ┌─────────────────────────────────────────────────┐   │  │
│  │  │  ClusterCache   ← TOKEN_SYNC frames             │   │  │
│  │  │  AlertManager   ← HEALTH / SIGNAL frames        │   │  │
│  │  └─────────────────────────────────────────────────┘   │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────┬───────────────────────────────┘
                               │  CHORUS mesh
          ┌────────────────────┴────────────────────┐
          │  TOKEN_SYNC · HEALTH · SIGNAL · CONFIG  │
          ▼                                         ▼
┌─ App Node (BLUE) ──────┐          ┌─ App Node (ORANGE) ──────┐
│  ClusterCache          │          │  ClusterCache            │
│  warm standby          │          │  syncing reserve         │
│  auto-promotes if      │          │  (separate network)      │
│  GREEN silent 3s       │          │                          │
└────────────────────────┘          └──────────────────────────┘
```

Live results from Azure Container App (westus2), 1 vCPU / 2 GiB, mock LLM baseline:

| Scenario | Users | Duration | Hit rate | Queries | Tokens saved | Monthly est. |
|---|---|---|---|---|---|---|
| Light | 20 | 60s | 91.0% | 5,936 | 1,374,464 | $594 |
| Mixed | 50 | 300s | 95.9% | 6,973 | 1,673,216 | $723 |

Numbers use a mock LLM (80 ms sleep). With real GPT-4o calls (1–3 s), latency speedup is 4–13×; token savings are identical.

Live two-node benchmark (Azure Container Apps westus2, 30 users × 60s per phase):

| Phase | Path | Avg latency | Queries |
|---|---|---|---|
| Baseline (no driver) | App → DB node, network | 142.8 ms | 3,864 |
| Driver (local index) | App → in-process PrismResonance | 2.0 ms | 1,479 |

70.7× faster · 98.6% latency reduction

The 98.6% reduction is a direct result of CHORUS Fabric doing its job. The subscription loop streamed 11,000 rows at 26,000 rows/s from the DB node into the local PrismResonance index before the load test began. By the time the first /driver/query hit arrived, there were zero network hops: the answer was already in-process.
This is what CHORUS Fabric was designed for: getting tensor data to where the query is, before the query arrives.

```bash
# Two-node benchmark (requires both container apps running)
python benchmark/load/run_driver_benchmark.py \
  --app-url https://prism-benchmark.nicestone-720c6a9b.westus2.azurecontainerapps.io \
  --db-url https://prism-wrapper-sim.nicestone-720c6a9b.westus2.azurecontainerapps.io \
  --users 30 --duration 60

# PrismCache load test
python benchmark/load/run_benchmark.py \
  --host https://prism-benchmark.nicestone-720c6a9b.westus2.azurecontainerapps.io \
  --scenario mixed
```

See [benchmark/](https://github.com/insightitsGit/prismlib/blob/master/benchmark) for full results JSON, Locust CSV files, and the Azure deploy script.

PrismLib is built on two InsightIts open-source libraries. You can use them directly if you need lower-level access.

PrismResonance · github.com/insightitsGit/prismresonance

```bash
pip install prismresonance
```

The wave-memory similarity engine. Every cache lookup and local vector index in PrismLib goes through PrismResonance.

How it works:

- Receives a float32 embedding vector
- Johnson-Lindenstrauss reduces it to 64 dimensions using a projection matrix seeded by SHA-256(tenant_id); this is what gives each tenant a mathematically isolated address space
- Computes similarity as wave interference (cosine in projected space) in three lock-free phases: snapshot → ONNX MatMul → rank
- Returns ranked candidates in sub-millisecond time, entirely in-process

PrismCache wraps this for LLM response caching. PrismDriver's local replica is a PrismResonance index kept warm by WAL streaming.
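The tenant-seeded projection step can be made concrete with a small standard-library sketch. It assumes a dense Gaussian Johnson-Lindenstrauss matrix whose random generator is seeded by the SHA-256 digest of the tenant id; the real PrismResonance construction (and its ONNX MatMul path) may differ, and `tenant_projection`, `project`, and `cosine` are hypothetical helper names.

```python
import hashlib
import math
import random


def tenant_projection(tenant_id: str, in_dim: int, out_dim: int = 64) -> list[list[float]]:
    """Gaussian JL matrix (out_dim x in_dim), RNG seeded by SHA-256(tenant_id)."""
    seed = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)] for _ in range(out_dim)]


def project(matrix: list[list[float]], vector: list[float]) -> list[float]:
    """Reduce an embedding to out_dim dimensions."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Stand-ins for real embeddings: a query and a close paraphrase of it.
query = [math.sin(i) for i in range(128)]
paraphrase = [x + 0.01 * math.cos(i) for i, x in enumerate(query)]

acme = tenant_projection("acme-corp", in_dim=128)
globex = tenant_projection("globex-inc", in_dim=128)

# Same tenant: the paraphrase stays close to the query in projected space.
same_tenant = cosine(project(acme, query), project(acme, paraphrase))
# Different tenants: the same query lands in unrelated directions.
cross_tenant = cosine(project(acme, query), project(globex, query))
print(f"same tenant: {same_tenant:.3f}  cross tenant: {cross_tenant:.3f}")
```

Because each tenant's matrix is drawn independently, a vector projected for one tenant is close to random noise when compared against another tenant's index, so it would fall well below a similarity threshold such as the 0.85 used in the examples here.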
```python
from prismresonance import PrismProjector, WaveIndex

projector = PrismProjector(dim=64, tenant_id="my-tenant")
index = WaveIndex(projector)
index.add(vector=my_embedding, payload={"row_id": "product-1", "text": "Widget"})
results = index.query(vector=query_embedding, top_k=5, threshold=0.85)
```

CHORUS Fabric · github.com/insightitsGit/chorus_fabric

```bash
pip install chorus-fabric
```

The secure gRPC binary streaming protocol for machine-to-machine tensor communication. PrismDriver uses CHORUS Fabric as its transport layer between the server wrapper on the DB node and the DLL driver on the app node.

How it works:

- prism-wrapper (DB node) vectorizes WAL row events via RowVectorizer, encrypts them with TensorCipher (V_enc = V @ K), appends an HMAC-SHA256 watermark, and publishes batches of raw float32 frames
- PrismDriver (app node) opens a persistent WrapperService.Subscribe gRPC stream, receives encrypted frames, decrypts them, and feeds them into the local PrismResonance index
- Transport is pure binary float32 over gRPC server-streaming: no JSON serialization, no REST overhead
- The WrapperService proto also exposes Query, Write, Health, and Hello RPCs for direct interaction

```python
from chorus_fabric import CHORUSPublisher, DriverEndpoint

publisher = CHORUSPublisher(config)
publisher.add_driver(DriverEndpoint(host="10.0.1.50", port=50051, tenant_id="prod"))
await publisher.run(event_queue)  # streams WAL events to all connected drivers
```

CHORUS Fabric is the same protocol used in the CHORUS M2M system, InsightIts' 4-container gRPC topology for tensor communication between AI agents. The 98.6% latency reduction in the PrismDriver benchmark is evidence that the protocol works over real inter-container networking: 11,000 rows streamed at 26,000 rows/s across Azure Container Apps, then served locally at 2 ms.

PrismLib Micro is the cluster layer built into `prismlib[fabric]`. It adds the following capabilities on top of the single-node stack, with no extra install and no extra infra.
| Component | What it does |
|---|---|
| ClusterCache | Shares LLM answers across all nodes via CHORUS TOKEN_SYNC frames. Once any node answers a query, every other node serves it for 0 tokens. |
| AlertManager | Broadcasts health alerts as SIGNAL frames + admin email the moment CPU/RAM/disk/latency thresholds are crossed. No Prometheus. No Datadog. |
| Blue/Green/Orange failover | Three-tier hot-standby: GREEN (active), BLUE (warm standby, auto-promotes in ~3s), ORANGE (syncing reserve). No Raft dependency. No K8s operator. |
| ContextCompressor | Ranks RAG context chunks by cosine similarity, keeps top-K. Saves 58–64% of context tokens before every LLM call. In-process, no extra model. |

| Metric | Result |
|---|---|
| Token savings (cluster avg) | 76.1% |
| BLUE node cluster cache hit | 100% (0 LLM calls) |
| ORANGE node cross-network cache hit | 100% (0 LLM calls) |
| Context compression | 58–64% per query |
| Health alert propagation | <1 s (709–711 ms measured) |
| Failover (BLUE promoted to GREEN) | ~3–4 s, no human step |

See [benchmark/cluster/](https://github.com/insightitsGit/prismlib/blob/master/benchmark/cluster) for the full benchmark code and [benchmark/cluster/cluster_benchmark_results.json](https://github.com/insightitsGit/prismlib/blob/master/benchmark/cluster/cluster_benchmark_results.json) for raw results.

```python
from prism.cluster.cache import ClusterCache

cache = ClusterCache(node_id="node-1", fabric=chorus_fabric)

answer = await cache.get_or_call(
    query=user_question,
    query_vector=embed(user_question),
    call_fn=lambda: llm.complete(user_question),
    context_chunks=retrieved_docs,  # your RAG chunks
    chunk_vectors=doc_embeddings,   # their vectors
)
```

Drop this in front of your existing retrieve → generate step. No changes to retrieval logic, no changes to your LLM client.
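The top-K chunk selection that ContextCompressor is described as performing (rank context chunks by cosine similarity to the query, keep the best K) can be sketched in a few lines. This is an illustrative version only: `compress_context` and its parameters are hypothetical, and the library's own ranking, tie-breaking, and choice of K are not documented here.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def compress_context(
    query_vector: list[float],
    chunks: list[str],
    chunk_vectors: list[list[float]],
    top_k: int = 3,
) -> list[str]:
    """Keep only the top_k chunks most similar to the query, best first."""
    ranked = sorted(
        zip(chunks, chunk_vectors),
        key=lambda pair: cosine(query_vector, pair[1]),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:top_k]]


# Toy 2-D vectors standing in for real embeddings.
chunks = ["billing limits", "office dog policy", "plan quotas"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(compress_context([1.0, 0.0], chunks, vectors, top_k=2))
```

Only the selected chunks would then be placed in the prompt, which is where the context-token saving comes from.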
```python
import os

from prism.cluster.alerts import AlertManager, SMTPConfig

alerts = AlertManager(
    fabric=chorus_fabric,
    mail_config=SMTPConfig(
        host="smtp.gmail.com",
        port=587,
        username="you@gmail.com",
        password=os.getenv("GMAIL_APP_PASS"),
        recipients=["admin@yourcompany.com"],
    ),
)

await alerts.evaluate_health(health_snapshot)
# Fires email + SIGNAL frame to all nodes if any of the 12 default rules trigger
```

| Capability | PrismLib Micro | Prometheus + Alertmanager | Redis cluster | Raft / etcd |
|---|---|---|---|---|
| Cross-node token cache | Yes, built-in | No | Manual, exact match | No |
| Alert propagation | <1 s, no infra | 30–60 s, stack needed | No | No |
| Auto failover | ~3–4 s, built-in | No | Sentinel, 2–30 s | 150–500 ms |
| Context compression | 58–64%, free | No | No | No |
| Extra infrastructure | None | Prometheus stack | Redis cluster | etcd cluster |

| Tier | Nodes | Price | Includes |
|---|---|---|---|
| Open source | Unlimited | Free forever | All cluster code, Apache 2.0 |
| ChorusMesh Developer (coming soon) | Up to 3 | $29/mo after 30-day trial | ClusterCache + failover + AlertManager |
| ChorusMesh Team | Up to 10 | $149/mo | + Raft consensus, message broker adapters |
| ChorusMesh Business | Up to 50 | $499/mo | + multi-region routing, SLA 99.9% |
| Enterprise | Unlimited | Contact us | + air-gap, compliance, dedicated Slack |

For enterprise agreements: [insightits.info@gmail.com](mailto:insightits.info@gmail.com)

PrismLib is open source (Apache 2.0) and free to use.
If your team needs any of the following, contact us for enterprise pricing:

- On-premises deployment support: air-gapped installs, hardened Docker images, SOC 2 documentation
- SLA-backed support: guaranteed response times, incident escalation, dedicated Slack channel
- Custom embedding model integration: fine-tuned domain-specific embedders for higher hit rates in specialized domains (legal, medical, finance, code)
- Multi-region CHORUS Fabric topology: active-active DB node clusters, cross-region WAL fan-out, geo-aware driver routing
- Audit logging and compliance exports: per-query access logs, tenant isolation attestation reports, GDPR data lineage
- Professional services: architecture review, migration from Redis/GPTCache, custom RowVectorizer schemas

Contact: insightits.info@gmail.com
GitHub: [github.com/insightitsGit/prismlib](https://github.com/insightitsGit/prismlib)

PrismLib is free and will stay free. If it saved your team money on OpenAI bills or database infrastructure, consider sponsoring: it covers benchmark compute and maintenance time, and keeps development moving.

Your name or logo here: become a sponsor.

It is one package, `prismlib`, published once. The wrapper, driver, and cache are all extras of the same package. Users install what they need:

```bash
pip install "prismlib[cache]"     # PrismCache only
pip install "prismlib[wrapper]"   # Server Wrapper (DB node)
pip install "prismlib[fabric]"    # DLL Driver (app node)
pip install "prismlib[all]"       # Everything
```

To publish a new version:

```bash
# 1. Bump version in pyproject.toml (currently 0.4.0)

# 2. Build the distribution
pip install build twine
python -m build

# 3. Upload to PyPI (use your token from pypi.org/manage/account/token/)
python -m twine upload dist/* --username __token__ --password pypi-YOUR_TOKEN
```

That's it. One upload covers all three install variants, because PyPI resolves the extras automatically.

Apache 2.0 · InsightIts © 2026