Tensor-native LLM cache, distributed DB driver, and cluster intelligence — one package.
PrismLib has three layers. Use any combination:
| Layer | What it solves | Key number | Install |
|---|---|---|---|
| PrismCache | LLM API cost: semantic cache catches repeated and paraphrased queries in-process | 91–96% hit rate | pip install "prismlib[cache]" |
| PrismDriver | DB read latency: WAL-streamed local index replaces network round-trips | 98.6% latency reduction (143 ms → 2 ms) | pip install "prismlib[fabric]" |
| PrismLib Micro | Cluster token cost + HA: shares answers across containers, auto-failover, health mesh | 76% fewer tokens cluster-wide | included in prismlib[fabric] |
All three run entirely in-process. No Redis. No Pinecone. No Prometheus. No Kubernetes operator.
Wraps any LLM call. Paraphrased queries return the cached answer without touching the API.
Multi-tenant math: a Johnson-Lindenstrauss (JL) projection seeded by SHA-256(tenant_id) gives each tenant a mathematically isolated address space. Isolation comes from the projection matrix itself, not from a query filter.
Two components on two machines:
- Server Wrapper (DB node): intercepts WAL/binlog, vectorizes rows, and streams encrypted float32 frames via CHORUS Fabric
- DLL Driver (app node): subscribes to the stream and keeps a local PrismResonance index warm; reads never leave the process
Built into prismlib[fabric], zero extra install:
- ClusterCache: once any node answers a query, every peer caches it via CHORUS TOKEN_SYNC frames. BLUE and ORANGE nodes are billed 0 tokens on warm queries.
- AlertManager: 12 default health rules; fires a SIGNAL frame plus an admin email in <1 s when CPU/RAM/disk thresholds are crossed. No scrape interval. No Datadog agent.
- Blue/Green/Orange failover: GREEN is the active master, BLUE is the warm standby (auto-promotes in ~3 s if GREEN goes silent), ORANGE is the syncing reserve.
- ContextCompressor: cosine-similarity top-K chunk selection before every LLM call. 58–64% context token reduction, zero extra cost.
Built on two open-source InsightIts libraries:
- PrismResonance: wave-memory similarity engine powering every cache lookup and local vector index
- CHORUS Fabric: encrypted gRPC binary streaming protocol carrying float32 tensor frames between nodes
pip install "prismlib[cache]"
pip install "prismlib[cache,cache-openai]"
pip install "prismlib[cache,cache-anthropic]"
pip install "prismlib[cache,cache-ollama]"
pip install "prismlib[fabric]"
pip install "prismlib[wrapper]"
prism-wrapper --config /etc/prism/wrapper.toml
pip install "prismlib[all]"
Save 60-80% of LLM API calls by serving semantically identical queries from cache. Paraphrases hit the cache — "How do I reset my password?" and "I forgot my password, help" return the same answer without a second LLM call.
from openai import OpenAI
from prism.cache import PrismCache

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
cache = PrismCache.build(tenant_id="my-app", llm_model="gpt-4o")

def ask(question: str) -> str:
    # call_fn is only invoked when the cache has no matching answer
    return cache.get_or_call(
        query=question,
        call_fn=lambda: openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": question}],
        ).choices[0].message.content,
    )
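To make the paraphrase behaviour concrete, here is a minimal sketch of a semantic cache. It is not PrismCache's implementation: the class `TinySemanticCache`, the 0.9 threshold, and the toy 3-dimensional embeddings are all hypothetical, and a real setup would get its vectors from an embedding model.

```python
import numpy as np


class TinySemanticCache:
    """Toy in-process semantic cache: a hit is any stored query whose
    embedding has cosine similarity >= threshold with the new query."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self._vectors = []   # unit-normalised query embeddings
        self._answers = []
        self.hits = 0
        self.misses = 0

    def get_or_call(self, vector, call_fn):
        v = np.asarray(vector, dtype=np.float32)
        v = v / np.linalg.norm(v)
        if self._vectors:
            sims = np.stack(self._vectors) @ v   # cosine, since every row is unit length
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                self.hits += 1
                return self._answers[best]
        # Miss: pay for the LLM call once, then remember the answer.
        self.misses += 1
        answer = call_fn()
        self._vectors.append(v)
        self._answers.append(answer)
        return answer


calls = []

def fake_llm():
    calls.append(1)
    return "Use the 'Forgot password' link on the sign-in page."

cache = TinySemanticCache(threshold=0.9)
# Made-up embeddings: the first two stand in for paraphrases of one question.
first = cache.get_or_call([0.90, 0.10, 0.00], fake_llm)
second = cache.get_or_call([0.88, 0.14, 0.02], fake_llm)
other = cache.get_or_call([0.00, 0.20, 0.95], fake_llm)
print(cache.hits, cache.misses, len(calls))
```

The second query is served from the cache because its vector is nearly parallel to the first, while the unrelated third query triggers a fresh call.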
Each tenant gets a mathematically isolated cache space (JL projection seeded by tenant ID). One customer's cached answers never bleed into another's.
from prism.cache import PrismCache
def get_cache(tenant_id: str) -> PrismCache:
    return PrismCache.build(tenant_id=tenant_id, llm_model="gpt-4o-mini")

cache_a = get_cache("acme-corp")
cache_b = get_cache("globex-inc")
# llm_call is your own zero-argument function that calls the LLM
answer = cache_a.get_or_call(query="What is my plan limit?", call_fn=llm_call)
Wrap your existing LLM endpoint without changing any business logic.
from fastapi import FastAPI, Request
from prism.cache import PrismCache
app = FastAPI()
cache = PrismCache.build(tenant_id="api", llm_model="gpt-4o")
@app.post("/chat")
async def chat(request: Request):
    body = await request.json()
    question = body["message"]
    # llm_client is your existing LLM client
    answer = await cache.aget_or_call(
        query=question,
        call_fn=lambda: llm_client.ask(question),
    )
    return {"answer": answer}
from prism.cache import PrismCache
_cache = PrismCache.build(tenant_id="django-app", llm_model="gpt-4o")
class PrismCacheMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        return self.get_response(request)

    def process_llm_query(self, question: str, call_fn) -> str:
        return _cache.get_or_call(query=question, call_fn=call_fn)
import asyncio
from prism.cache import PrismCache
cache = PrismCache.build(tenant_id="batch", llm_model="gpt-4o-mini")
async def process_batch(questions: list[str]) -> list[str]:
    tasks = [
        cache.aget_or_call(query=q, call_fn=lambda q=q: llm_call(q))
        for q in questions
    ]
    return await asyncio.gather(*tasks)
from prism.cache import PrismCache
cache = PrismCache.build(tenant_id="finance", llm_model="gpt-4o")
metrics = cache.metrics()
print(f"Hit rate: {metrics.hit_rate:.0%}")
print(f"Tokens saved: {metrics.tokens_saved:,}")
print(f"Cost saved today: ${metrics.cost_saved_usd:.2f}")
print(f"Projected monthly: ${metrics.cost_saved_usd * 30:.0f}")
PrismDriver has two components that work together. Install each on the right machine.
On the DB node — Server Wrapper
The Server Wrapper is an OS daemon that sits next to your database. It reads WAL/binlog changes, vectorizes rows using RowVectorizer, encrypts them with TensorCipher (via CHORUS Fabric), and streams float32 frames to every connected DLL Driver.
pip install "prismlib[wrapper]"
prism-wrapper --config /etc/prism/wrapper.toml
[database]
flavor = "postgresql"
dsn = "postgresql://user:pass@localhost/mydb"
[chorus]
listen_port = 50051
tenant_id = "products-service"
Supported databases: PostgreSQL (WAL / wal2json), MySQL (binlog), CockroachDB (EXPERIMENTAL CHANGEFEED), TiDB (push model).
On the app node — DLL Driver
The DLL Driver is an in-process library that replaces your DB connection string. On startup it connects to the Server Wrapper, subscribes to the CHORUS Fabric stream, and keeps a local PrismResonance index warm. All reads hit the in-process index — no network round-trip, sub-millisecond latency.
pip install "prismlib[fabric]"
# Before: every read is a network round-trip to the database
import psycopg2
conn = psycopg2.connect("postgresql://user:secret@db-host:5432/mydb")

# After: reads are served from the in-process index (run inside an async function)
from prism.ffi import PrismDriver, DriverConfig
async with PrismDriver(DriverConfig(wrapper_host="db-proxy-1")) as driver:
    results = await driver.query(
        embedding=my_embedding_vector,
        top_k=5,
        threshold=0.85,
    )
The driver keeps a local PrismResonance cache warm via a background WAL subscription. Reads never touch the DB — they hit the in-process float32 index.
from prism.ffi import PrismDriver, DriverConfig
import numpy as np
config = DriverConfig(
wrapper_host="10.0.1.50",
wrapper_port=50051,
tenant_id="products-service",
)
# Similarity query against the local index
async with PrismDriver(config) as driver:
    query_vec = np.array([...], dtype=np.float32)
    matches = await driver.query(embedding=query_vec, top_k=10)
    for m in matches:
        print(f"{m.row_id} score={m.score:.3f} {m.text_repr}")

# Write through the driver
async with PrismDriver(config) as driver:
    ack = await driver.write(
        row_id="product-42",
        data={"name": "Widget Pro", "price": 29.99, "stock": 150},
    )
    print(f"Written: event_id={ack.event_id}")
// Go
import prism "github.com/insightitsGit/prismlib/go"
driver, _ := prism.Connect("db-proxy-1:50051", "my-tenant")
defer driver.Close()
results, _ := driver.Query(embedding, prism.QueryOpts{TopK: 5, Threshold: 0.85})
// C#
using InsightIts.Prism;
await using var driver = new PrismDriver("db-proxy-1:50051", tenantId: "my-tenant");
await driver.ConnectAsync();
var results = await driver.QueryAsync(embedding, topK: 5, threshold: 0.85f);
// PHP 8.0+
$driver = new PrismDriver('db-proxy-1', 50051, 'my-tenant');
$driver->connect();
$results = $driver->query($embedding, topK: 5, threshold: 0.85);
┌─ DB Node ──────────────────────────────────────────────────────┐
│ PostgreSQL / MySQL / CockroachDB / TiDB │
│ │ WAL / binlog / changefeed │
│ ┌────▼───────────────────────────────────────────────────┐ │
│ │ prism-wrapper (pip install "prismlib[wrapper]") │ │
│ │ RowVectorizer → TensorCipher (V_enc = V @ K) │ │
│ │ → HMAC-SHA256 watermark → CHORUSPublisher │ │
│ └────────────────────────┬───────────────────────────────┘ │
└───────────────────────────┼────────────────────────────────────┘
│ CHORUS Fabric (gRPC, encrypted float32)
┌─ App Node — GREEN ────────┼────────────────────────────────────┐
│ ┌────────────────────────▼──────────────────────────────┐ │
│ │ PrismDriver DLL (pip install "prismlib[fabric]") │ │
│ │ Subscribe loop → decrypt → PrismResonance index │ │
│ └──────────────────────────┬────────────────────────────┘ │
│ │ sub-ms query │
│ ┌──────────────────────────▼────────────────────────────┐ │
│ │ Your Application │ │
│ │ ┌─────────────────┐ ┌──────────────────────────┐ │ │
│ │ │ PrismCache │ │ PrismDriver │ │ │
│ │ │ LLM cache │ │ local PrismResonance │ │ │
│ │ │ [cache] │ │ (no DB round-trip) │ │ │
│ │ └─────────────────┘ └──────────────────────────┘ │ │
│ │ ┌──────────────────────────────────────────────────┐ │ │
│ │ │ ClusterCache ← TOKEN_SYNC frames │ │ │
│ │ │ AlertManager ← HEALTH / SIGNAL frames │ │ │
│ │ └──────────────────────────────────────────────────┘ │ │
│ └────────────────────────────────────────────────────────┘ │
└──────────────────────────────┬─────────────────────────────────┘
│ CHORUS mesh
┌────────────────────┴────────────────────┐
│ TOKEN_SYNC · HEALTH · SIGNAL · CONFIG │
▼ ▼
┌─ App Node — BLUE ──────┐ ┌─ App Node — ORANGE ─────┐
│ ClusterCache │ │ ClusterCache │
│ (warm standby) │ │ (syncing reserve) │
│ auto-promotes if │ │ separate network │
│ GREEN silent >3s │ │ │
└────────────────────────┘ └──────────────────────────┘
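The promotion rule in the diagram ("auto-promotes if GREEN silent >3s") can be sketched as a heartbeat timeout. This is an illustrative sketch only: `StandbyMonitor`, its method names, and the idea that GREEN's liveness is tracked from periodic frames are assumptions, not the library's API. Timestamps are passed in explicitly so the example is deterministic; a real loop would use something like `time.monotonic()`.

```python
from dataclasses import dataclass


@dataclass
class StandbyMonitor:
    """Tracks the last GREEN heartbeat and decides when BLUE should take over."""
    promote_after_s: float = 3.0   # assumption: mirrors the ">3s" in the diagram
    role: str = "BLUE"
    last_heartbeat: float = 0.0

    def on_heartbeat(self, now: float) -> None:
        # Called whenever a frame from GREEN arrives.
        self.last_heartbeat = now

    def tick(self, now: float) -> str:
        # Called periodically; promotes once GREEN has been silent too long.
        if self.role == "BLUE" and now - self.last_heartbeat > self.promote_after_s:
            self.role = "GREEN"
        return self.role


monitor = StandbyMonitor()
monitor.on_heartbeat(now=10.0)
print(monitor.tick(now=12.5))   # 2.5 s of silence: still BLUE
print(monitor.tick(now=13.5))   # 3.5 s of silence: promoted to GREEN
```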
Live results from Azure Container App (westus2, 1 vCPU / 2 GiB, mock LLM baseline):
| Scenario | Users | Duration | Hit rate | Queries | Tokens saved | Monthly est. |
|---|---|---|---|---|---|---|
| Light | 20 | 60s | 91.0% | 5,936 | 1,374,464 | $594 |
| Mixed | 50 | 300s | 95.9% | 6,973 | 1,673,216 | $723 |
Numbers use a mock LLM (80ms sleep). With real GPT-4o calls (1–3s), latency speedup is 4–13×; token savings are identical.
Live two-node benchmark (Azure Container Apps westus2, 30 users × 60s per phase):
| Phase | Path | Avg latency | Queries |
|---|---|---|---|
| Baseline (no driver) | App → DB node, network | 142.8 ms | 3,864 |
| Driver (local index) | App → in-process PrismResonance | 2.0 ms | 1,479 |
70.7× faster · 98.6% latency reduction
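As a quick check, both headline figures follow from the two average latencies in the table. Using the rounded values gives roughly 71×; the reported 70.7× was presumably computed from unrounded averages (an assumption, since the raw numbers are not shown here).

```python
baseline_ms = 142.8   # App → DB node over the network (rounded table value)
driver_ms = 2.0       # App → in-process index (rounded table value)

speedup = baseline_ms / driver_ms
reduction_pct = (1 - driver_ms / baseline_ms) * 100

print(f"{speedup:.1f}x faster, {reduction_pct:.1f}% latency reduction")
```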
The 98.6% reduction is a direct result of CHORUS Fabric doing its job. The subscription loop streamed 11,000 rows at 26,000 rows/s from the DB node into the local PrismResonance index before the load test began. By the time the first /driver/query hit arrived, there were zero network hops: the answer was already in-process. This is what CHORUS Fabric was designed for: getting tensor data to where the query is, before the query arrives.
python benchmark/load/run_driver_benchmark.py \
--app-url https://prism-benchmark.nicestone-720c6a9b.westus2.azurecontainerapps.io \
--db-url https://prism-wrapper-sim.nicestone-720c6a9b.westus2.azurecontainerapps.io \
--users 30 --duration 60
python benchmark/load/run_benchmark.py \
--host https://prism-benchmark.nicestone-720c6a9b.westus2.azurecontainerapps.io \
--scenario mixed
See benchmark/ for full results JSON, Locust CSV files, and the Azure deploy script.
PrismLib is built on two InsightIts open-source libraries. You can use them directly if you need lower-level access.
PrismResonance · github.com/insightitsGit/prismresonance · pip install prismresonance
The wave-memory similarity engine. Every cache lookup and local vector index in PrismLib goes through PrismResonance.
How it works:
- Receives a float32 embedding vector
- Reduces it to 64 dimensions with a Johnson-Lindenstrauss projection matrix seeded by SHA-256(tenant_id); this is what gives each tenant a mathematically isolated address space
- Computes similarity as wave interference (cosine in projected space) in three lock-free phases: snapshot → ONNX MatMul → rank
- Returns ranked candidates in sub-millisecond time, entirely in-process
PrismCache wraps this for LLM response caching. PrismDriver's local replica is a PrismResonance index kept warm by WAL streaming.
from prismresonance import PrismProjector, WaveIndex
projector = PrismProjector(dim=64, tenant_id="my-tenant")
index = WaveIndex(projector)
index.add(vector=my_embedding, payload={"row_id": "product-1", "text": "Widget"})
results = index.query(vector=query_embedding, top_k=5, threshold=0.85)
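The "How it works" steps can be approximated in plain NumPy. The sketch below is an illustration of the idea, not PrismResonance's code: the helper names `tenant_projection` and `top_k`, the use of the first 8 digest bytes as the seed, and the Gaussian matrix are all assumptions, and it skips ONNX and the lock-free phases entirely.

```python
import hashlib
import numpy as np


def tenant_projection(tenant_id: str, in_dim: int, out_dim: int = 64) -> np.ndarray:
    """Gaussian JL matrix whose random seed is derived from SHA-256(tenant_id)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")   # assumption: first 8 bytes as the seed
    rng = np.random.default_rng(seed)
    # Scaling by 1/sqrt(out_dim) approximately preserves vector norms.
    return (rng.standard_normal((in_dim, out_dim)) / np.sqrt(out_dim)).astype(np.float32)


def top_k(query, rows, proj, k=5, threshold=0.0):
    """Project, score by cosine similarity in the reduced space, return (index, score) pairs."""
    q = query @ proj
    r = rows @ proj
    q = q / np.linalg.norm(q)
    r = r / np.linalg.norm(r, axis=1, keepdims=True)
    scores = r @ q
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order if scores[i] >= threshold]


rng = np.random.default_rng(0)
rows = rng.standard_normal((100, 384)).astype(np.float32)   # stand-in embeddings
query = rows[7] + 0.05 * rng.standard_normal(384).astype(np.float32)

proj_a = tenant_projection("acme-corp", in_dim=384)
proj_b = tenant_projection("globex-inc", in_dim=384)
print(top_k(query, rows, proj_a, k=3))     # best match should be the perturbed source row
print(np.allclose(proj_a, proj_b))         # each tenant gets its own matrix
```

Because the seed depends only on the tenant ID, the same tenant always rebuilds the same matrix, while vectors projected for one tenant are not comparable with another tenant's projected space.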
CHORUS Fabric · github.com/insightitsGit/chorus_fabric · pip install chorus-fabric
The secure gRPC binary streaming protocol for machine-to-machine tensor communication. PrismDriver uses CHORUS Fabric as its transport layer between the server wrapper on the DB node and the DLL driver on the app node.
How it works:
- prism-wrapper (DB node) vectorizes WAL row events via RowVectorizer, encrypts them with TensorCipher (V_enc = V @ K), appends an HMAC-SHA256 watermark, and publishes batches of raw float32 frames
- PrismDriver (app node) opens a persistent WrapperService.Subscribe() gRPC stream, receives encrypted frames, decrypts them, and feeds them into the local PrismResonance index
- Transport is pure binary float32 over gRPC server-streaming: no JSON serialization, no REST overhead
- The WrapperService proto also exposes Query, Write, Health, and Hello RPCs for direct interaction
from chorus_fabric import CHORUSPublisher, DriverEndpoint
publisher = CHORUSPublisher(config)
publisher.add_driver(DriverEndpoint(host="10.0.1.50", port=50051, tenant_id="prod"))
await publisher.run(event_queue) # streams WAL events to all connected drivers
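The encrypt-and-watermark step (V_enc = V @ K followed by an HMAC-SHA256 tag) can be sketched with NumPy and the standard library. The text only gives the V @ K formula, so several details here are assumptions: that K is an orthogonal matrix (so decryption is V_enc @ Kᵀ), that the watermark is a 32-byte tag appended to the raw float32 payload, and the helper names `make_key`, `seal`, and `open_frame`.

```python
import hashlib
import hmac
import numpy as np


def make_key(dim: int, seed: int) -> np.ndarray:
    """Random orthogonal key matrix K (assumption: orthogonal, so K^-1 = K^T)."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q.astype(np.float32)


def seal(v: np.ndarray, key: np.ndarray, mac_key: bytes) -> bytes:
    """Transform a batch of row vectors as V @ K and append an HMAC-SHA256 tag."""
    payload = (v @ key).astype(np.float32).tobytes()
    tag = hmac.new(mac_key, payload, hashlib.sha256).digest()   # 32 bytes
    return payload + tag


def open_frame(frame: bytes, key: np.ndarray, mac_key: bytes) -> np.ndarray:
    """Verify the tag, then recover V = V_enc @ K^T."""
    payload, tag = frame[:-32], frame[-32:]
    expected = hmac.new(mac_key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("frame failed HMAC verification")
    v_enc = np.frombuffer(payload, dtype=np.float32).reshape(-1, key.shape[0])
    return v_enc @ key.T


key = make_key(dim=8, seed=42)
mac_key = b"shared-secret"                      # hypothetical shared secret
batch = np.arange(16, dtype=np.float32).reshape(2, 8)
frame = seal(batch, key, mac_key)
restored = open_frame(frame, key, mac_key)
print(np.allclose(restored, batch, atol=1e-3))
```

Checking the tag before touching the payload means a corrupted or tampered frame is rejected before it can reach the local index.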
CHORUS Fabric is the same protocol used in the CHORUS M2M system — InsightIts' 4-container gRPC topology for tensor communication between AI agents. The 98.6% latency reduction in the PrismDriver benchmark is direct proof that the protocol works at production scale: 11,000 rows streamed at 26,000 rows/s across Azure inter-container networking, then served locally at 2ms.
PrismLib Micro is the cluster layer built into prismlib[fabric]. It adds four capabilities on top of the single-node stack, with no extra install and no extra infrastructure.
| Component | What it does |
|---|---|
| ClusterCache | Shares LLM answers across all nodes via CHORUS TOKEN_SYNC frames. Once any node answers a query, every other node serves it for 0 tokens. |
| AlertManager | Broadcasts health alerts as SIGNAL frames + admin email the moment CPU/RAM/disk/latency thresholds are crossed. No Prometheus. No Datadog. |
| Blue/Green/Orange failover | Three-tier hot-standby: GREEN (active), BLUE (warm standby, auto-promotes in ~3s), ORANGE (syncing reserve). No Raft dependency. No K8s operator. |
| ContextCompressor | Ranks RAG context chunks by cosine similarity, keeps top-K. Saves 58–64% of context tokens before every LLM call. In-process, no extra model. |
| Metric | Result |
|---|---|
| Token savings — cluster avg | 76.1% |
| BLUE node (cluster cache hit) | 100% — 0 LLM calls |
| ORANGE node (cross-network cache hit) | 100% — 0 LLM calls |
| Context compression | 58–64% per query |
| Health alert propagation | <1 s (709–711 ms measured) |
| Failover — BLUE promoted to GREEN | ~3–4 s, no human step |
See benchmark/cluster/ for the full benchmark code and benchmark/cluster/cluster_benchmark_results.json for the raw results.
from prism.cluster.cache import ClusterCache
cache = ClusterCache(node_id="node-1", fabric=chorus_fabric)
answer = await cache.get_or_call(
    query=user_question,
    query_vector=embed(user_question),
    call_fn=lambda: llm.complete(user_question),
    context_chunks=retrieved_docs,   # your RAG chunks
    chunk_vectors=doc_embeddings,    # their vectors
)
Drop this in front of your existing retrieve → generate step. No changes to retrieval logic, no changes to your LLM client.
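The `context_chunks` and `chunk_vectors` arguments feed the ContextCompressor's cosine-similarity top-K selection. A minimal sketch of that selection step, assuming precomputed embeddings, looks like this; `select_top_k` is a hypothetical helper and the 2-D vectors are made up for readability.

```python
import numpy as np


def select_top_k(query_vector, chunk_vectors, chunks, k=3):
    """Return the k chunks whose embeddings are most cosine-similar to the query."""
    q = np.asarray(query_vector, dtype=np.float32)
    c = np.asarray(chunk_vectors, dtype=np.float32)
    q = q / np.linalg.norm(q)
    c = c / np.linalg.norm(c, axis=1, keepdims=True)
    scores = c @ q
    keep = np.argsort(-scores)[:k]
    # Preserve the original document order among the kept chunks.
    return [chunks[i] for i in sorted(keep)]


chunks = ["pricing table", "refund policy", "office address", "plan limits"]
vectors = [[0.9, 0.1], [0.2, 0.9], [-0.5, 0.5], [0.8, 0.3]]   # made-up embeddings
selected = select_top_k([1.0, 0.2], vectors, chunks, k=2)
print(selected)
```

Only the selected chunks would be sent to the LLM, which is where the context-token saving comes from.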
import os
from prism.cluster.alerts import AlertManager, SMTPConfig

alerts = AlertManager(
    fabric=chorus_fabric,
    mail_config=SMTPConfig(
        host="smtp.gmail.com", port=587,
        username="you@gmail.com",
        password=os.getenv("GMAIL_APP_PASS"),  # keep the secret out of source code
        recipients=["admin@yourcompany.com"],
    ),
)
await alerts.evaluate_health(health_snapshot)
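Conceptually, evaluating a health snapshot means comparing each metric against a threshold rule. The sketch below shows that idea in isolation. The `Rule` class, the metric names, and the three thresholds are hypothetical; the library's 12 default rules are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    metric: str        # key in the health snapshot, e.g. "cpu_pct"
    threshold: float   # fire when the observed value exceeds this
    severity: str


# Hypothetical rules for illustration only.
RULES = [
    Rule("cpu_pct", 90.0, "critical"),
    Rule("ram_pct", 85.0, "warning"),
    Rule("disk_pct", 95.0, "critical"),
]


def evaluate(snapshot: dict, rules=RULES) -> list:
    """Return one alert message per rule whose threshold is crossed."""
    alerts = []
    for rule in rules:
        value = snapshot.get(rule.metric)
        if value is not None and value > rule.threshold:
            alerts.append(f"[{rule.severity}] {rule.metric}={value} > {rule.threshold}")
    return alerts


fired = evaluate({"cpu_pct": 97.2, "ram_pct": 61.0, "disk_pct": 40.0})
print(fired)
```

Evaluating on every snapshot, instead of waiting for a scrape interval, is what allows an alert to go out as soon as a threshold is crossed.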
| Capability | PrismLib Micro | Prometheus + Alertmanager | Redis cluster | Raft / etcd |
|---|---|---|---|---|
| Cross-node token cache | Yes, built-in | No | Manual (exact match) | No |
| Alert propagation | <1 s, no infra | 30–60 s, stack needed | No | No |
| Auto failover | ~3–4 s, built-in | No | Sentinel, 2–30 s | 150–500 ms |
| Context compression | 58–64%, free | No | No | No |
| Extra infrastructure | None | Prometheus stack | Redis cluster | etcd cluster |
| Tier | Nodes | Price | Includes |
|---|---|---|---|
| Open source | Unlimited | Free forever | All cluster code, Apache 2.0 |
| ChorusMesh Developer (coming soon) | Up to 3 | $29/mo after 30-day trial | ClusterCache + failover + AlertManager |
| ChorusMesh Team | Up to 10 | $149/mo | + Raft consensus, message broker adapters |
| ChorusMesh Business | Up to 50 | $499/mo | + multi-region routing, SLA 99.9% |
| Enterprise | Unlimited | Contact us | + air-gap, compliance, dedicated Slack |
For enterprise agreements: insightits.info@gmail.com
PrismLib is open source (Apache 2.0) and free to use. If your team needs any of the following, contact us for enterprise pricing:
- On-premises deployment support: air-gapped installs, hardened Docker images, SOC 2 documentation
- SLA-backed support: guaranteed response times, incident escalation, dedicated Slack channel
- Custom embedding model integration: fine-tuned domain-specific embedders for higher hit rates in specialized domains (legal, medical, finance, code)
- Multi-region CHORUS Fabric topology: active-active DB node clusters, cross-region WAL fan-out, geo-aware driver routing
- Audit logging and compliance exports: per-query access logs, tenant isolation attestation reports, GDPR data lineage
- Professional services: architecture review, migration from Redis/GPTCache, custom RowVectorizer schemas
Contact: insightits.info@gmail.com
GitHub: github.com/insightitsGit/prismlib
PrismLib is free and will stay free. If it saved your team money on OpenAI bills or database infrastructure, consider sponsoring: it covers benchmark compute and maintenance time, and keeps development moving.
Your name or logo here — become a sponsor
It is one package, prismlib, published once. The wrapper, driver, and cache are all extras of the same package. Users install what they need:
pip install "prismlib[cache]" # PrismCache only
pip install "prismlib[wrapper]" # Server Wrapper (DB node)
pip install "prismlib[fabric]" # DLL Driver (App node)
pip install "prismlib[all]" # Everything
To publish a new version:
pip install build twine
python -m build
python -m twine upload dist/* --username __token__ --password pypi-YOUR_TOKEN
That's it. One upload covers all three install variants — PyPI resolves the extras automatically.
Apache 2.0 — InsightIts © 2026