[ARTICLE · art-41488] src=github.com ↗ pub= topic=large-language-models verified=true sentiment=↑ positive

PrismLib – semantic LLM cache and cluster mesh that cuts token spend

PrismLib, a new open-source Python package, offers a semantic LLM cache, distributed database driver, and cluster intelligence mesh that can reduce token spend by 60-80% and cut DB read latency by 98.6%. The tool runs entirely in-process without external dependencies like Redis or Kubernetes, providing multi-tenant isolation and automatic failover for AI applications.

13 min read · published Jun 27, 2026

Tensor-native LLM cache, distributed DB driver, and cluster intelligence — one package.

PrismLib has three layers. Use any combination:

Layer | What it solves | Key number | Install
PrismCache | LLM API cost: semantic cache catches repeated and paraphrased queries in-process | 91–96% hit rate | pip install "prismlib[cache]"
PrismDriver | DB read latency: WAL-streamed local index replaces network round-trips | 98.6% latency reduction (143 ms → 2 ms) | pip install "prismlib[fabric]"
PrismLib Micro | Cluster token cost + HA: shares answers across containers, auto-failover, health mesh | 76% fewer tokens cluster-wide | included in prismlib[fabric]

All three run entirely in-process. No Redis. No Pinecone. No Prometheus. No Kubernetes operator.

PrismCache wraps any LLM call. Paraphrased queries return the cached answer without touching the API. Multi-tenant math: a JL projection seeded by SHA-256(tenant_id) gives each tenant a mathematically isolated address space. It is not a query filter; it is a projection matrix.

PrismDriver has two components on two machines:

  • Server Wrapper (DB node): intercepts WAL/binlog, vectorizes rows, and streams encrypted float32 frames via CHORUS Fabric
  • DLL Driver (app node): subscribes to the stream and keeps a local PrismResonance index warm; reads never leave the process

PrismLib Micro is built into prismlib[fabric], with zero extra install:

  • ClusterCache: once any node answers a query, every peer caches it via CHORUS TOKEN_SYNC frames. BLUE and ORANGE nodes are billed 0 tokens on warm queries.
  • AlertManager: 12 default health rules; fires a SIGNAL frame plus an admin email in <1s when CPU/RAM/disk thresholds are crossed. No scrape interval. No Datadog agent.
  • Blue/Green/Orange failover: GREEN is the active master, BLUE is a warm standby (auto-promotes in ~3s if GREEN goes silent), ORANGE is a syncing reserve.
  • ContextCompressor: cosine-similarity top-K chunk selection before every LLM call. 58–64% context token reduction, zero extra cost.

Built on two open-source InsightIts libraries:

  • PrismResonance: wave-memory similarity engine powering every cache lookup and local vector index
  • CHORUS Fabric: encrypted gRPC binary streaming protocol carrying float32 tensor frames between nodes

pip install "prismlib[cache]"

pip install "prismlib[cache,cache-openai]"

pip install "prismlib[cache,cache-anthropic]"

pip install "prismlib[cache,cache-ollama]"

pip install "prismlib[fabric]"

pip install "prismlib[wrapper]"
prism-wrapper --config /etc/prism/wrapper.toml

pip install "prismlib[all]"

Save 60-80% of LLM API calls by serving semantically identical queries from cache. Paraphrases hit the cache — "How do I reset my password?" and "I forgot my password, help" return the same answer without a second LLM call.

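from openai import OpenAI
from prism.cache import PrismCache

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
cache = PrismCache.build(tenant_id="my-app", llm_model="gpt-4o")

def ask(question: str) -> str:
    # call_fn only runs on a cache miss; hits return the stored answer.
    return cache.get_or_call(
        query=question,
        call_fn=lambda: openai_client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": question}],
        ).choices[0].message.content,
    )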
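To make the hit/miss behaviour concrete, here is a minimal sketch of a semantic cache built on cosine similarity with a fixed threshold. It is not PrismCache's implementation: `ToySemanticCache`, the `embed` callback, the 0.85 threshold, and the hand-written vectors in `FAKE_EMBEDDINGS` are all assumptions for illustration.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToySemanticCache:
    """Hypothetical stand-in for a semantic cache; not PrismCache itself."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.85):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[Vector, str]] = []
        self.hits = 0
        self.misses = 0

    def get_or_call(self, query: str, call_fn: Callable[[], str]) -> str:
        vec = self.embed(query)
        # Linear scan for the most similar stored query.
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= self.threshold:
            self.hits += 1
            return best[1]
        # Miss: pay for one LLM call and remember the answer.
        self.misses += 1
        answer = call_fn()
        self.entries.append((vec, answer))
        return answer


# Hand-written vectors standing in for a real embedding model (assumption).
FAKE_EMBEDDINGS = {
    "How do I reset my password?": [0.90, 0.10, 0.00],
    "I forgot my password, help": [0.85, 0.20, 0.05],
    "What is my plan limit?": [0.00, 0.10, 0.95],
}

toy_cache = ToySemanticCache(embed=FAKE_EMBEDDINGS.__getitem__)
first = toy_cache.get_or_call("How do I reset my password?", lambda: "Use the reset link.")
second = toy_cache.get_or_call("I forgot my password, help", lambda: "never called")
third = toy_cache.get_or_call("What is my plan limit?", lambda: "10 seats.")
```

The paraphrase lands above the threshold and reuses the first answer, while the unrelated question triggers its own call.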

Each tenant gets a mathematically isolated cache space (JL projection seeded by tenant ID). One customer's cached answers never bleed into another's.

from prism.cache import PrismCache

def get_cache(tenant_id: str) -> PrismCache:
    return PrismCache.build(tenant_id=tenant_id, llm_model="gpt-4o-mini")

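cache_a = get_cache("acme-corp")
cache_b = get_cache("globex-inc")  # separate projection, separate cache space

# llm_call is your own zero-argument function that calls the LLM.
answer = cache_a.get_or_call(query="What is my plan limit?", call_fn=llm_call)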

Wrap your existing LLM endpoint without changing any business logic.

from fastapi import FastAPI, Request
from prism.cache import PrismCache

app = FastAPI()
cache = PrismCache.build(tenant_id="api", llm_model="gpt-4o")

@app.post("/chat")
async def chat(request: Request):
    body = await request.json()
    question = body["message"]
    answer = await cache.aget_or_call(
        query=question,
        call_fn=lambda: llm_client.ask(question),
    )
    return {"answer": answer}

The same pattern as Django middleware:

from prism.cache import PrismCache

_cache = PrismCache.build(tenant_id="django-app", llm_model="gpt-4o")

class PrismCacheMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        return self.get_response(request)

    def process_llm_query(self, question: str, call_fn) -> str:
        return _cache.get_or_call(query=question, call_fn=call_fn)

Batch processing with asyncio:

import asyncio
from prism.cache import PrismCache

cache = PrismCache.build(tenant_id="batch", llm_model="gpt-4o-mini")

async def process_batch(questions: list[str]) -> list[str]:
    tasks = [
        cache.aget_or_call(query=q, call_fn=lambda q=q: llm_call(q))
        for q in questions
    ]
    return await asyncio.gather(*tasks)

Cache metrics:

from prism.cache import PrismCache

cache = PrismCache.build(tenant_id="finance", llm_model="gpt-4o")

metrics = cache.metrics()
print(f"Hit rate:          {metrics.hit_rate:.0%}")
print(f"Tokens saved:      {metrics.tokens_saved:,}")
print(f"Cost saved today:  ${metrics.cost_saved_usd:.2f}")
print(f"Projected monthly: ${metrics.cost_saved_usd * 30:.0f}")

PrismDriver has two components that work together. Install each on the right machine.

On the DB node — Server Wrapper

The Server Wrapper is an OS daemon that sits next to your database. It reads WAL/binlog changes, vectorizes rows using RowVectorizer, encrypts them with TensorCipher (via CHORUS Fabric), and streams float32 frames to every connected DLL Driver.

pip install "prismlib[wrapper]"

prism-wrapper --config /etc/prism/wrapper.toml

Example /etc/prism/wrapper.toml:

[database]
flavor = "postgresql"
dsn = "postgresql://user:pass@localhost/mydb"

[chorus]
listen_port = 50051
tenant_id = "products-service"

Supported databases: PostgreSQL (WAL / wal2json), MySQL (binlog), CockroachDB (EXPERIMENTAL CHANGEFEED), TiDB (push model).

On the app node — DLL Driver

The DLL Driver is an in-process library that replaces your DB connection string. On startup it connects to the Server Wrapper, subscribes to the CHORUS Fabric stream, and keeps a local PrismResonance index warm. All reads hit the in-process index — no network round-trip, sub-millisecond latency.

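pip install "prismlib[fabric]"

# Before: every read goes over the network to the database
import psycopg2
conn = psycopg2.connect("postgresql://user:secret@db-host:5432/mydb")

# After: reads are served from the driver's in-process index
import asyncio
from prism.ffi import PrismDriver, DriverConfig

async def main():
    async with PrismDriver(DriverConfig(wrapper_host="db-proxy-1")) as driver:
        results = await driver.query(
            embedding=my_embedding_vector,  # your float32 query embedding
            top_k=5,
            threshold=0.85,
        )

asyncio.run(main())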

The driver keeps a local PrismResonance cache warm via a background WAL subscription. Reads never touch the DB — they hit the in-process float32 index.
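The warm-replica idea can be sketched with nothing but asyncio: a background task applies change events to an in-memory structure, and reads are answered from that structure. This is a simplified illustration, not PrismDriver's code; `LocalReplica`, the event dictionaries, and the `None` end-of-stream sentinel are assumptions, and a plain dict stands in for the vector index.

```python
import asyncio


class LocalReplica:
    """Hypothetical in-process replica fed by a change stream."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}

    async def subscribe(self, changes: asyncio.Queue) -> None:
        # Apply change events until a None sentinel closes the stream.
        while (event := await changes.get()) is not None:
            if event["op"] == "delete":
                self.rows.pop(event["row_id"], None)
            else:  # insert or update
                self.rows[event["row_id"]] = event["data"]

    def read(self, row_id: str):
        # Served from process memory; no network round-trip.
        return self.rows.get(row_id)


async def demo() -> LocalReplica:
    changes: asyncio.Queue = asyncio.Queue()
    replica = LocalReplica()
    task = asyncio.create_task(replica.subscribe(changes))
    await changes.put({"op": "insert", "row_id": "product-42", "data": {"name": "Widget Pro"}})
    await changes.put({"op": "insert", "row_id": "product-7", "data": {"name": "Gadget"}})
    await changes.put({"op": "delete", "row_id": "product-7"})
    await changes.put(None)  # end of stream
    await task
    return replica


replica = asyncio.run(demo())
```

After the stream drains, `replica.read("product-42")` returns the row locally and `replica.read("product-7")` returns `None`, since the delete was applied.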

from prism.ffi import PrismDriver, DriverConfig
import numpy as np

config = DriverConfig(
    wrapper_host="10.0.1.50",
    wrapper_port=50051,
    tenant_id="products-service",
)

async with PrismDriver(config) as driver:
    query_vec = np.array([...], dtype=np.float32)
    matches = await driver.query(embedding=query_vec, top_k=10)
    for m in matches:
        print(f"{m.row_id}  score={m.score:.3f}  {m.text_repr}")

Writing a row through the driver:

async with PrismDriver(config) as driver:
    ack = await driver.write(
        row_id="product-42",
        data={"name": "Widget Pro", "price": 29.99, "stock": 150},
    )
    print(f"Written: event_id={ack.event_id}")

The same query from Go, C#, and PHP:

// Go
import prism "github.com/insightitsGit/prismlib/go"

driver, _ := prism.Connect("db-proxy-1:50051", "my-tenant")
defer driver.Close()
results, _ := driver.Query(embedding, prism.QueryOpts{TopK: 5, Threshold: 0.85})

// C#
using InsightIts.Prism;

await using var driver = new PrismDriver("db-proxy-1:50051", tenantId: "my-tenant");
await driver.ConnectAsync();
var results = await driver.QueryAsync(embedding, topK: 5, threshold: 0.85f);

// PHP 8.0+
$driver = new PrismDriver('db-proxy-1', 50051, 'my-tenant');
$driver->connect();
$results = $driver->query($embedding, topK: 5, threshold: 0.85);

┌─ DB Node ──────────────────────────────────────────────────────┐
│  PostgreSQL / MySQL / CockroachDB / TiDB                       │
│       │ WAL / binlog / changefeed                              │
│  ┌────▼───────────────────────────────────────────────────┐    │
│  │  prism-wrapper  (pip install "prismlib[wrapper]")      │    │
│  │  RowVectorizer → TensorCipher (V_enc = V @ K)         │    │
│  │  → HMAC-SHA256 watermark → CHORUSPublisher            │    │
│  └────────────────────────┬───────────────────────────────┘    │
└───────────────────────────┼────────────────────────────────────┘
                            │  CHORUS Fabric (gRPC, encrypted float32)
┌─ App Node — GREEN ────────┼────────────────────────────────────┐
│  ┌────────────────────────▼──────────────────────────────┐     │
│  │  PrismDriver DLL  (pip install "prismlib[fabric]")    │     │
│  │  Subscribe loop → decrypt → PrismResonance index      │     │
│  └──────────────────────────┬────────────────────────────┘     │
│                             │ sub-ms query                     │
│  ┌──────────────────────────▼────────────────────────────┐     │
│  │  Your Application                                      │     │
│  │  ┌─────────────────┐   ┌──────────────────────────┐   │     │
│  │  │  PrismCache     │   │  PrismDriver             │   │     │
│  │  │  LLM cache      │   │  local PrismResonance    │   │     │
│  │  │  [cache]        │   │  (no DB round-trip)      │   │     │
│  │  └─────────────────┘   └──────────────────────────┘   │     │
│  │  ┌──────────────────────────────────────────────────┐  │     │
│  │  │  ClusterCache  ← TOKEN_SYNC frames               │  │     │
│  │  │  AlertManager  ← HEALTH / SIGNAL frames          │  │     │
│  │  └──────────────────────────────────────────────────┘  │     │
│  └────────────────────────────────────────────────────────┘     │
└──────────────────────────────┬─────────────────────────────────┘
                               │  CHORUS mesh
          ┌────────────────────┴────────────────────┐
          │  TOKEN_SYNC · HEALTH · SIGNAL · CONFIG   │
          ▼                                          ▼
┌─ App Node — BLUE ──────┐           ┌─ App Node — ORANGE ─────┐
│  ClusterCache          │           │  ClusterCache            │
│  (warm standby)        │           │  (syncing reserve)       │
│  auto-promotes if      │           │  separate network        │
│  GREEN silent >3s      │           │                          │
└────────────────────────┘           └──────────────────────────┘

Live results from Azure Container App (westus2, 1 vCPU / 2 GiB, mock LLM baseline):

Scenario | Users | Duration | Hit rate | Queries | Tokens saved | Monthly est.
Light | 20 | 60s | 91.0% | 5,936 | 1,374,464 | $594
Mixed | 50 | 300s | 95.9% | 6,973 | 1,673,216 | $723

Numbers use a mock LLM (80ms sleep). With real GPT-4o calls (1–3s), latency speedup is 4–13×; token savings are identical.

Live two-node benchmark (Azure Container Apps westus2, 30 users × 60s per phase):

Phase | Path | Avg latency | Queries
Baseline (no driver) | App → DB node, network | 142.8 ms | 3,864
Driver (local index) | App → in-process PrismResonance | 2.0 ms | 1,479

70.7× faster · 98.6% latency reduction
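For readers who want to check the headline figures, the arithmetic is a ratio and a relative difference of the two average latencies. The short calculation below uses the rounded values from the table; with those inputs the speedup comes out near 71×, so the reported 70.7× presumably reflects the unrounded measurements.

```python
baseline_ms = 142.8  # avg latency, App -> DB node over the network
driver_ms = 2.0      # avg latency, in-process index

speedup = baseline_ms / driver_ms
reduction_pct = (1 - driver_ms / baseline_ms) * 100

print(f"speedup: {speedup:.1f}x, latency reduction: {reduction_pct:.1f}%")
```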

The 98.6% reduction is a direct result of CHORUS Fabric doing its job. The subscription loop streamed 11,000 rows at 26,000 rows/s from the DB node into the local PrismResonance index before the load test began. By the time the first /driver/query hit arrived, there were zero network hops: the answer was already in-process. This is what CHORUS Fabric was designed for: getting tensor data to where the query is, before the query arrives.

python benchmark/load/run_driver_benchmark.py \
  --app-url https://prism-benchmark.nicestone-720c6a9b.westus2.azurecontainerapps.io \
  --db-url  https://prism-wrapper-sim.nicestone-720c6a9b.westus2.azurecontainerapps.io \
  --users 30 --duration 60

python benchmark/load/run_benchmark.py \
  --host https://prism-benchmark.nicestone-720c6a9b.westus2.azurecontainerapps.io \
  --scenario mixed

See benchmark/ for full results JSON, Locust CSV files, and the Azure deploy script.

PrismLib is built on two InsightIts open-source libraries. You can use them directly if you need lower-level access.

PrismResonance · github.com/insightitsGit/prismresonance

pip install prismresonance

The wave-memory similarity engine. Every cache lookup and local vector index in PrismLib goes through PrismResonance.

How it works:

  • Receives a float32 embedding vector
  • Johnson-Lindenstrauss reduces it to 64 dimensions using a projection matrix seeded by SHA-256(tenant_id); this is what gives each tenant a mathematically isolated address space
  • Computes similarity as wave interference (cosine in projected space) in three lock-free phases: snapshot → ONNX MatMul → rank
  • Returns ranked candidates in sub-millisecond time entirely in-process
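The projection step above can be sketched in plain Python. This is an illustration of a tenant-seeded Johnson-Lindenstrauss projection under stated assumptions, not PrismResonance's code: the function names, the use of the full SHA-256 digest as an RNG seed, and the Gaussian matrix with 1/sqrt(out_dim) scaling are choices made for the example.

```python
import hashlib
import math
import random


def tenant_projection(tenant_id: str, in_dim: int, out_dim: int = 64) -> list[list[float]]:
    """Gaussian random projection matrix derived deterministically from the tenant ID."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest, "big"))  # assumption: digest as seed
    scale = 1.0 / math.sqrt(out_dim)  # keeps expected squared norms unchanged
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(out_dim)] for _ in range(in_dim)]


def project(vec: list[float], matrix: list[list[float]]) -> list[float]:
    """Row vector times matrix: maps in_dim values to out_dim values."""
    out_dim = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(out_dim)]


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


proj_acme = tenant_projection("acme-corp", in_dim=128)
proj_globex = tenant_projection("globex-inc", in_dim=128)

src = random.Random(0)
vec = [src.gauss(0.0, 1.0) for _ in range(128)]
near = [x + 0.05 * src.gauss(0.0, 1.0) for x in vec]  # a close neighbour of vec

same_tenant = cosine(project(vec, proj_acme), project(near, proj_acme))
cross_tenant = cosine(project(vec, proj_acme), project(vec, proj_globex))
```

Within one tenant, near neighbours stay close after projection (`same_tenant` is close to 1). The same vector projected through two different tenants' matrices lands in effectively unrelated directions (`cross_tenant` is near 0), which is the sense in which a lookup in one tenant's space does not match entries from another.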

PrismCache wraps this for LLM response caching. PrismDriver's local replica is a PrismResonance index kept warm by WAL streaming.

from prismresonance import PrismProjector, WaveIndex

projector = PrismProjector(dim=64, tenant_id="my-tenant")
index = WaveIndex(projector)

index.add(vector=my_embedding, payload={"row_id": "product-1", "text": "Widget"})
results = index.query(vector=query_embedding, top_k=5, threshold=0.85)

CHORUS Fabric · github.com/insightitsGit/chorus_fabric

pip install chorus-fabric

The secure gRPC binary streaming protocol for machine-to-machine tensor communication. PrismDriver uses CHORUS Fabric as its transport layer between the server wrapper on the DB node and the DLL driver on the app node.

How it works:

  • prism-wrapper (DB node) vectorizes WAL row events via RowVectorizer, encrypts them with TensorCipher (V_enc = V @ K), appends an HMAC-SHA256 watermark, and publishes batches of raw float32 frames
  • PrismDriver (app node) opens a persistent WrapperService.Subscribe() gRPC stream, receives encrypted frames, decrypts, and feeds them into the local PrismResonance index
  • Transport is pure binary float32 over gRPC server-streaming: no JSON serialization, no REST overhead
  • The WrapperService proto also exposes Query, Write, Health, and Hello RPCs for direct interaction
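The encrypt, watermark, and decrypt steps above can be sketched with the standard library. This illustrates only the algebra of V_enc = V @ K plus an HMAC-SHA256 tag and is not CHORUS Fabric's implementation: the orthogonal key built by Gram-Schmidt, the frame layout (float32 payload followed by a 32-byte tag), and every function name are assumptions.

```python
import hashlib
import hmac
import random
import struct


def key_matrix(secret: bytes, dim: int) -> list[list[float]]:
    """Orthogonal K from Gram-Schmidt on seeded Gaussian rows (assumed construction)."""
    rng = random.Random(int.from_bytes(hashlib.sha256(secret).digest(), "big"))
    rows: list[list[float]] = []
    while len(rows) < dim:
        r = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for q in rows:  # remove components along earlier rows
            dot = sum(a * b for a, b in zip(r, q))
            r = [a - dot * b for a, b in zip(r, q)]
        norm = sum(a * a for a in r) ** 0.5
        if norm > 1e-9:
            rows.append([a / norm for a in r])
    return rows


def matvec(v, m):
    """Row vector times matrix: v @ m."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def seal(v, k, mac_key: bytes) -> bytes:
    enc = matvec(v, k)                            # V_enc = V @ K
    payload = struct.pack(f"<{len(enc)}f", *enc)  # raw little-endian float32
    tag = hmac.new(mac_key, payload, hashlib.sha256).digest()
    return payload + tag


def open_frame(frame: bytes, k, mac_key: bytes) -> list[float]:
    payload, tag = frame[:-32], frame[-32:]
    expected = hmac.new(mac_key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("watermark mismatch")
    enc = struct.unpack(f"<{len(payload) // 4}f", payload)
    return matvec(enc, transpose(k))              # V = V_enc @ K^T for orthogonal K


K = key_matrix(b"shared-secret", dim=8)
vector = [0.5, -1.0, 0.25, 2.0, 0.0, -0.75, 1.5, 0.1]
frame = seal(vector, K, mac_key=b"mac-key")
recovered = open_frame(frame, K, mac_key=b"mac-key")
```

Because K is orthogonal, its transpose is its inverse, so the receiver recovers the vector up to float32 rounding. A frame altered in transit fails the tag check before any decryption happens.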

from chorus_fabric import CHORUSPublisher, DriverEndpoint

publisher = CHORUSPublisher(config)
publisher.add_driver(DriverEndpoint(host="10.0.1.50", port=50051, tenant_id="prod"))
await publisher.run(event_queue)  # streams WAL events to all connected drivers

CHORUS Fabric is the same protocol used in the CHORUS M2M system — InsightIts' 4-container gRPC topology for tensor communication between AI agents. The 98.6% latency reduction in the PrismDriver benchmark is direct proof that the protocol works at production scale: 11,000 rows streamed at 26,000 rows/s across Azure inter-container networking, then served locally at 2ms.

PrismLib Micro is the cluster layer built into prismlib[fabric]. It adds the components below on top of the single-node stack, with no extra install and no extra infra.

Component | What it does
ClusterCache | Shares LLM answers across all nodes via CHORUS TOKEN_SYNC frames. Once any node answers a query, every other node serves it for 0 tokens.
AlertManager | Broadcasts health alerts as SIGNAL frames + admin email the moment CPU/RAM/disk/latency thresholds are crossed. No Prometheus. No Datadog.
Blue/Green/Orange failover | Three-tier hot-standby: GREEN (active), BLUE (warm standby, auto-promotes in ~3s), ORANGE (syncing reserve). No Raft dependency. No K8s operator.
ContextCompressor | Ranks RAG context chunks by cosine similarity, keeps top-K. Saves 58–64% of context tokens before every LLM call. In-process, no extra model.

Metric | Result
Token savings (cluster avg) | 76.1%
BLUE node (cluster cache hit) | 100% (0 LLM calls)
ORANGE node (cross-network cache hit) | 100% (0 LLM calls)
Context compression | 58–64% per query
Health alert propagation | <1 s (709–711 ms measured)
Failover (BLUE promoted to GREEN) | ~3–4 s, no human step

See benchmark/cluster/ for the full benchmark code and benchmark/cluster/cluster_benchmark_results.json for raw results.
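The failover rule (BLUE takes over when GREEN goes silent) boils down to a heartbeat timeout. The sketch below shows that rule in isolation and is not PrismLib Micro's implementation: the `Node` dataclass, the `promote_if_silent` helper, the 3-second default, and the "DOWN" label for the failed master are all hypothetical.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    role: str  # "GREEN", "BLUE", or "ORANGE"
    last_heartbeat: float = field(default_factory=time.monotonic)


def promote_if_silent(nodes: list[Node], now: float, timeout_s: float = 3.0) -> list[Node]:
    """Promote BLUE to GREEN when the active GREEN has been silent longer than timeout_s."""
    green = next((n for n in nodes if n.role == "GREEN"), None)
    blue = next((n for n in nodes if n.role == "BLUE"), None)
    if green and blue and now - green.last_heartbeat > timeout_s:
        green.role = "DOWN"   # hypothetical label for the failed master
        blue.role = "GREEN"   # warm standby becomes the active master
    return nodes


cluster = [
    Node("node-a", "GREEN", last_heartbeat=0.0),
    Node("node-b", "BLUE", last_heartbeat=0.0),
    Node("node-c", "ORANGE", last_heartbeat=0.0),
]

promote_if_silent(cluster, now=2.0)   # GREEN silent for 2 s: nothing changes
roles_before = [n.role for n in cluster]
promote_if_silent(cluster, now=3.5)   # GREEN silent for 3.5 s: BLUE takes over
roles_after = [n.role for n in cluster]
```

Passing `now` explicitly keeps the rule deterministic and easy to test; a live system would feed it a monotonic clock and refresh `last_heartbeat` on every health frame.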

from prism.cluster.cache import ClusterCache

cache = ClusterCache(node_id="node-1", fabric=chorus_fabric)

answer = await cache.get_or_call(
    query          = user_question,
    query_vector   = embed(user_question),
    call_fn        = lambda: llm.complete(user_question),
    context_chunks = retrieved_docs,    # your RAG chunks
    chunk_vectors  = doc_embeddings,    # their vectors
)

Drop this in front of your existing retrieve → generate step. No changes to retrieval logic, no changes to your LLM client.
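The context-compression step (rank chunks by cosine similarity to the query, keep the top K) can be sketched in a few lines. This is an illustration rather than ContextCompressor's code: `top_k_chunks`, the toy two-dimensional vectors, and the choice to return survivors in their original order are assumptions.

```python
import math


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec, chunks, chunk_vecs, k=3):
    """Return the k chunks most similar to the query, kept in original document order."""
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    keep = sorted(ranked[:k])  # restore original order among the survivors
    return [chunks[i] for i in keep]


chunks = ["pricing", "weather", "limits", "sports"]
chunk_vecs = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.3], [-1.0, 0.2]]
selected = top_k_chunks([1.0, 0.0], chunks, chunk_vecs, k=2)
```

Here the two chunks closest to the query direction survive and the other two are dropped, so only half of the retrieved context would be sent to the LLM.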

import os

from prism.cluster.alerts import AlertManager, SMTPConfig

alerts = AlertManager(
    fabric = chorus_fabric,
    mail_config = SMTPConfig(
        host="smtp.gmail.com", port=587,
        username="you@gmail.com",
        password=os.getenv("GMAIL_APP_PASS"),
        recipients=["admin@yourcompany.com"],
    ),
)
await alerts.evaluate_health(health_snapshot)

Capability | PrismLib Micro | Prometheus + Alertmanager | Redis cluster | Raft / etcd
Cross-node token cache | Yes, built-in | No | Manual (exact match) | No
Alert propagation | <1 s, no infra | 30–60 s, stack needed | No | No
Auto failover | ~3–4 s, built-in | No | Sentinel, 2–30 s | 150–500 ms
Context compression | 58–64%, free | No | No | No
Extra infrastructure | None | Prometheus stack | Redis cluster | etcd cluster

Tier | Nodes | Price | Includes
Open source | Unlimited | Free forever | All cluster code, Apache 2.0
ChorusMesh Developer (coming soon) | Up to 3 | $29/mo after 30-day trial | ClusterCache + failover + AlertManager
ChorusMesh Team | Up to 10 | $149/mo | + Raft consensus, message broker adapters
ChorusMesh Business | Up to 50 | $499/mo | + multi-region routing, SLA 99.9%
Enterprise | Unlimited | Contact us | + air-gap, compliance, dedicated Slack

For enterprise agreements: insightits.info@gmail.com

PrismLib is open source (Apache 2.0) and free to use. If your team needs any of the following, contact us for enterprise pricing:

  • On-premises deployment support: air-gapped installs, hardened Docker images, SOC 2 documentation
  • SLA-backed support: guaranteed response times, incident escalation, dedicated Slack channel
  • Custom embedding model integration: fine-tuned domain-specific embedders for higher hit rates in specialized domains (legal, medical, finance, code)
  • Multi-region CHORUS Fabric topology: active-active DB node clusters, cross-region WAL fan-out, geo-aware driver routing
  • Audit logging and compliance exports: per-query access logs, tenant isolation attestation reports, GDPR data lineage
  • Professional services: architecture review, migration from Redis/GPTCache, custom RowVectorizer schemas

Contact: insightits.info@gmail.com

GitHub: github.com/insightitsGit/prismlib

PrismLib is free and will stay free. If it saved your team money on OpenAI bills or database infrastructure, consider sponsoring. Sponsorship covers benchmark compute and maintenance time, and keeps development moving.

Your name or logo here — become a sponsor

It is one package, prismlib, published once. The wrapper, driver, and cache are all extras of the same package. Users install what they need:

pip install "prismlib[cache]"           # PrismCache only
pip install "prismlib[wrapper]"         # Server Wrapper (DB node)
pip install "prismlib[fabric]"          # DLL Driver (App node)
pip install "prismlib[all]"             # Everything

To publish a new version:

pip install build twine
python -m build

python -m twine upload dist/* --username __token__ --password pypi-YOUR_TOKEN

That's it. One upload covers every install variant; pip resolves the extras at install time.

Apache 2.0 — InsightIts © 2026
