Seeking arXiv cs.AI endorsement – Independent Researcher


Source: discuss.huggingface.co · Published: 2026-07-03T21:09Z

Independent researcher developed ZATRON, a system that encrypts semantic search embeddings while preserving 98% search quality, preventing attackers from clustering documents by topic. Tested on MSMARCO with 626,906 documents, it outperforms quantization methods and is 8x faster than fully homomorphic encryption, addressing privacy compliance under EU AI Act and GDPR.

What happens when you hide embeddings but keep search working?

I spent the last few months building a system that does something counterintuitive: it takes semantic search embeddings, makes them completely unreadable, and somehow search still works at 98% quality.

Here’s what that looks like.

Every company using semantic search has a dirty secret: their vector database is a map of their entire document collection’s meaning.

Embeddings cluster by topic. If someone gets access to your vector database — a breach, an insider, a subpoena — they don’t need to read a single document. They can cluster the embeddings and immediately see: these 500 documents are about cancer patients, these 200 are about ongoing litigation, these 100 are salary records.

No decryption needed. The structure IS the leak.
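
The clustering attack described above can be sketched in a few lines. This is a toy illustration, not code from the original post: it builds synthetic 8-dimensional vectors around three hypothetical topic centers, runs a plain k-means that never sees the labels, and measures how pure the recovered clusters are. The names `make_embeddings`, `kmeans`, and `purity`, and all of the data, are assumptions made for the example.

```python
import math
import random
from collections import Counter

def make_embeddings(rng, centers, per_topic, noise=0.05):
    """Synthetic stand-in for a vector database: points scattered around topic centers."""
    vectors, labels = [], []
    for topic, center in enumerate(centers):
        for _ in range(per_topic):
            vectors.append([c + rng.gauss(0, noise) for c in center])
            labels.append(topic)
    return vectors, labels

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iters=10):
    """Plain k-means; the attacker sees only the vectors, never the labels."""
    # Farthest-point initialisation keeps the run deterministic.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(max(vectors, key=lambda v: min(dist(v, c) for c in centroids)))
    assign = []
    for _ in range(iters):
        # Assign every vector to its nearest centroid, then move the centroids.
        assign = [min(range(k), key=lambda j: dist(v, centroids[j])) for v in vectors]
        for j in range(k):
            members = [v for v, a in zip(vectors, assign) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign

def purity(assign, labels):
    """Fraction of documents that share the majority topic of their cluster."""
    total = 0
    for cluster in set(assign):
        topics = Counter(l for a, l in zip(assign, labels) if a == cluster)
        total += topics.most_common(1)[0][1]
    return total / len(labels)

# Three hypothetical topics as one-hot directions in 8 dimensions.
centers = [[1.0 if i == t else 0.0 for i in range(8)] for t in range(3)]
vectors, labels = make_embeddings(random.Random(0), centers, per_topic=20)
recovered = kmeans(vectors, k=3)
print(f"cluster purity: {purity(recovered, labels):.2f}")
```

A purity close to 1 means the attacker has recovered the topic grouping without reading a single document, which is the leak the post describes.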

In the figure from the original post, look at the left panel. Dots of the same color are documents on the same topic. They cluster together, so an attacker immediately sees the structure. The right panel shows the same 50 documents after ZATRON processing: random noise, no clusters, no structure.

But here’s the thing: search returns the exact same results on both sides.

ZATRON (Zero-Access Transformed Retrieval Over Noise) transforms embeddings into modular barcodes. The process:

The key insight: modular arithmetic preserves distance relationships but destroys the original values. Two similar documents produce similar modular distances. But the individual barcodes look like random numbers.

Without the key, you can’t unmask them. With the key, you can compare them. You never reconstruct the original embedding.
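
The post does not spell out the transformation, so the following is only a hypothetical sketch of the general idea: quantize to integers, add a per-document mask derived from HMAC-SHA256, and work modulo M. It is not ZATRON’s actual construction and not a vetted security design. `MODULUS`, `SCALE`, and every function name here are assumptions made for illustration.

```python
import hashlib
import hmac

MODULUS = 2 ** 16   # hypothetical modulus M
SCALE = 1000        # hypothetical quantization scale for floats in [-1, 1]

def quantize(vec):
    """Map each float to an integer residue modulo MODULUS."""
    return [round(x * SCALE) % MODULUS for x in vec]

def mask(key, doc_id, dim):
    """Derive one pseudorandom mask value per dimension from HMAC-SHA256."""
    out = []
    for i in range(dim):
        digest = hmac.new(key, f"{doc_id}:{i}".encode(), hashlib.sha256).digest()
        out.append(int.from_bytes(digest[:4], "big") % MODULUS)
    return out

def encode(key, doc_id, vec):
    """Produce the stored 'barcode': quantized values plus mask, modulo MODULUS."""
    return [(q + m) % MODULUS for q, m in zip(quantize(vec), mask(key, doc_id, len(vec)))]

def centered(x):
    """Map a residue in [0, MODULUS) to the signed range [-MODULUS/2, MODULUS/2)."""
    return x - MODULUS if x >= MODULUS // 2 else x

def keyed_distance(key, id_a, code_a, id_b, code_b):
    """Squared distance between the quantized vectors, from barcodes plus key only."""
    mask_a = mask(key, id_a, len(code_a))
    mask_b = mask(key, id_b, len(code_b))
    total = 0
    for ca, cb, ma, mb in zip(code_a, code_b, mask_a, mask_b):
        # Remove the mask difference from the barcode difference, all modulo MODULUS.
        diff = centered((ca - cb - (ma - mb)) % MODULUS)
        total += diff * diff
    return total

key = b"demo-key"  # hypothetical secret
a = encode(key, "doc-a", [0.1, -0.2, 0.3])
b = encode(key, "doc-b", [0.1, -0.1, 0.5])
print(keyed_distance(key, "doc-a", a, "doc-b", b))  # 50000
```

With the right key, the mask terms cancel and the result is the squared distance between the quantized vectors (here 0² + 100² + 200²). With a wrong key they do not cancel, so the computed value is unrelated to the true distance.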

I tested on real data, not toy examples.

MSMARCO passage retrieval — 626,906 real documents:

The system preserves 98.2% of cosine search quality. Across 500 queries, the encrypted system returns rankings nearly identical to those of unencrypted cosine search.
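
The post does not say which metric defines "search quality". One common way to compare two rankings, assumed here purely for illustration, is top-k overlap: the share of the baseline’s top-k documents that also appear in the encrypted system’s top-k. The function names and the toy scores below are hypothetical.

```python
def top_k(scores, k):
    """Return the ids of the k highest-scoring documents."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

def overlap_at_k(baseline_scores, encrypted_scores, k):
    """Fraction of the baseline top-k that the encrypted ranking also returns."""
    baseline = set(top_k(baseline_scores, k))
    encrypted = set(top_k(encrypted_scores, k))
    return len(baseline & encrypted) / k

def mean_overlap(per_query_pairs, k):
    """Average overlap@k over a list of (baseline_scores, encrypted_scores) pairs."""
    values = [overlap_at_k(base, enc, k) for base, enc in per_query_pairs]
    return sum(values) / len(values)

# Toy scores for a single query (made-up numbers).
baseline = {"d1": 0.90, "d2": 0.80, "d3": 0.70, "d4": 0.10}
encrypted = {"d1": 0.88, "d2": 0.81, "d3": 0.60, "d4": 0.75}

print(overlap_at_k(baseline, encrypted, k=2))  # 1.0
print(overlap_at_k(baseline, encrypted, k=3))
```

Averaging such a per-query score over all 500 queries would give a single percentage of the kind reported above, though the original figure may be based on a different metric.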

Three different embedding models:

MiniLM: 98.2%. MPNet: 99.2%. BGE: 86.6% (this model’s embedding distribution is less quantization-friendly; I report this honestly).

Five languages:

Arabic, Spanish, Korean, Chinese, English — all above 88%.

Comparison with existing methods:

Method	Quality	Encrypted?
Binary quantization	96.9%	No
Scalar int8	98.8%	No
Product quantization	97.9%	No
ZATRON	99.6%	Yes

Higher quality than every quantization method — and the only one that’s encrypted.
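
For context on why the quantization rows are marked "No" under "Encrypted?", here is a minimal sketch of two of the baselines in their textbook form. The post does not show how it implemented them, so the details (sign-based binary codes, a symmetric int8 range of -127 to 127) are assumptions.

```python
def binary_quantize(vec):
    """Keep only the sign of each coordinate: one bit per dimension."""
    return [1 if x > 0 else 0 for x in vec]

def hamming(a, b):
    """Number of positions where two bit vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def int8_quantize(vec, max_abs=1.0):
    """Linearly map floats in [-max_abs, max_abs] to integers in [-127, 127]."""
    out = []
    for x in vec:
        clipped = max(-max_abs, min(max_abs, x))
        out.append(round(clipped / max_abs * 127))
    return out

similar_a = [0.9, -0.4, 0.2, -0.7]
similar_b = [0.8, -0.3, 0.1, -0.6]
different = [-0.9, 0.4, -0.2, 0.7]

print(hamming(binary_quantize(similar_a), binary_quantize(similar_b)))  # 0
print(hamming(binary_quantize(similar_a), binary_quantize(different)))  # 4
print(int8_quantize([1.0, -1.0, 0.0]))  # [127, -127, 0]
```

Similar vectors still map to similar codes, so anyone holding the quantized database can cluster it just like the raw embeddings. Quantization compresses; it does not hide structure.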

I ran eight independent attacks against the system. ZATRON withstood all of them.

But the most convincing evidence is visual:

Left: raw embedding distances perfectly predict true document similarity (ρ = 1.00). An attacker with database access knows exactly which documents are related.

Right: ZATRON barcode distances show almost no correlation with true similarity (ρ = 0.09). The attacker gets nothing.
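
The post does not state which correlation coefficient ρ refers to. Assuming it is Spearman’s rank correlation, which is the usual meaning of ρ in this kind of plot, a dependency-free sketch looks like this. The function names are hypothetical.

```python
import math

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

true_similarity = [0.1, 0.4, 0.6, 0.9]
print(spearman(true_similarity, [1, 4, 9, 16]))   # monotone increasing: rho = 1
print(spearman(true_similarity, [16, 9, 4, 1]))   # monotone decreasing: rho = -1
```

To reproduce a plot like the one described, `x` would hold the true pairwise similarities and `y` the distances an attacker can compute from the stored vectors. A ρ near zero means the stored distances carry little rank information about the true ones.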

Fully homomorphic encryption (CKKS) can do encrypted search too. But on the same hardware (Google Colab, T4 GPU), CKKS takes 38.9 ms per comparison and ZATRON takes 5 ms. That’s roughly 8x faster (38.9 / 5 ≈ 7.8), using only integer arithmetic, with no GPU needed.

Both are computationally secure: CKKS under the Ring-LWE assumption, ZATRON under the assumption that HMAC-SHA256 behaves as a pseudorandom function (PRF). Different assumptions, both standard.

I want to be precise about what ZATRON is and isn’t:

I state these limitations explicitly because overselling helps nobody.

This matters for any organization that searches sensitive documents: hospitals (patient records), law firms (case files), financial institutions (client data), defense (classified documents). The EU AI Act and GDPR are making embedding privacy a compliance issue, not just a nice-to-have.

The system works. The patent is filed. I’m looking for technical feedback, especially from people building vector search infrastructure.

If you work on vector databases, privacy-preserving ML, or searchable encryption — I’d genuinely appreciate your thoughts. What did I miss? What would break it? What would make it useful?
