Kafka gives AI pipelines async decoupling, replay, and backpressure; Zilla adds JWT identity, schemas, access filtering, and SSE.
AI systems rarely fail in production because of the model.
More often, they fail because the infrastructure beneath them was designed for a completely different class of workload.
In production, AI workloads introduce variable latency, retries, concurrency spikes, backpressure, and multi-tenant access control problems that traditional synchronous systems struggle to model cleanly. The demo may work over HTTP request-response chains, but production is not a demo.
Production is thousands of users submitting queries simultaneously while the LLM takes eight seconds to respond. It is an embedding service hitting rate limits while ingestion traffic keeps arriving. It is a retried request accidentally creating duplicate embeddings in the vector database. It is enterprise users, standard-tier users, and free-tier users all querying the same system simultaneously while expecting access only to the information they are authorized to see.
None of those are model problems. They are infrastructure problems.
And infrastructure problems need infrastructure solutions.
AI Workloads Do Not Behave Like Traditional APIs #
A production RAG pipeline is not a single API call. It is a chain of asynchronous operations with different latency characteristics, throughput limits, and failure modes.
A document chunk arrives and needs to be embedded through an external API call. The embedding is stored in a vector database. A user query triggers another embedding request, followed by similarity search, context assembly, and an LLM inference step that may take several seconds to complete.
Critically, these stages are independent.
You need ingestion to continue even when embedding slows down. You need query processing isolated from document indexing load. You need retries without duplication. You need answers streamed back to the correct user without polling.
These are not merely performance optimizations. They are architectural requirements that event-driven systems express naturally, but synchronous request chains cannot model cleanly.
Why Kafka Fits AI Pipelines Naturally #
Kafka maps closely to the operational behavior AI systems require.
Decoupled Services
In a Kafka-based architecture, the ingestion service writes document chunks to a topic without needing to know which embedding model is running, how fast the vector database is responding, or whether downstream consumers are under load. The embedder consumes independently at its own pace. If the embedding model changes from text-embedding-3-small to a locally hosted alternative, nothing upstream changes.
That decoupling matters because AI systems evolve continuously.
Replayability
AI systems constantly regenerate derived state. If you upgrade your embedding model, you may need to re-embed the entire corpus. With Kafka, replaying the topic rebuilds the downstream state without reconstructing ingestion history. If a RAG pipeline crashes mid-processing, consumers resume from committed offsets instead of losing requests or silently dropping work.
The event log becomes both the transport layer and the system of record.
Structural Backpressure
LLMs and embedding APIs have hard throughput ceilings. In synchronous systems, slow inference propagates latency back through the request chain. Under load, this often turns into cascading failure.
Kafka changes the behavior fundamentally. Slow consumers accumulate lag instead of blocking producers. Traffic spikes become queues that drain at sustainable rates — which matters enormously in AI systems where latency is variable by design.
Independent Consumers
AI pipelines are not single-hop workflows. The same stream of document events may feed embedding services, classifiers, evaluation pipelines, monitoring systems, and audit consumers — each scaling independently without coupling itself to the others.
Kafka Is the Backbone, Not the Client Interface #
Kafka is an excellent event backbone. It is not, by itself, a client-facing API.
Your users still expect REST endpoints, JWT authentication, schema validation, streaming responses, tenant isolation, and browser compatibility. The naïve solution is to build a custom HTTP service in front of Kafka.
That works initially. But over time, every governance concern — authentication, identity propagation, schema enforcement, access control, rate limiting — becomes a conditional in application code, and every new tenant rule becomes another deployment. Governance spreads across services instead of living in one place, and downstream services must simply trust whatever identity the wrapper forwards.
That architecture becomes difficult to reason about because governance is no longer centralized.
Why Identity Propagation Becomes Critical in AI Systems #
Multi-tenant AI systems need more than authentication. They need trusted identity propagation across asynchronous workflows.
Consider a RAG system with multiple visibility tiers: free-tier users can access public knowledge, standard-tier users can access internal knowledge, and enterprise users can access confidential knowledge. The tier originates from a JWT presented at the API boundary. Downstream services need that identity information to filter retrieval results, determine generation context, and enforce delivery permissions.
Kafka itself does not validate JWTs or propagate trusted user identity into message headers. Without centralized governance, developers typically solve this by writing custom middleware that validates tokens and forwards metadata into Kafka — but now the trust boundary lives inside application code, and every downstream service depends on the correctness of that middleware implementation.
That is the gap Zilla closes.
How Zilla Closes the Gap #
Zilla Platform sits between clients and Kafka, speaking HTTP on one side and Kafka protocol on the other. Instead of embedding governance logic into application services, Zilla moves governance to the edge.
A request flow looks like this:
POST /queries
Authorization: Bearer <jwt>
→ Zilla validates JWT
→ extracts user tier claim
→ injects trusted Kafka headers
→ writes event to rag.queries
→ RAG pipeline consumes asynchronously→ result written to rag.results
→ client receives streamed response over SSE
The AI services themselves remain focused on AI logic rather than transport concerns.
Identity Injection at the Edge
When a client sends a JWT, Zilla validates the token and injects trusted identity headers into Kafka messages — for example, user-tier: enterprise. Downstream services consume the header directly. The embedder, retrieval layer, and RAG chain do not need to validate JWTs independently. The access decision is made once at the edge, and the proof of that decision travels with the event.
Schema Enforcement
Malformed payloads should fail at the boundary, not deep inside asynchronous processing pipelines. Zilla validates JSON schemas before events enter Kafka. A request missing a required doc_id, or a query where question is not a string, receives an immediate 400 response. Invalid events never reach the backbone.
Native Streaming Responses
AI systems are fundamentally asynchronous, but browser clients still expect real-time interaction. Zilla bridges this through Server-Sent Events: a client opens GET /results/{queryId}, Zilla subscribes to the Kafka results topic, and responses stream to the browser the moment they arrive — no polling infrastructure, no custom SSE service to write or operate.
Per-Subscriber Filtering
Multiple users may subscribe to the same results topic simultaneously. Zilla filters streamed events using the subscriber identity extracted from the JWT, so an enterprise user receives enterprise-tier results and a standard-tier user receives only what they are authorized to see. That enforcement happens at the gateway layer rather than inside every downstream service.
What the Architecture Looks Like in Practice: Demo #
The Zilla Platform RAG demo implements these patterns end to end. A single docker compose up starts Kafka, Qdrant, an embedding service, a RAG chain service, and Zilla — all configured through a single zilla.yaml.
The flow looks like this:
Client (JWT)
│
├── POST /chunks → Zilla validates JWT + schema → write to rag.chunks
├── POST /queries → Zilla injects user-tier header → write to rag.queries
└── GET /results → Zilla subscribes to rag.results → SSE to client
rag.chunks → Embedder → Qdrant
rag.queries → RAG Chain:
→ embed query
→ search Qdrant with visibility filter
→ call LLM
→ write result to rag.results
The access model is structural rather than application-defined. A free-tier user's query searches only public content, a standard-tier user reaches public and internal content, and an enterprise user reaches confidential content as well. The visibility tier originates from the JWT and propagates through the event stream as trusted metadata — no tier value ever originates from the client itself.
Run the Zilla Platform RAG demo at https://github.com/aklivity/zilla-platform-demos/tree/main/rag-project. The demo includes a browser interface, multi-tier JWT tokens, and a complete walkthrough of the architecture described above.
The Architecture You Do Not Have to Rebuild Later #
The core argument for event-driven AI infrastructure is not that it is more sophisticated. It is that it models the operational behavior AI systems already have.
When your embedding model changes, you replay the topic. When ingestion traffic spikes, consumers accumulate lag instead of collapsing the request path. When governance rules evolve, you update centralized policy rather than rewriting application logic. When compliance teams ask which user received which answer, the event log already contains the history.
Zilla compounds these advantages by centralizing governance at the edge — identity propagation, schema validation, rate limits, delivery filtering, streaming APIs. The governance layer remains stable even as the AI services behind it evolve.
Swap the LLM. Replace the vector database. Add new consumers. Replay historical data.
The boundary still holds.
To learn more about Zilla Platform and event-driven AI infrastructure, request demo.
Ready to Get Started? #
Get started on your own or request a demo with one of our data management experts.
Flexible pricing
Start for free and scale with flexible, deployment-based pricing.
Join the Community
Ask, engage, and contribute alongside fellow data practitioners.