Apache Kafka & The Age of GenAI

Apache Kafka, originally built at LinkedIn in 2010 to solve data integration challenges, has become the default event streaming backbone for production generative AI systems by 2025, enabling durable, replayable, and scalable data pipelines that decouple producers and consumers.

From a distributed log invented at LinkedIn to the nervous system powering some of the world’s most demanding AI applications: the story of Apache Kafka, and why it became the default backbone for production GenAI systems.

In 2010, engineers at LinkedIn faced a crisis that most fast-growing tech companies quietly face: their services were talking to each other in every possible direction, and the whole system had become a fragile, tangled mess. Every new feature required wiring up yet another point-to-point connection. When one service was slow, everything behind it backed up. Data was duplicating, going missing, and arriving out of order.

They built something to solve it. They called it Kafka, after the author Franz Kafka: one of its original developers described it as “a system optimised for writing,” so a writer’s name seemed apt. In 2011, LinkedIn open-sourced it. By 2025, it had become one of the most widely deployed event streaming platforms, and foundational infrastructure for many serious GenAI production systems.

Kafka does not simply move data from A to B. It creates a central, durable, replayable record of everything that happens in a system, and lets any service read from it, at any time, at any pace.

The Newspaper Analogy

Forget distributed systems jargon for a moment. Think about how a newspaper works. Journalists write stories. The printing press publishes them into sections: Sports, Finance, World News. Readers subscribe to the sections they care about. They read at their own pace. A reader who picks up Monday’s paper on Wednesday still gets Monday’s news; the paper didn’t disappear just because someone else already read it.

Kafka is exactly this, but for software events. Services produce events. Those events go into named topics. Other services consume from those topics, at their own pace, without disrupting producers or each other.
And like that newspaper archive, messages don’t vanish after being read. They are retained for a configurable period (commonly seven days), allowing any consumer to replay history whenever needed.

Why Kafka Isn’t Just a Queue

When engineers first encounter Kafka, they often make a category error: they assume it’s a message queue, like RabbitMQ or Amazon SQS. It isn’t, and the difference matters enormously. In a traditional queue, a message is typically delivered to one consumer and removed once it is acknowledged. In Kafka, the message stays in the log, and every consumer group reads it independently.

This distinction becomes critical in GenAI. When an LLM produces a response, you might need to log it, route it to a websocket server, run it through a safety filter, store it in a database, and feed it to an analytics pipeline, all simultaneously and all independently. With plain point-to-point queues, that typically means five separate queues and five separate publishes. Kafka requires one topic and five consumer groups.

A Kafka cluster is a group of broker nodes: servers that store and serve event data. Each topic is divided into partitions, and those partitions are distributed across brokers. This is the key to Kafka’s horizontal scalability: adding more brokers and partitions lets you spread load across machines, and throughput grows roughly linearly as you do.

// Kafka Cluster: Logical View
┌─────────────────────────────────────────────────────────────┐
│                        KAFKA CLUSTER                        │
│                                                             │
│  ┌───────────────┐   ┌───────────────┐   ┌───────────────┐  │
│  │   Broker 1    │   │   Broker 2    │   │   Broker 3    │  │
│  │               │   │               │   │               │  │
│  │  Topic-A P0   │   │  Topic-A P1   │   │  Topic-A P2   │  │
│  │    LEADER     │   │    LEADER     │   │    LEADER     │  │
│  │               │   │               │   │               │  │
│  │  Topic-A P1   │   │  Topic-A P2   │   │  Topic-A P0   │  │
│  │    REPLICA    │   │    REPLICA    │   │    REPLICA    │  │
│  └───────────────┘   └───────────────┘   └───────────────┘  │
│                                                             │
│                     Controller (KRaft)                      │
│             manages metadata & leader election              │
└─────────────────────────────────────────────────────────────┘
          ↑ Producers write            ↓ Consumers read

Each partition is an ordered, immutable log. New events are appended to the end. Consumers read through the log sequentially, tracking their position with an offset number. The critical rule: messages within a partition are always ordered.
Messages across partitions are not. This matters in GenAI when, for example, you need all messages from the same user’s conversation to arrive in the correct sequence. The solution is to use the user’s ID as the message key. Kafka’s default partitioner hashes the key, so all messages with the same key land on the same partition (as long as the partition count does not change), preserving order per user while still distributing load across partitions.

Every partition has one leader and one or more replicas. Producers always write to the leader. Replicas continuously sync from the leader. If a broker fails, one of the in-sync replicas is automatically elected as the new leader, with no loss of acknowledged data if configured correctly.

Production Rule
Always run with replication factor 3 and min in-sync replicas 2, and have producers use acks=all. This means your data survives the loss of any single broker. If you are down to two in-sync replicas and one of them dies, writes safely stop rather than accepting data that can’t be replicated.

Consumer groups are Kafka’s elegant solution to the problem of parallel processing. When multiple consumer instances belong to the same group, Kafka automatically divides the partitions between them, so no two consumers in the same group read the same partition at the same time. Add more consumers, and Kafka rebalances the workload automatically. The partition count is the ceiling on parallelism: consumers beyond that number sit idle.

This makes scaling your LLM worker pool simple: start more workers in the same consumer group, and Kafka will distribute the inference queue across all of them without any manual routing logic.

Large Language Models are extraordinary at what they do, and terrible at the things distributed systems need to be good at. An LLM call can take anywhere from one to thirty seconds. It costs real money. It might fail. The service behind it can’t handle a thousand simultaneous synchronous requests without collapsing. This is precisely where Kafka becomes indispensable.
Instead of hammering an LLM API directly and synchronously from your application layer, you model the system as events: a request arrives, gets written to Kafka, and a pool of workers consumes and processes those requests at a controlled rate. The requester gets an acknowledgment immediately, and the actual LLM response is delivered asynchronously when it’s ready.

“Kafka turns your LLM API from a fragile synchronous bottleneck into a resilient, observable, scalable pipeline.”
// The fundamental shift in AI infrastructure architecture

The GenAI Event Map

Every mature GenAI production system resolves into a handful of recurring patterns. Kafka is the connective tissue between all of them.

// GenAI Event Pipeline: Full Architecture
User / Application
        │
        ▼
   API Gateway
        │
        ├──► ai-inference-requests (topic) ──► LLM Worker Pool (N workers)
        │                                          │  Claude / GPT / Llama
        │                                          │
        │                                    ai-inference-results
        │                                          │
        │                                    WebSocket Handler (streams to user)
        │
        ├──► document-ingestion ──► Chunker Workers
        │                               │
        │                         document-chunks
        │                               │
        │                         Embedding Workers
        │                               │
        │                         Vector Database (Pinecone / Weaviate / pgvector)
        │
        ├──► agent-tasks ──► Planner Agent
        │                         │
        │           ┌─────────────┼─────────────┐
        │           ▼             ▼             ▼
        │       Research        Coder       Reviewer
        │        Agent          Agent         Agent
        │           └─────────────┼─────────────┘
        │                         ▼
        │                   agent-results
        │
        └──► ai-metrics ──► Observability Stack (Prometheus / Grafana)

Pattern 1: The LLM Request Queue

The most common and immediately valuable pattern. Every inference request is treated as an event. The application produces it to a Kafka topic and returns a request ID to the caller. A pool of LLM worker processes consumes from that topic, calls the model, and publishes the result to a response topic. A websocket or polling mechanism delivers the result back to the user.
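The request flow just described can be modeled in a few lines of Python. This is a toy in-memory stand-in, not Kafka client code: `InMemoryTopic`, `submit_request`, `run_worker`, `fake_llm`, and the `status` dictionary (playing the role of a status store) are hypothetical names for illustration, and CRC32 replaces Kafka's real key hashing. It only aims to show the shape of the pattern: the gateway acknowledges immediately, records with the same key stay in order on one partition, and a worker advances its offset only after each record has been handled or dead-lettered.

```python
import uuid
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # assumption for this sketch


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition. CRC32 is only a deterministic stand-in
    for the hash a real Kafka partitioner would use."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class InMemoryTopic:
    """Toy topic: one append-only list per partition."""

    def __init__(self, num_partitions: int = NUM_PARTITIONS):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key: str, payload: dict) -> tuple[int, int]:
        p = partition_for(key, len(self.partitions))
        self.partitions[p].append(payload)
        return p, len(self.partitions[p]) - 1  # (partition, offset)


requests = InMemoryTopic()
results = InMemoryTopic()
dlq = InMemoryTopic()
status = {}  # stand-in for an external status store


def submit_request(user_id: str, prompt: str) -> dict:
    """Gateway side: enqueue the request and return immediately."""
    request_id = str(uuid.uuid4())
    status[request_id] = "queued"
    requests.produce(
        user_id, {"request_id": request_id, "user_id": user_id, "prompt": prompt}
    )
    return {"request_id": request_id, "status": "queued"}


def run_worker(partition: int, committed: dict, call_llm) -> None:
    """Worker side: read one partition in order, commit after each record."""
    log = requests.partitions[partition]
    while committed[partition] < len(log):
        msg = log[committed[partition]]
        status[msg["request_id"]] = "processing"
        try:
            content = call_llm(msg["prompt"])
            results.produce(
                msg["user_id"], {"request_id": msg["request_id"], "content": content}
            )
            status[msg["request_id"]] = "done"
        except Exception as exc:
            # Route the failure aside so it does not block the partition.
            dlq.produce(msg["user_id"], {**msg, "error": str(exc)})
            status[msg["request_id"]] = "failed"
        committed[partition] += 1  # advance only after the record is handled


def fake_llm(prompt: str) -> str:
    """Hypothetical model call that fails on one specific input."""
    if "boom" in prompt:
        raise RuntimeError("model unavailable")
    return prompt.upper()


first = submit_request("user-42", "hello")
second = submit_request("user-42", "boom")
committed = defaultdict(int)
for p in range(NUM_PARTITIONS):
    run_worker(p, committed, fake_llm)
```

Both requests for `user-42` land on the same partition in submission order; the failing one ends up in the dead letter topic while the other produces a result. A real deployment would replace the in-memory pieces with a Kafka client and an external store.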
// API Gateway receives user request
WHEN user request arrives:
    request_id = generate_uuid()
    redis.set(request_id, status="queued")
    kafka.produce(
        topic   = "ai-inference-requests",
        key     = user_id,      // same user → same partition → ordered
        payload = { request_id, prompt, model, user_id }
    )
    RETURN { request_id, status: "queued" }

// LLM Worker (one of N running instances)
CONTINUOUSLY consume from "ai-inference-requests" (group="llm-workers"):
    redis.update(request_id, status="processing")
    response = call_llm_api(prompt, model)
    kafka.produce(
        topic   = "ai-inference-results",
        key     = user_id,
        payload = { request_id, content, usage, status }
    )
    manual_commit(offset)       // only AFTER successful processing

    // On failure: send to Dead Letter Queue
    IF llm call fails:
        kafka.produce(topic="ai-inference-requests.dlq", ...)
        manual_commit(offset)

Pattern 2: Real-Time RAG Document Pipeline

Retrieval-Augmented Generation (RAG) is the dominant pattern for grounding LLMs in proprietary or recent knowledge. But keeping a vector database current, processing thousands of documents as they arrive, requires a streaming pipeline. This is Kafka’s home territory.

Ingest: A document (PDF, web page, database record) arrives and is published as a raw event to the document-ingestion topic. The producing service returns immediately, with no waiting for processing.

Chunk: A chunker worker consumes the document and splits it into semantically meaningful segments, typically 256–512 tokens with some overlap. Each chunk is published to a document-chunks topic.

Embed: An embedding worker batches incoming chunks and passes them to an embedding model. The results are vectors: numerical representations of meaning. Embedding in batches can substantially reduce API cost and per-chunk latency.

Store: Vectors and their source chunks are upserted into the vector database (Pinecone, Weaviate, pgvector). The document is now searchable by semantic similarity within seconds of arrival.
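The Chunk and Embed steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions: it uses a fixed-size sliding window rather than true semantic splitting, treats list items as tokens instead of using a model tokenizer, and the names `chunk_tokens` and `batched` are hypothetical helpers, not part of any library.

```python
from typing import Iterator


def chunk_tokens(
    tokens: list[str], chunk_size: int = 512, overlap: int = 64
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens
    with the previous window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


def batched(items: list, batch_size: int = 32) -> Iterator[list]:
    """Yield consecutive batches so the embedding model is called per batch."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


# A stand-in document of 1000 "tokens".
tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_tokens(tokens)
batches = list(batched(chunks, batch_size=2))
```

With the defaults, each window starts 448 tokens after the previous one, so neighbouring chunks share 64 tokens of context. In the pipeline, every chunk would be produced to the document-chunks topic and the embedding worker would send one batch per model call.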
// Stage 1: Ingest
FUNCTION ingest_document(doc_id, raw_text, metadata):
    kafka.produce(
        topic   = "document-ingestion",
        key     = doc_id,
        payload = { doc_id, raw_text, metadata, timestamp }
    )

// Stage 2: Chunker Worker
CONTINUOUSLY consume from "document-ingestion" (group="chunkers"):
    chunks = split_by_semantics(
        text       = doc.raw_text,
        chunk_size = 512 tokens,
        overlap    = 64 tokens
    )
    FOR EACH chunk IN chunks:
        kafka.produce(
            topic   = "document-chunks",
            key     = doc_id + "-" + chunk_index,
            payload = { doc_id, chunk_id, content, metadata }
        )
    commit_offset()

// Stage 3: Embedding Worker (batched for efficiency)
CONTINUOUSLY consume batch from "document-chunks" (batch_size=32):
    contents = extract_text_from(batch_chunks)
    vectors  = embedding_model.encode_batch(contents)
    vector_db.upsert(ids=chunk_ids, vectors=vectors, metadata=metadata)
    commit_offset()

Pattern 3: Multi-Agent Orchestration

Agentic AI systems, where multiple specialized AI agents collaborate to complete complex tasks, are the frontier of GenAI architecture. Kafka is a natural fit for agent coordination: each agent is simply a consumer that reads tasks, does work, and publishes results or subtasks to other topics.

The Planner agent breaks down a complex request into subtasks and publishes them to specialized topics. Research agents pull from the research queue, gather information, and publish findings. Code agents write and test code. Review agents validate outputs. Each runs independently and at its own scale. No agent needs to know the others exist, only the events they share.
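As a rough illustration of this coordination style, the sketch below models topics as in-memory lists and lets a planner route subtasks by type. It is a toy, not Kafka client code. The topic names agent-research, agent-coding, agent-review, and agent-results follow the article's own pseudocode; the `agent-tasks.dlq` fallback topic, the `consume` helper, and the stubbed `fake_decompose` function (standing in for an LLM call) are assumptions made for this example.

```python
from collections import defaultdict

ROUTES = {
    "research": "agent-research",
    "coding": "agent-coding",
    "review": "agent-review",
}
UNROUTABLE_TOPIC = "agent-tasks.dlq"  # hypothetical fallback topic

topics = defaultdict(list)   # topic name -> append-only list of events
offsets = defaultdict(int)   # (group, topic) -> next offset to read


def route_by_type(subtask_type: str) -> str:
    """Pick the topic for a subtask; unknown types go to the fallback."""
    return ROUTES.get(subtask_type, UNROUTABLE_TOPIC)


def consume(group: str, topic: str):
    """Yield unread events; the offset advances after each one is handled."""
    log = topics[topic]
    while offsets[(group, topic)] < len(log):
        yield log[offsets[(group, topic)]]
        offsets[(group, topic)] += 1


def plan(task: dict, decompose) -> None:
    """Planner: split a task and publish each subtask to its topic."""
    for index, subtask in enumerate(decompose(task["description"])):
        event = {
            "task_id": task["task_id"],
            "subtask_id": f'{task["task_id"]}-{index}',
            **subtask,
        }
        topics[route_by_type(subtask["type"])].append(event)


def research_agent(search) -> None:
    """Research agent: knows only its input topic and the results topic."""
    for subtask in consume("researchers", "agent-research"):
        topics["agent-results"].append(
            {
                "task_id": subtask["task_id"],
                "agent": "research",
                "findings": search(subtask["description"]),
            }
        )


def fake_decompose(description: str) -> list[dict]:
    """Stub for the planner's LLM call (an assumption for this sketch)."""
    return [
        {"type": "research", "description": f"background for: {description}"},
        {"type": "coding", "description": f"implement: {description}"},
        {"type": "unknown", "description": "unclassified step"},
    ]


plan({"task_id": "t1", "description": "build a rate limiter"}, fake_decompose)
research_agent(lambda query: f"notes on {query}")
```

The research agent never references the planner or the other agents; it only reads one topic and writes to another. Because events stay in the list and only the group's offset moves, a second consumer group could read the same subtasks independently.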
// Planner Agent: consumes tasks, routes subtasks
CONTINUOUSLY consume from "agent-tasks" (group="planners"):
    plan = llm_call(prompt = "Break this task into subtasks: " + task.description)
    FOR EACH subtask IN plan.subtasks:
        target_topic = route_by_type(subtask.type)
            // research → "agent-research"
            // coding   → "agent-coding"
            // review   → "agent-review"
        kafka.produce(target_topic, subtask)

// Research Agent: fully independent, scales separately
CONTINUOUSLY consume from "agent-research" (group="researchers"):
    findings = search_and_synthesize(subtask.description)
    kafka.produce(
        topic   = "agent-results",
        payload = { task_id, agent: "research", findings }
    )

Running Kafka in a demo is easy. Running it in production requires deliberate configuration: message loss means a user’s LLM query disappears forever, a consumer group crash must not corrupt the pipeline, and one bad message must not block an entire partition.

The Producer Side: Reliability Settings

Three configuration decisions define whether your producer is production-grade.

PRODUCER_CONFIG = {
    acks               : "all",       // wait for ALL in-sync replicas
    enable_idempotence : true,        // no duplicates from producer retries
    retries            : 10,
    retry_backoff_ms   : 500,
    linger_ms          : 5,           // batch messages for 5ms → higher throughput
    batch_size         : 65536,       // 64KB batches
    compression        : "snappy",    // fast compression, reduces network and storage use
    security_protocol  : "SASL_SSL",  // always TLS in production
}

FUNCTION publish(topic, key, payload):
    producer.produce(topic, key, payload, on_delivery=delivery_callback)

FUNCTION delivery_callback(error, message):
    IF error:
        log.error("Delivery failed", error)
        alert_ops()     // never silently lose messages
    ELSE:
        log.info(partition, offset, timestamp)

The Consumer Side: At-Least-Once Processing

The most common production mistake is leaving auto-commit enabled on the consumer. With auto-commit, the client commits offsets periodically in the background, so an offset can be committed before your application has actually finished processing the message.
If your LLM call fails after commit, the message is gone. The correct approach is manual offset commit: commit the offset only after successful processing. This gives at-least-once delivery: a crash between processing and commit means the message is processed again on restart, so downstream handling should tolerate duplicates (for example, by deduplicating on the request ID).

CONSUMER_CONFIG = {
    group_id           : "llm-workers",
    auto_offset_reset  : "earliest",  // start from beginning if no offset stored
    enable_auto_commit : false,       // CRITICAL: never auto-commit
    session_timeout_ms : 30000,
    max_poll_records   : 500,         // lower this if each record triggers a slow LLM call,
                                      // or the batch may outlast max.poll.interval.ms
}

CONTINUOUSLY poll messages:
    FOR EACH message IN batch:
        TRY:
            result = call_llm_api(message.payload)
            kafka.produce("ai-inference-results", result)
            manual_commit(message.offset)    // ONLY on success
        EXCEPT:
            send_to_dead_letter_queue(message)
            manual_commit(message.offset)    // still commit, don't block partition

In production, messages will fail to process. The LLM API might be down. The payload might be malformed. Your worker might crash mid-processing. The Dead Letter Queue (DLQ) pattern handles failures such as these gracefully: failed messages are routed to a topic.dlq topic with metadata about why they failed, where they can be inspected, replayed, or alerted on, without blocking the main pipeline.

Consumer lag is the number of messages in a topic that have been published but not yet consumed by a given consumer group. In a GenAI system, lag is the direct measure of how backed up your LLM workers are. A lag of zero means your workers are keeping up. A lag of ten thousand means users are waiting. A lag of a million means something is fundamentally wrong: a worker died, the LLM API is down, or you need to scale.

Production Monitoring Rule
Set an alert on consumer lag for your LLM worker group. A threshold of 1,000 unprocessed messages should trigger a scale-out event or on-call notification. Kafka makes this easy: consumer lag is a first-class observable metric.

When your LLM request queue is backing up, scaling is simple. Start more worker processes in the same consumer group.
Kafka automatically rebalances partitions across all workers, distributing the load without any configuration change or deployment restart, up to a limit of one worker per partition. Set up a Kubernetes Horizontal Pod Autoscaler driven by consumer lag (exposed as an external metric) and your system scales itself.

// Stream processing: count requests per user per 60-second window
STREAM from topic "ai-inference-requests"
    | group by user_id
    | tumbling window (duration=60 seconds)
    | count
    | filter count > RATE_LIMIT
    | produce to "rate-limit-violations"

// Agent reads violations and blocks those users' requests
CONTINUOUSLY consume from "rate-limit-violations":
    redis.set(user_id, "rate_limited", ttl=60 seconds)
    notify_user(user_id, "Rate limit reached. Try again in 60s.")

Every pattern in this article distills into a set of rules. These are the decisions that separate a Kafka prototype from a Kafka production system.

“You don’t add Kafka to a GenAI system to make it faster. You add it to make it survivable.”
// On the difference between a prototype and a production AI platform

The deepest thing Kafka teaches is not a technical pattern; it’s a way of thinking about systems. When you adopt Kafka, you stop thinking about your architecture as a set of services calling each other, and start thinking about it as a universe of events that services react to. This shift is not cosmetic. It changes how you debug, how you scale, how you audit, and how you recover from failure.

For GenAI systems, this philosophy is not optional; it’s structural. LLMs are slow, expensive, and occasionally unavailable. The only way to build a platform that thousands of users can rely on is to decouple every inference from the request that triggered it, and let a durable, distributed log hold the contract between them. Kafka is that log. And in the age of generative AI, the teams who understand it deeply will build systems that simply outlast the ones that don’t.
Apache Kafka & The Age of GenAI (https://pub.towardsai.net/apache-kafka-the-age-of-genai-18d6745afb91) was originally published in Towards AI (https://pub.towardsai.net) on Medium.