From a distributed log invented at LinkedIn to the nervous system powering the world's most demanding AI applications: the complete story of Apache Kafka, and why it became the default backbone for production GenAI systems.
In 2010, engineers at LinkedIn faced a crisis that most fast-growing tech companies quietly face: their services were talking to each other in every possible direction, and the whole system had become a fragile, tangled mess. Every new feature required wiring up yet another point-to-point connection. When one service was slow, everything behind it backed up. Data was duplicating, going missing, and arriving out of order.
They built something to solve it. They called it Kafka, after the author Franz Kafka, because the original developer considered it "a system optimised for writing" and thought a writer's name fit. In 2011, LinkedIn open-sourced it. By 2025, it had become one of the most widely deployed event streaming platforms in the world, and a common foundation for production GenAI systems.
Kafka does not simply move data from A to B. It creates a central, durable, replayable record of everything that happens in a system, and lets any service read from it, at any time, at any pace.
The Newspaper Analogy
Forget distributed systems jargon for a moment. Think about how a newspaper works. Journalists write stories. The printing press publishes them into sections: Sports, Finance, World News. Readers subscribe to the sections they care about. They read at their own pace. A reader who picks up Monday's paper on Wednesday still gets Monday's news. The paper didn't disappear just because someone else already read it.
Kafka is exactly this, but for software events. Services produce events. Those events go into named topics. Other services consume from those topics, at their own pace, without disrupting producers or each other. And like that newspaper archive, messages don't vanish after being read. They're retained for a configurable period (commonly seven days), allowing any consumer to replay history whenever needed.
Why Kafka Isn't Just a Queue
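When engineers first encounter Kafka, they often make a category error: they assume it's a message queue, like RabbitMQ or Amazon SQS. It isn't, and the difference matters enormously. In a classic queue, a message is handed to one consumer and deleted once acknowledged. In Kafka, messages stay in the log, and each consumer group reads all of them independently, tracking its own position.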
This distinction becomes critical in GenAI. When an LLM produces a response, you might need to log it, route it to a websocket server, run it through a safety filter, store it in a database, and feed it to an analytics pipeline, all simultaneously and all independently. With a classic queue, that means five separate queues, with the producer writing every response five times. Kafka requires one topic and five consumer groups.
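The fan-out behaviour described above can be sketched with a toy in-memory log. This is only an illustration under simplifying assumptions (one partition, no persistence); `TopicLog`, `produce`, and `poll` are hypothetical names, not a Kafka client API.

```python
from collections import defaultdict


class TopicLog:
    """Toy single-partition topic: an append-only list plus one offset per group."""

    def __init__(self):
        self.records = []                  # records are retained after being read
        self.offsets = defaultdict(int)    # group name -> next offset to read

    def produce(self, value):
        self.records.append(value)
        return len(self.records) - 1       # offset assigned to the new record

    def poll(self, group, max_records=10):
        start = self.offsets[group]
        batch = self.records[start:start + max_records]
        self.offsets[group] = start + len(batch)   # advance only this group
        return batch


log = TopicLog()
for text in ["resp-1", "resp-2", "resp-3"]:
    log.produce(text)

# Two independent groups each see every record, at their own pace.
safety_batch = log.poll("safety-filter", max_records=2)
analytics_batch = log.poll("analytics")
safety_rest = log.poll("safety-filter")
```

Reading never removes anything from `records`, which is the property that lets a sixth consumer group be added later and still replay the retained history.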
A Kafka cluster is a group of broker nodes: servers that store and serve event data. Each topic is divided into partitions, and those partitions are distributed across brokers. This is the key to Kafka's horizontal scalability: adding brokers and partitions spreads the load across more machines, so throughput grows roughly in proportion.
    // Kafka Cluster: Logical View
    ┌────────────────────────────────────────────────────────────┐
    │                       KAFKA CLUSTER                        │
    │                                                            │
    │ ┌────────────────┐  ┌────────────────┐  ┌────────────────┐ │
    │ │ Broker 1       │  │ Broker 2       │  │ Broker 3       │ │
    │ │                │  │                │  │                │ │
    │ │ Topic-A P0     │  │ Topic-A P1     │  │ Topic-A P2     │ │
    │ │ [LEADER]       │  │ [LEADER]       │  │ [LEADER]       │ │
    │ │                │  │                │  │                │ │
    │ │ Topic-A P1     │  │ Topic-A P2     │  │ Topic-A P0     │ │
    │ │ [REPLICA]      │  │ [REPLICA]      │  │ [REPLICA]      │ │
    │ └────────────────┘  └────────────────┘  └────────────────┘ │
    │                                                            │
    │                     Controller (KRaft)                     │
    │             manages metadata & leader election             │
    └────────────────────────────────────────────────────────────┘
            ↑ Producers write                ↓ Consumers read
Each partition is an ordered, immutable log. New events are appended to the end. Consumers read sequentially and track their position with an offset, the index of the next record to read. The critical rule: messages within a partition are always ordered. Messages across partitions are not.
This matters in GenAI when, for example, you need all messages from the same user's conversation to arrive in the correct sequence. The solution is to use the user's ID as the message key. Kafka consistently routes all messages with the same key to the same partition, preserving order per user while still distributing load across partitions.
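A minimal sketch of key-based routing follows. Kafka's Java client hashes key bytes with murmur2; CRC32 is swapped in here purely to keep the example dependency-free, and `partition_for` is a hypothetical helper rather than a client API.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32, for illustration only)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


events = [("user-42", "hi"), ("user-7", "hello"), ("user-42", "follow-up")]
partitions = {p: [] for p in range(3)}
for key, value in events:
    # Same key -> same partition, so per-user order is preserved.
    partitions[partition_for(key, 3)].append((key, value))
```

Because the mapping depends on the partition count, adding partitions to an existing topic changes where a given key lands, which is worth planning for before relying on per-key ordering.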
Every partition has one leader and one or more follower replicas. Producers always write to the leader. Followers continuously sync from the leader. If a broker fails, one of the in-sync replicas is automatically elected as the new leader, with no loss of acknowledged writes when producers use acks=all and enough replicas stay in sync.
Production Rule
Always run with replication factor 3 and min in-sync replicas 2, and have producers use acks=all. This means your data survives the loss of any single broker while writes continue. If a second broker fails and only one in-sync replica remains, writes safely stop rather than accepting data that can't be replicated.
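The rule can be worked through with a few lines of arithmetic. This sketch assumes a simplified model in which every failed broker hosted a replica of the partition and producers use acks=all; `write_accepted` is a hypothetical helper.

```python
def write_accepted(in_sync_replicas: int, min_insync_replicas: int = 2) -> bool:
    """With acks=all, a write succeeds only if enough replicas are in sync."""
    return in_sync_replicas >= min_insync_replicas


REPLICATION_FACTOR = 3

# Number of failed brokers -> are writes still accepted?
outcomes = {
    failed: write_accepted(REPLICATION_FACTOR - failed)
    for failed in range(REPLICATION_FACTOR + 1)
}
```

With these settings one failure is tolerated transparently, and a second failure makes the partition unavailable for writes instead of silently under-replicated.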
Consumer groups are Kafka's elegant solution to the problem of parallel processing. When multiple consumer instances belong to the same group, Kafka automatically divides the partitions between them, so no two consumers in the same group ever read the same partition at the same time. Add more consumers, and Kafka rebalances the workload automatically.
This makes scaling your LLM worker pool simple: start more workers in the same consumer group, and Kafka will distribute the topic's partitions across all of them without any manual routing logic. The partition count is the ceiling on parallelism, since workers beyond that number sit idle.
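To make the partition-to-worker split concrete, here is a simplified round-robin assignment. Kafka ships several assignment strategies (range, round-robin, sticky variants); this sketch only illustrates the idea, and `assign_round_robin` is a hypothetical function.

```python
def assign_round_robin(partitions, consumers):
    """Deal partitions out to consumers one at a time, like dealing cards."""
    assignment = {c: [] for c in consumers}
    ordered = sorted(consumers)
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


two_workers = assign_round_robin(range(6), ["w1", "w2"])
eight_workers = assign_round_robin(range(6), [f"w{i}" for i in range(1, 9)])

# Workers that received no partition do no work at all.
idle = [c for c, parts in eight_workers.items() if not parts]
```

With six partitions, the seventh and eighth workers receive nothing, which is why the partition count should be chosen with the largest expected worker pool in mind.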
Large Language Models are extraordinary at what they do, and terrible at the things distributed systems need to be good at. An LLM call can take anywhere from one to thirty seconds. It costs real money. It might fail. The service behind it can't handle a thousand simultaneous synchronous requests without collapsing.
This is precisely where Kafka becomes indispensable. Instead of hammering an LLM API directly and synchronously from your application layer, you model the system as events: a request arrives, gets written to Kafka, and a pool of workers consumes and processes those requests at a controlled rate. The requester gets an acknowledgment immediately, and the actual LLM response is delivered asynchronously when it's ready.
"Kafka turns your LLM API from a fragile synchronous bottleneck into a resilient, observable, scalable pipeline."
// The fundamental shift in AI infrastructure architecture
The GenAI Event Map
Every mature GenAI production system resolves into a handful of recurring patterns. Kafka is the connective tissue between all of them.
    // GenAI Event Pipeline: Full Architecture
    User / Application
            │
            ▼
       API Gateway
            │
            ├──► [ai-inference-requests] ──► LLM Worker Pool (N workers)
            │            TOPIC                       │
            │                               Claude / GPT / Llama
            │                                        │
            │                                        ▼
            │                             [ai-inference-results]
            │                                        │
            │                               WebSocket Handler ← (streams to user)
            │
            ├──► [document-ingestion] ──► Chunker Workers
            │                                    │
            │                             [document-chunks]
            │                                    │
            │                             Embedding Workers
            │                                    │
            │                              Vector Database
            │                       (Pinecone / Weaviate / pgvector)
            │
            ├──► [agent-tasks] ──► Planner Agent
            │                            │
            │              ┌─────────────┼─────────────┐
            │              ▼             ▼             ▼
            │          Research        Coder       Reviewer
            │           Agent          Agent         Agent
            │              └─────────────┼─────────────┘
            │                            ▼
            │                     [agent-results]
            │
            └──► [ai-metrics] ──► Observability Stack (Prometheus / Grafana)
Pattern 1: The LLM Request Queue
The most common and immediately valuable pattern. Every inference request is treated as an event. The application produces it to a Kafka topic and returns a request ID to the caller. A pool of LLM worker processes consumes from that topic, calls the model, and publishes the result to a response topic. A websocket or polling mechanism delivers the result back to the user.
    // API Gateway receives user request
    WHEN user_request arrives:
        request_id = generate_uuid()
        redis_set(request_id, status="queued")
        kafka_produce(
            topic   = "ai-inference-requests",
            key     = user_id,   // same user -> same partition -> ordered
            payload = { request_id, prompt, model, user_id }
        )
        RETURN { request_id, status: "queued" }

    // LLM Worker (one of N running instances)
    CONTINUOUSLY consume_from("ai-inference-requests", group="llm-workers"):
        redis_update(request_id, status="processing")
        TRY:
            response = call_llm_api(prompt, model)
            kafka_produce(
                topic   = "ai-inference-results",
                key     = user_id,
                payload = { request_id, content, usage, status }
            )
        ON FAILURE:
            // send to the Dead Letter Queue instead of blocking the partition
            kafka_produce(topic="ai-inference-requests.dlq", ...)
        manual_commit_offset()   // only AFTER the result or DLQ record is produced
Pattern 2: Real-Time RAG Document Pipeline
Retrieval-Augmented Generation (RAG) is the dominant pattern for grounding LLMs in proprietary or recent knowledge. But keeping a vector database current, processing thousands of documents as they arrive, requires a streaming pipeline. This is Kafka's home territory.
Ingest
A document (PDF, web page, database record) arrives and is published as a raw event to the document-ingestion topic. The producing service returns immediately, without waiting for processing.
Chunk
A chunker worker consumes the document and splits it into semantically meaningful segments, typically 256 to 512 tokens with some overlap. Each chunk is published to a document-chunks topic.
Embed
An embedding worker batches incoming chunks and passes them to an embedding model. The results are vectors: numerical representations of meaning. Embedding in batches reduces the number of API calls and the per-chunk overhead.
Store
Vectors and their source chunks are upserted into the vector database (Pinecone, Weaviate, pgvector). The document is now searchable by semantic similarity within seconds of arrival.
    // Stage 1: Ingest
    FUNCTION ingest_document(doc_id, raw_text, metadata):
        kafka_produce(
            topic   = "document-ingestion",
            key     = doc_id,
            payload = { doc_id, raw_text, metadata, timestamp }
        )

    // Stage 2: Chunker Worker
    CONTINUOUSLY consume_from("document-ingestion", group="chunkers"):
        chunks = split_by_semantics(
            text       = doc.raw_text,
            chunk_size = 512 tokens,
            overlap    = 64 tokens
        )
        FOR EACH chunk IN chunks:
            kafka_produce(
                topic   = "document-chunks",
                key     = doc_id + "-" + chunk_index,
                payload = { doc_id, chunk_id, content, metadata }
            )
        commit_offset()   // after every chunk of the document is produced

    // Stage 3: Embedding Worker (batched for efficiency)
    CONTINUOUSLY consume_batch_from("document-chunks", group="embedders", batch_size=32):
        contents = extract_text_from_batch(chunks)
        vectors  = embedding_model.encode_batch(contents)
        vector_db.upsert(ids=chunk_ids, vectors=vectors, metadata=metadata)
        commit_offset()   // after the upsert succeeds
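The chunk and batch stages can be sketched in plain Python. Note the swap: this uses a fixed-size sliding window over tokens, not semantic splitting, and the integer list stands in for real tokenizer output. `chunk_tokens` and `batched` are hypothetical helpers.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break                      # this window already reaches the end
    return chunks


def batched(items, batch_size=32):
    """Yield consecutive fixed-size batches; the final batch may be shorter."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


tokens = list(range(1000))             # stand-in for real tokenizer output
chunks = chunk_tokens(tokens)
batches = list(batched(chunks, batch_size=2))
```

The overlap means a sentence that straddles a window boundary appears whole in at least one chunk, at the cost of embedding some tokens twice.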
Pattern 3: Multi-Agent Orchestration
Agentic AI systems, where multiple specialized AI agents collaborate to complete complex tasks, are the frontier of GenAI architecture. Kafka is a natural fit for agent coordination: each agent is simply a consumer that reads tasks, does work, and publishes results or subtasks to other topics.
The Planner agent breaks down a complex request into subtasks and publishes them to specialized topics. Research agents pull from the research queue, gather information, and publish findings. Code agents write and test code. Review agents validate outputs. Each runs independently and at its own scale, without any agent knowing the others exist. They share only the events.
    // Planner Agent: consumes tasks, routes subtasks
    CONTINUOUSLY consume_from("agent-tasks", group="planners"):
        plan = llm_call(
            prompt = "Break this task into subtasks: " + task.description
        )
        FOR EACH subtask IN plan.subtasks:
            target_topic = route_by_type(subtask.type)
            // research -> "agent-research"
            // coding   -> "agent-coding"
            // review   -> "agent-review"
            kafka_produce(target_topic, subtask)

    // Research Agent: fully independent, scales separately
    CONTINUOUSLY consume_from("agent-research", group="researchers"):
        findings = search_and_synthesize(subtask.description)
        kafka_produce(
            topic   = "agent-results",
            payload = { task_id, agent: "research", findings }
        )
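One possible shape for the routing step is a lookup table with an explicit fallback. The topic names for the three known types come from the pseudocode above; the fallback topic `agent-tasks.dlq`, keying by task ID, and `plan_to_messages` are assumptions added for illustration.

```python
ROUTES = {
    "research": "agent-research",
    "coding": "agent-coding",
    "review": "agent-review",
}
UNROUTABLE_TOPIC = "agent-tasks.dlq"   # hypothetical topic for unknown subtask types


def route_by_type(subtask_type: str) -> str:
    """Pick the topic for a subtask type, falling back to a dead-letter topic."""
    return ROUTES.get(subtask_type, UNROUTABLE_TOPIC)


def plan_to_messages(task_id, subtasks):
    """Turn a planner's subtasks into (topic, key, payload) tuples keyed by task."""
    return [
        (route_by_type(s["type"]), task_id, {"task_id": task_id, **s})
        for s in subtasks
    ]


messages = plan_to_messages("t-1", [
    {"type": "research", "description": "find benchmarks"},
    {"type": "coding", "description": "write parser"},
    {"type": "poetry", "description": "unknown kind"},
])
```

Since the planner's output comes from an LLM, an unexpected subtask type is a realistic case, and routing it somewhere visible is safer than raising inside the consumer loop.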
Running Kafka in a demo is easy. Running it in production (where message loss means a user's LLM query disappears forever, where a consumer group crash must not corrupt the pipeline, where one bad message must not block an entire partition) requires deliberate configuration.
The Producer Side: Reliability Settings
Three configuration decisions define whether your producer is production-grade.
    PRODUCER_CONFIG = {
        acks               : "all",       // wait for all in-sync replicas
        enable_idempotence : true,        // producer retries cannot create duplicates in the log
        retries            : 10,
        retry_backoff_ms   : 500,
        linger_ms          : 5,           // batch messages for up to 5 ms -> higher throughput
        batch_size         : 65536,       // 64 KB batches
        compression_type   : "snappy",    // fast compression, reduces network and broker storage
        security_protocol  : "SASL_SSL",  // always TLS in production
    }

    FUNCTION publish(topic, key, payload):
        producer.produce(topic, key, payload, on_delivery=delivery_callback)

    FUNCTION delivery_callback(error, message):
        IF error:
            log.error("Delivery failed", error)
            alert_ops()   // never silently lose messages
        ELSE:
            log.info(message.partition, message.offset, message.timestamp)
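For reference, the same settings written with the dotted property names used by Kafka clients might look like the dict below. The broker address is a placeholder, SASL credentials are omitted, and `idempotence_problems` is a hypothetical pre-flight check, not part of any client library.

```python
producer_config = {
    "bootstrap.servers": "broker-1:9093",   # placeholder address
    "acks": "all",
    "enable.idempotence": True,
    "retries": 10,
    "retry.backoff.ms": 500,
    "linger.ms": 5,
    "batch.size": 65536,
    "compression.type": "snappy",
    "security.protocol": "SASL_SSL",
}


def idempotence_problems(config):
    """Return reasons why a config cannot honour enable.idempotence=True."""
    problems = []
    if not config.get("enable.idempotence"):
        return problems
    if str(config.get("acks", "all")) not in ("all", "-1"):
        problems.append("acks must be 'all'")
    if int(config.get("retries", 1)) < 1:
        problems.append("retries must be greater than 0")
    if int(config.get("max.in.flight.requests.per.connection", 5)) > 5:
        problems.append("max.in.flight.requests.per.connection must be <= 5")
    return problems


ok = idempotence_problems(producer_config)
bad = idempotence_problems({**producer_config, "acks": "1"})
```

A dict in this form is what Python clients such as confluent-kafka accept in their producer constructor. Checking the idempotence constraints up front turns a startup-time client error into a readable message.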
The Consumer Side: At-Least-Once Processing
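The most common production mistake is leaving auto-commit enabled on the consumer. With auto-commit, the client commits offsets on a timer for whatever it has already fetched, whether or not your application has finished processing it. If your worker crashes in the middle of an LLM call after such a commit, the message is never redelivered. The correct approach is manual offset commit: commit the offset only after successful processing. This gives at-least-once delivery, so a crash between processing and commit causes a redelivery, and downstream consumers should tolerate duplicates (for example, by deduplicating on request_id).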
    CONSUMER_CONFIG = {
        group_id             : "llm-workers",
        auto_offset_reset    : "earliest",  // start from beginning if no offset stored
        enable_auto_commit   : false,       // CRITICAL: never auto-commit
        session_timeout_ms   : 30000,
        max_poll_records     : 10,          // keep batches small: each record is a slow LLM call
        max_poll_interval_ms : 600000,      // must exceed worst-case time to process one batch
    }

    CONTINUOUSLY poll_messages():
        FOR EACH message IN batch:
            TRY:
                result = call_llm_api(message.payload)
                kafka_produce("ai-inference-results", result)
                manual_commit(message.offset)   // ONLY on success
            EXCEPT:
                send_to_dead_letter_queue(message)
                manual_commit(message.offset)   // still commit: don't block the partition
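The effect of committing only after processing can be seen in a small simulation. This is a sketch with a toy in-memory partition, not a Kafka client; `Partition`, `run_worker`, and the `crash_before_commit` switch are hypothetical.

```python
class Partition:
    """Toy partition with a single committed offset for one consumer group."""

    def __init__(self, records):
        self.records = list(records)
        self.committed = 0               # next offset the group resumes from

    def poll(self):
        return list(enumerate(self.records))[self.committed:]

    def commit(self, offset):
        self.committed = offset + 1      # commits store the next offset to read


def run_worker(partition, handler, crash_before_commit=None):
    """Process records in order, committing each one only after the handler returns."""
    processed = []
    for offset, record in partition.poll():
        processed.append(handler(record))
        if offset == crash_before_commit:
            return processed             # simulated crash: work done, offset not committed
        partition.commit(offset)
    return processed


p = Partition(["a", "b", "c"])
first_run = run_worker(p, str.upper, crash_before_commit=1)
committed_after_crash = p.committed
second_run = run_worker(p, str.upper)    # restart resumes at the uncommitted record
```

Record "b" is handled in both runs: nothing is lost, but one message is processed twice, which is the at-least-once trade-off in miniature.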
In production, messages will fail to process. The LLM API might be down. The payload might be malformed. Your worker might crash mid-processing. The Dead Letter Queue (DLQ) pattern handles failures the worker can catch gracefully: failed messages are routed to a `<topic>.dlq` topic with metadata about why they failed, where they can be inspected, replayed, or alerted on, without blocking the main pipeline.
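A minimal sketch of the retry-then-dead-letter decision is shown below. The topic names follow the article; the retry count, the header fields, and the function names are assumptions, and the function only returns the record that would be produced rather than talking to a broker.

```python
import time


def process_with_dlq(message, handler, max_attempts=3, backoff_s=0.0):
    """Try the handler a few times; on final failure return a DLQ record."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return {"topic": "ai-inference-results", "value": handler(message)}
        except Exception as exc:          # broad on purpose: any failure is routed
            last_error = exc
            time.sleep(backoff_s * attempt)
    return {
        "topic": "ai-inference-requests.dlq",
        "value": message,                 # original payload, kept for replay
        "headers": {
            "error_type": type(last_error).__name__,
            "error_message": str(last_error),
            "attempts": max_attempts,
        },
    }


def example_handler(message):
    """Stand-in for the LLM call: fails on payloads without a prompt."""
    if "prompt" not in message:
        raise KeyError("prompt")
    return message["prompt"].upper()


good = process_with_dlq({"prompt": "hi"}, example_handler)
dead = process_with_dlq({"model": "x"}, example_handler)
```

In a real worker it may be worth distinguishing transient errors (API timeouts, worth retrying) from permanent ones (malformed payloads, which can go to the DLQ immediately).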
Consumer lag is the number of messages that have been published to a topic but not yet processed by a given consumer group. In a GenAI system, lag is the direct measure of how backed up your LLM workers are. A lag of zero means your workers are keeping up. A lag of ten thousand means users are waiting. A lag of a million means something is fundamentally wrong: a worker died, the LLM API is down, or you need to scale.
Production Monitoring Rule
Set an alert on consumer lag for your LLM worker group. A threshold of 1,000 unprocessed messages should trigger a scale-out event or on-call notification. Kafka makes this easy: consumer lag is a first-class observable metric.
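Lag itself is simple arithmetic: for each partition, the log end offset minus the group's committed offset. The sketch below uses made-up offsets and assumes a partition with no committed offset is read from the beginning; `consumer_lag` is a hypothetical helper.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag = log end offset minus the group's committed offset."""
    return {
        p: max(end - committed_offsets.get(p, 0), 0)
        for p, end in end_offsets.items()
    }


LAG_ALERT_THRESHOLD = 1000               # threshold suggested in the rule above

end_offsets = {0: 5200, 1: 4100, 2: 4800}    # illustrative numbers
committed = {0: 5000, 1: 4100, 2: 3900}

lag = consumer_lag(end_offsets, committed)
total_lag = sum(lag.values())
should_alert = total_lag > LAG_ALERT_THRESHOLD
```

Looking at the per-partition breakdown as well as the total helps separate "all workers are slow" from "one partition is stuck behind a single bad worker".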
When your LLM request queue is backing up, scaling is straightforward. Start more worker processes in the same consumer group. Kafka automatically rebalances partitions across all workers, distributing the load without any configuration change or deployment restart, up to one worker per partition. Feed consumer lag to the Kubernetes Horizontal Pod Autoscaler as an external metric (for example, through KEDA's Kafka scaler) and your system scales itself.
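A lag-based scaling rule can be sketched as a single function. The target of 500 messages per worker is an arbitrary illustrative number, and `desired_workers` is hypothetical; the one firm constraint is the cap at the partition count, since extra consumers in a group receive no partitions.

```python
import math


def desired_workers(total_lag, lag_per_worker, num_partitions, min_workers=1):
    """Scale the worker count with lag, but never beyond the partition count."""
    wanted = math.ceil(total_lag / lag_per_worker) if total_lag > 0 else min_workers
    return max(min_workers, min(wanted, num_partitions))


idle_pool = desired_workers(total_lag=0, lag_per_worker=500, num_partitions=12)
busy_pool = desired_workers(total_lag=1100, lag_per_worker=500, num_partitions=12)
capped_pool = desired_workers(total_lag=1_000_000, lag_per_worker=500, num_partitions=12)
```

When the capped case is reached regularly, adding workers no longer helps, and the remaining options are more partitions or faster processing per message.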
    // Stream processing: count requests per user per 60-second window
    STREAM from_topic("ai-inference-requests")
        |> group_by(user_id)
        |> tumbling_window(duration=60 seconds)
        |> count()
        |> filter(count > RATE_LIMIT)
        |> produce_to("rate-limit-violations")

    // Agent reads violations and blocks those users' requests
    CONTINUOUSLY consume_from("rate-limit-violations"):
        redis_set(user_id, "rate_limited", ttl=60 seconds)
        notify_user(user_id, "Rate limit reached. Try again in 60s.")
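The windowed count in the pipeline above can be illustrated over a finite list of events. This is a batch-style sketch of tumbling-window logic, not a streaming engine; timestamps are plain seconds, and both function names are hypothetical.

```python
from collections import Counter


def tumbling_window_counts(events, window_s=60):
    """Count events per (user, window start); windows are fixed and non-overlapping."""
    counts = Counter()
    for user_id, ts in events:
        window_start = int(ts // window_s) * window_s
        counts[(user_id, window_start)] += 1
    return counts


def violations(counts, rate_limit):
    """Return the (user, window start) pairs whose count exceeds the limit."""
    return sorted(key for key, n in counts.items() if n > rate_limit)


events = [("u1", 0), ("u1", 10), ("u1", 59), ("u1", 60), ("u2", 5)]
counts = tumbling_window_counts(events)
over = violations(counts, rate_limit=2)
```

The event at second 60 starts a new window, so a burst that straddles a window boundary can slip under the limit, a known property of tumbling windows.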
Every pattern in this article distills into a set of rules. These are the decisions that separate a Kafka prototype from a Kafka production system.
"You don't add Kafka to a GenAI system to make it faster. You add it to make it survivable."
// On the difference between a prototype and a production AI platform
The deepest thing Kafka teaches is not a technical pattern. It's a way of thinking about systems. When you adopt Kafka, you stop thinking about your architecture as a set of services calling each other, and start thinking about it as a universe of events that services react to. This shift is not cosmetic. It changes how you debug, how you scale, how you audit, and how you recover from failure.
For GenAI systems, this philosophy is not optional: it's structural. LLMs are slow, expensive, and occasionally unavailable. The only way to build a platform that thousands of users can rely on is to decouple every inference from the request that triggered it, and let a durable, distributed log hold the contract between them.
Kafka is that log. And in the age of generative AI, the teams who understand it deeply will build systems that simply outlast the ones that don't.
Apache Kafka & The Age of GenAI was originally published in Towards AI on Medium.