How to Design an Event-Driven Analytics Pipeline — A System Design Deep Dive

A fintech startup designed an event-driven analytics pipeline to calculate agent commissions from raw events. The system uses an RDBMS staging table, a cron job to copy events to AWS S3, and AWS Lambda for report generation, with a Kafka-based controller for on-demand requests. The architecture prioritizes simplicity and eventual consistency over low-latency reads.

Interview prep series: real problems, real trade-offs, no hand-waving. Here's a scenario straight from a fintech startup I worked at. We had field agents responsible for onboarding customers, managing investments, and handling deposits and withdrawals. Management wanted to shift from a fixed salary model to fixed + commission — where commission depended on each agent's monthly performance. The immediate instinct was: just write a SQL query. And that's exactly what we did, initially. But it quickly fell apart: The underlying issue wasn't the SQL. It was that the system was storing events but not state — and retrieving current state required replaying all those events manually every single time. This is the problem the design solves. Before jumping to architecture, get the requirements sharp. Functional: Non-Functional: Notice what's not in the requirements: low latency reads, strong consistency, sub-second aggregations. This is an eventually consistent, batch-oriented analytics system. That shapes every decision downstream. Quick back-of-envelope to understand the data shape: At this scale, you don't need a distributed streaming platform. Kafka is present here, but for a different reason — we'll get to that. The design splits into three loosely coupled parts. This is the plug-and-play property: each part can fail or be replaced without breaking the others. Every front-end web, mobile, agent fires events on meaningful user actions. The back-end receives these events, enriches them based on event type adds agent ID, customer context, timestamp, session metadata , and writes them to the existing RDBMS. Why RDBMS at this stage? Because each event is: A cron job runs at a configured interval, reads the unprocessed events, copies them to AWS S3 , and on success marks them for deletion. A second pass deletes the marked records. This two-step mark → delete pattern is intentional — if the S3 write fails mid-batch, nothing gets deleted. You get at-least-once delivery to S3, which is fine for analytics. Data lifecycle: After 12 months in S3, events are moved to local NAS or cold storage like S3 Glacier . This keeps active storage costs low without destroying the audit trail. Web App / Mobile / Agent App | ▼ Back-end Service → enriches event → RDBMS Staging Table | Cron Job | AWS S3 Bucket | after 12 months → NAS / Cold Storage Key interview insight: The cron-based approach is deliberately simple. In a higher-volume system you'd replace it with a CDC Change Data Capture mechanism using Debezium or similar — but at this scale, simplicity wins. Once data lands in S3, report generation becomes a data transformation problem. Python scripts and AWS Lambda read from S3, apply aggregations, and write results CSV, JSON back to a designated S3 output directory. Lambda works well here because: Generated reports are stored in S3 and a notification email with a download link is sent to the requester. The report is never pushed directly — only a link. This decouples generation latency from delivery. This is the operability layer — and it's often missing from system design answers. The controller is a small Java module that: The on-demand API is fire-and-forget . The client calls POST /reports/generate with parameters, gets back a 202 Accepted immediately. The request is pushed to a Kafka topic . A consumer picks it up, processes the report, writes to S3, and emails the link. Why Kafka here, not a simple task queue or direct Lambda invocation? Client → POST /reports/generate | REST Controller → 202 Accepted | Kafka Topic: report-requests | Report Consumer | S3 Output → Email Notification These are the questions an interviewer is likely to probe: Why not stream events directly to S3 instead of staging in RDBMS first? You could use Kinesis Firehose or Kafka directly. But at this scale, staging in RDBMS first keeps the architecture within the team's existing operational expertise. The cron approach is easy to monitor, debug, and replay if something goes wrong. Streaming adds operational overhead that isn't justified until volume demands it. Why not use a dedicated analytics DB Redshift, BigQuery instead of S3 + Lambda? For ad hoc SQL queries and dashboards, a warehouse would be better. But the requirement here is structured reports on a schedule, not exploratory analytics. S3 + Lambda is cheaper, has no idle cost, and fits the access pattern exactly. What happens if the cron fails halfway through a batch? The mark-then-delete pattern handles this. Undeleted records will be picked up in the next cron run and re-written to S3 idempotent if keyed properly . At-least-once delivery is acceptable for analytics. What if report volume grows 100x? Replace the cron with CDC + Kafka Streams for near-real-time event flow. Move report generation from Lambda to a Spark or Flink job on EMR for large aggregations. The S3-centric design makes this migration possible without changing the data model. Three things worth remembering from this design: Decoupling at every boundary. The front-ends don't know about S3. The cron doesn't know about report formats. The controller doesn't know about Lambda. Each part can be replaced independently. Matching storage to access pattern. RDBMS for write-once structured events. S3 for batch reads and cost-effective archival. Kafka for durable async processing. No single store is doing all three jobs. Operability is part of the design. The controller module — configurable schedules, on-demand API, email notifications — is what turns a pipeline into a product. Systems that work but can't be operated in production are incomplete designs. | Question | Answer | |---|---| | Why stage in RDBMS before S3? | Write-once, structured, transactional — fits the access pattern; simpler than streaming at this scale | | Why Kafka for on-demand reports? | Durable, replayable, handles backpressure — fire-and-forget API needs async processing guarantee | | Why Lambda for report generation? | Stateless batch jobs, pay-per-use, no idle cost — matches the workload perfectly | | How do you handle cron failure? | Mark-for-deletion before delete — unprocessed records are re-picked on next run | | How do you scale this? | CDC + Kafka Streams replaces cron; Spark/Flink replaces Lambda at high volume | | What's the data lifecycle? | Hot RDBMS staging → Warm S3, 12 months → Cold NAS/Glacier | The best system design answers don't start with "let's use Kafka and microservices." They start with the problem, reason through the constraints, and arrive at the simplest architecture that meets the requirements — with a clear articulation of what you'd change if those constraints shifted. This pipeline is deliberately simple. And that's the point. Found this useful? I write about distributed systems, fintech engineering, and senior-level interview prep at amkantawala.hashnode.dev. Drop a reaction if you'd like more designs in this series.