Designing a Reliable Notification System for 1M+ Users (Push, SMS, Email)

The article explains that for fintech platforms with over 1 million users, notification systems must be designed as resilient distributed systems rather than simple API calls. It recommends using asynchronous message queues (like Redis Streams, SQS, or Kafka) to decouple request handling from delivery, implementing idempotency keys to prevent duplicate sends, and employing a provider router with fallback options to handle third-party failures gracefully.

In fintech, notifications are not a “nice-to-have” feature. They’re part of the product’s trust layer. If a user transfers money and doesn’t get a confirmation, they panic. If an OTP arrives 3 minutes late, login fails. If price alerts come twice, users lose confidence fast.

At small scale, sending notifications feels simple:

Application → Twilio/SendGrid → Done

But once you’re dealing with millions of users, multiple channels, retries, provider outages, and traffic spikes, notification systems become distributed systems problems. And distributed systems are mostly about handling failure gracefully.

Imagine a fintech platform sending transfer confirmations, OTPs, and price alerts to over 1 million users across push, SMS, and email. The challenge isn’t just “sending messages.” The real challenge is making sure the system keeps delivering through retries, provider outages, and traffic spikes.

Here’s the architecture I’d use.

               App Service
                    │
                    ▼
           Notification Queue
       Redis Streams / SQS / Kafka
                    │
                    ▼
               Worker Pool
                    │
                    ▼
             Provider Router
                    │
      ┌─────────────┼─────────────┐
      ▼             ▼             ▼
     SMS          Email         Push
      │             │             │
      ▼             ▼             ▼
   Twilio       SendGrid      FCM/APNs
                    │
                    ▼
           Fallback Providers
     Termii / Mailgun / Direct APNs
                    │
                    ▼
    Delivery Log + Idempotency Store
           PostgreSQL + Redis

One of the biggest mistakes teams make early on is sending notifications directly from the API request cycle. That works… until traffic spikes. Imagine Black Friday, a crypto market crash, or salary payment day. Suddenly, millions of notifications need to go out almost at once. If your application waits for Twilio or SendGrid to respond before returning a response to the user, your entire app becomes hostage to external providers. That’s dangerous.

Instead, the API should do one thing: accept the request quickly and push a notification event into a queue. From there, worker processes handle delivery asynchronously. This changes the system completely. Queues give you a buffer: if providers slow down, the queue absorbs the spike instead of crashing your application.

At this scale, queues stop being optional infrastructure. They become the safety buffer protecting the rest of your system.

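To make the queue-based decoupling concrete, here is a minimal sketch. It assumes an in-memory `queue.Queue` standing in for Redis Streams, SQS, or Kafka, and the names `NotificationEvent`, `handle_transfer_request`, and `send_via_provider` are hypothetical, introduced only for illustration. The point is that the request handler returns as soon as the event is enqueued, and only the worker talks to a provider.

```python
import queue
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class NotificationEvent:
    user_id: int
    channel: str  # "sms", "email", or "push"
    body: str


# In-memory stand-in for Redis Streams / SQS / Kafka (assumption for this sketch).
notification_queue = queue.Queue()
delivered = []


def handle_transfer_request(user_id, amount):
    """API handler: enqueue the event and return without calling any provider."""
    event = NotificationEvent(user_id, "sms", f"You sent {amount}")
    notification_queue.put(event)
    return {"status": "accepted"}


def send_via_provider(event):
    """Hypothetical provider call; a real worker would call Twilio, SendGrid, etc."""
    delivered.append(event)


def worker():
    """Drain the queue until a None sentinel arrives."""
    while True:
        event = notification_queue.get()
        try:
            if event is None:
                return
            send_via_provider(event)
        finally:
            notification_queue.task_done()


def run_demo():
    """Start one worker, accept three requests, then wait for the queue to drain."""
    threading.Thread(target=worker, daemon=True).start()
    responses = [handle_transfer_request(uid, "$25.00") for uid in range(3)]
    notification_queue.put(None)  # tell the worker to stop
    notification_queue.join()     # block until every queued item is processed
    return responses
```

In a real deployment the worker would run in a separate process or service, so a slow provider delays the queue rather than the API response.
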
Recommended technologies: Redis Streams, SQS, or Kafka.

The hardest problem in notification systems usually isn’t failed sends. It’s duplicate sends. Users are surprisingly tolerant of delayed notifications. They are not tolerant of receiving the same debit alert three times.

Retries are where duplicates usually happen. For example, a worker sends an SMS, the provider’s response times out, and the retry delivers the same message again. To prevent this, every notification should carry an idempotency key. Before sending, workers check: “Have we already processed this exact notification?”

Example constraint:

UNIQUE (user_id, notification_type, idempotency_key)

This is one of those small architectural decisions that saves massive operational pain later. Even if retries happen multiple times, the database becomes the final protection layer against duplicates.

Every delivery attempt should also be logged in a table such as notification_attempts. Not just successes: everything. Because when something goes wrong in production, you want evidence, not guesses.

A reality every senior engineer eventually learns: third-party providers fail more often than you expect. Twilio can degrade. SendGrid can throttle requests. FCM can delay pushes. The mistake is designing systems that assume providers are always available. Reliable systems assume failure is normal.

So instead of hardcoding a single provider, introduce a provider routing layer. The worker flow becomes: try the primary provider, fall back to a secondary provider if it fails, and log every attempt. Users shouldn’t notice your provider had a bad day. That’s the goal.

Retries sound simple until they start causing damage. Bad retry systems can create duplicate sends and make provider outages worse. A common mistake is retrying too aggressively. If Twilio is already struggling, hammering it with thousands of immediate retries only makes things worse.

Instead, use exponential backoff. Example:

Retry 1 → 30 seconds
Retry 2 → 2 minutes
Retry 3 → 10 minutes

This gives providers time to recover while keeping pressure manageable.

And after maximum retries? Move the message into a Dead Letter Queue (DLQ). That queue is basically your “something unusual happened here” bucket. At that point, engineers should be alerted.

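The pieces described above (the idempotency constraint, provider fallback, the retry schedule, and the dead letter queue) can be sketched together in one worker step. This is an illustration under stated assumptions, not a production design: in-memory SQLite stands in for PostgreSQL, the table name `notification_claims` and the functions `claim`, `send_with_fallback`, and `process` are hypothetical, and the delays are the 30 second, 2 minute, and 10 minute values from the example schedule.

```python
import sqlite3

# Retry delays from the text: 30 seconds, 2 minutes, 10 minutes.
BACKOFF_SECONDS = [30, 120, 600]

# In-memory SQLite stands in for PostgreSQL (assumption for this sketch).
db = sqlite3.connect(":memory:")
db.execute(
    """
    CREATE TABLE notification_claims (
        user_id INTEGER NOT NULL,
        notification_type TEXT NOT NULL,
        idempotency_key TEXT NOT NULL,
        UNIQUE (user_id, notification_type, idempotency_key)
    )
    """
)


def claim(job):
    """Return True only the first time this notification is seen."""
    try:
        with db:  # commits on success, rolls back on error
            db.execute(
                "INSERT INTO notification_claims VALUES (?, ?, ?)",
                (job["user_id"], job["type"], job["key"]),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the UNIQUE constraint rejected a duplicate


def send_with_fallback(job, providers):
    """Try providers in order; return the name that succeeded, or None."""
    for name, send in providers:
        try:
            send(job)
            return name
        except Exception:
            continue  # a real router would log the failed attempt here
    return None


def process(job, providers, dead_letters):
    """Run one attempt. Return seconds to wait before retrying, or None when finished."""
    if job["attempt"] == 0 and not claim(job):
        return None  # duplicate event, already handled
    if send_with_fallback(job, providers) is not None:
        return None  # delivered
    job["attempt"] += 1
    if job["attempt"] > len(BACKOFF_SECONDS):
        dead_letters.append(job)  # retries exhausted: park it in the DLQ
        return None
    return BACKOFF_SECONDS[job["attempt"] - 1]
```

Claiming the key before the first send means a second copy of the same event is dropped by the database rather than by application logic, while retries of the job that already holds the claim are still allowed to proceed.
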
One subtle issue in distributed systems: sometimes providers say “accepted” even though delivery eventually fails. That creates dangerous blind spots. A notification may look successful internally while the user never actually receives it.

This is why reconciliation jobs matter. Every few minutes, background jobs should scan for suspicious states, such as notifications stuck in "pending" for too long. Then:

→ Re-query provider APIs
→ Update delivery status
→ Retry if needed

These jobs quietly save systems from edge cases like a provider accepting a message that never arrives. A lot of reliability engineering is really just building systems that continuously self-correct.

Good notification systems are not just reliable. They’re respectful. Users should control how they’re contacted. A simple table, such as user_notification_settings, can dramatically improve user experience.

Rate limiting matters too. Without it, bugs or loops can become expensive very quickly. Imagine accidentally sending OTPs in a retry loop to thousands of users. Redis-based limits help protect against this. Example:

Max 3 SMS/hour/user

That protects both your users and your SMS bill.

At scale, invisible systems are dangerous systems. You need to know how delivery is behaving at all times. The most important metrics are usually boring operational ones, then business-level metrics, and finally alerts. Example:

Alert if SMS failure rate exceeds 5% for 2 minutes

The earlier you detect degradation, the smaller the incident becomes.

The biggest difference between systems that look reliable and systems that are reliable is failure testing. Because everything works in happy-path demos. The real question is: what happens when dependencies misbehave?

One useful strategy is shadow testing. Route a tiny percentage of production traffic through a new provider and compare results safely.

Chaos testing is also incredibly valuable. Example:

Intentionally fail 10% of Twilio requests in staging

That sounds scary initially. But it validates whether fallbacks, retries, and alerts actually work under failure. Reliable systems are engineered through controlled failure exposure.

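One way to try the chaos-testing idea in staging is to wrap a provider's send function so that a fixed fraction of calls fail on purpose. The sketch below is a rough illustration under assumptions: `ChaosProvider`, `ProviderError`, and `failure_count` are hypothetical names, and the wrapped `send` is any callable rather than a real Twilio client.

```python
import random


class ProviderError(Exception):
    """Raised when a provider call fails (hypothetical error type for this sketch)."""


class ChaosProvider:
    """Wrap a send function and fail a fixed fraction of calls on purpose."""

    def __init__(self, send, failure_rate=0.10, rng=None):
        self.send = send
        self.failure_rate = failure_rate
        # Accept a seeded generator so test runs can be reproduced.
        self.rng = rng if rng is not None else random.Random()

    def __call__(self, message):
        # random() returns a float in [0.0, 1.0), so a rate of 0.10
        # injects a failure in roughly 10% of calls.
        if self.rng.random() < self.failure_rate:
            raise ProviderError("injected failure")
        return self.send(message)


def failure_count(provider, attempts):
    """Call the provider `attempts` times and count injected failures."""
    failures = 0
    for i in range(attempts):
        try:
            provider(f"message-{i}")
        except ProviderError:
            failures += 1
    return failures
```

Placing a wrapper like this in front of the primary provider in staging lets you watch whether fallback, retry, and alerting paths actually fire, without waiting for a real outage.
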
What makes this architecture resilient is that it assumes bad things will happen. Because eventually providers degrade, requests get throttled, and pushes get delayed. The system survives because reliability is built into the architecture itself. By combining queues, idempotency keys, provider fallback, backoff with a dead letter queue, reconciliation jobs, and monitoring, the platform continues operating even during partial outages and heavy traffic spikes.

And in fintech, reliability isn’t just infrastructure quality. It directly affects user trust.

Most notification systems work during normal traffic. That’s not the hard part. The hard part is surviving traffic spikes, provider outages, and retries gone wrong. That’s where architecture starts to matter. Because users rarely remember the notifications that worked. They remember the moments when communication failed during something important.