{"slug": "designing-a-reliable-notification-system-for-1m-users-push-sms-email", "title": "Designing a Reliable Notification System for 1M+ Users (Push, SMS, Email)", "summary": "The article explains that for fintech platforms with over 1 million users, notification systems must be designed as resilient distributed systems rather than simple API calls. It recommends using asynchronous message queues (like Redis Streams, SQS, or Kafka) to decouple request handling from delivery, implementing idempotency keys to prevent duplicate sends, and employing a provider router with fallback options to handle third-party failures gracefully.", "body_md": "In fintech, notifications are not a “nice-to-have” feature.\nThey’re part of the product’s trust layer.\nIf a user transfers money and doesn’t get a confirmation, they panic.\nIf an OTP arrives 3 minutes late, login fails.\nIf price alerts come twice, users lose confidence fast.\nAt small scale, sending notifications feels simple:\nApplication → Twilio/SendGrid → Done\nBut once you’re dealing with millions of users, multiple channels, retries, provider outages, and traffic spikes… notification systems become distributed systems problems.\nAnd distributed systems are mostly about handling failure gracefully.\nImagine a fintech platform sending:\n…to over 1 million users across:\nThe challenge isn’t just “sending messages.”\nThe real challenge is making sure the system:\nHere’s the architecture I’d use.\n[App Service]\n│\n▼\n[Notification Queue (Redis Streams / SQS / Kafka)]\n│\n▼\n[Worker Pool]\n│\n▼\n[Provider Router]\n│\n┌────┼───────────────────────────────┐\n▼ ▼ ▼\nSMS Email Push\n│ │ │\n▼ ▼ ▼\nTwilio SendGrid FCM/APNs\n│\n▼\nFallback Providers\n(Termii / Mailgun / Direct APNs)\n│\n▼\n[Delivery Log + Idempotency Store]\n(PostgreSQL + Redis)\nOne of the biggest mistakes teams make early on is sending notifications directly from the API request cycle.\nThat works… until traffic spikes.\nImagine Black Friday, a crypto market 
crash, or salary payment day.\nSuddenly, millions of notifications need to go out almost at once.\nIf your application waits for Twilio or SendGrid to respond before returning a response to the user, your entire app becomes hostage to external providers.\nThat’s dangerous.\nInstead, the API should do one thing:\nAccept the request quickly and push a notification event into a queue.\nFrom there, worker processes handle delivery asynchronously.\nThis changes the system completely.\nQueues give you:\nIf providers slow down, the queue absorbs the spike instead of crashing your application.\nAt this scale, queues stop being optional infrastructure.\nThey become the safety buffer protecting the rest of your system.\nRecommended technologies:\nThe hardest problem in notification systems usually isn’t failed sends.\nIt’s duplicate sends.\nUsers are surprisingly tolerant of delayed notifications.\nThey are not tolerant of receiving the same debit alert three times.\nRetries are where duplicates usually happen.\nExample:\nTo prevent this, every notification should carry an idempotency_key.\nBefore sending, workers check:\n“Have we already processed this exact notification?”\nExample constraint:\nUNIQUE(user_id, notification_type, idempotency_key)\nThis is one of those small architectural decisions that saves massive operational pain later.\nEven if retries happen multiple times, the database becomes the final protection layer against duplicates.\nEvery delivery attempt should also be logged.\nNot just successes, but everything.\nnotification_attempts\nIncluding:\nBecause when something goes wrong in production, you want evidence, not guesses.\nA reality every senior engineer eventually learns:\nThird-party providers fail more often than you expect.\nTwilio can degrade.\nSendGrid can throttle requests.\nFCM can delay pushes.\nThe mistake is designing systems that assume providers are always available.\nReliable systems assume failure is normal.\nSo instead of hardcoding a 
single provider, introduce a provider routing layer.\nThe worker flow becomes:\nUsers shouldn’t notice your provider had a bad day.\nThat’s the goal.\nRetries sound simple until they start causing damage.\nBad retry systems can:\nA common mistake is retrying too aggressively.\nIf Twilio is already struggling, hammering it with thousands of immediate retries only makes things worse.\nInstead, use exponential backoff.\nExample:\nRetry #1 → 30 seconds\nRetry #2 → 2 minutes\nRetry #3 → 10 minutes\nThis gives providers time to recover while keeping pressure manageable.\nAnd after maximum retries?\nMove the message into a Dead Letter Queue (DLQ).\nThat queue is basically your “something unusual happened here” bucket.\nAt that point, engineers should be alerted.\nOne subtle issue in distributed systems:\nSometimes providers say “accepted” even though delivery eventually fails.\nThat creates dangerous blind spots.\nA notification may look successful internally while the user never actually receives it.\nThis is why reconciliation jobs matter.\nEvery few minutes, background jobs should scan for suspicious states:\nNotifications stuck in \"pending\" for too long\nThen:\n→ Re-query provider APIs\n→ Update delivery status\n→ Retry if needed\nThese jobs quietly save systems from edge cases caused by:\nA lot of reliability engineering is really just building systems that continuously self-correct.\nGood notification systems are not just reliable.\nThey’re respectful.\nUsers should control how they’re contacted.\nExamples:\nSimple table:\nuser_notification_settings\n…can dramatically improve user experience.\nRate limiting matters too.\nWithout it, bugs or loops can become expensive very quickly.\nImagine accidentally sending OTPs in a retry loop to thousands of users.\nRedis-based limits help protect against this.\nExample:\nMax 3 SMS/hour/user\nThat protects:\nAt scale, invisible systems are dangerous systems.\nYou need to know:\nThe most important metrics are usually boring 
operational ones.\nThen business-level metrics:\nAnd finally: alerts.\nExample:\nAlert if SMS failure rate exceeds 5% for 2 minutes\nThe earlier you detect degradation, the smaller the incident becomes.\nThe biggest difference between systems that look reliable and systems that are reliable is failure testing.\nBecause everything works in happy-path demos.\nThe real question is:\nWhat happens when dependencies misbehave?\nOne useful strategy is shadow testing.\nRoute a tiny percentage of production traffic through a new provider and compare results safely.\nExample:\nChaos testing is also incredibly valuable.\nExample:\nIntentionally fail 10% of Twilio requests in staging\nThat sounds scary initially.\nBut it validates whether:\nReliable systems are engineered through controlled failure exposure.\nWhat makes this architecture resilient is that it assumes bad things will happen.\nBecause eventually:\nThe system survives because reliability is built into the architecture itself.\nBy combining:\n…the platform continues operating even during partial outages and heavy traffic spikes.\nAnd in fintech, reliability isn’t just infrastructure quality.\nIt directly affects user trust.\nMost notification systems work during normal traffic.\nThat’s not the hard part.\nThe hard part is surviving:\nThat’s where architecture starts to matter.\nBecause users rarely remember the notifications that worked.\nThey remember the moments when communication failed during something important.", "url": "https://wpnews.pro/news/designing-a-reliable-notification-system-for-1m-users-push-sms-email", "canonical_source": "https://dev.to/ejiro/designing-a-reliable-notification-system-for-1m-users-push-sms-email-2i39", "published_at": "2026-05-24 00:54:07+00:00", "updated_at": "2026-05-24 01:01:38.667453+00:00", "lang": "en", "topics": ["developer-tools", "cloud-computing", "data", "startups", "enterprise-software"], "entities": ["Twilio", "SendGrid", "Redis", "Kafka", "PostgreSQL", "FCM", "APNs", 
"Mailgun"], "alternates": {"html": "https://wpnews.pro/news/designing-a-reliable-notification-system-for-1m-users-push-sms-email", "markdown": "https://wpnews.pro/news/designing-a-reliable-notification-system-for-1m-users-push-sms-email.md", "text": "https://wpnews.pro/news/designing-a-reliable-notification-system-for-1m-users-push-sms-email.txt", "jsonld": "https://wpnews.pro/news/designing-a-reliable-notification-system-for-1m-users-push-sms-email.jsonld"}}