{"slug": "designing-tiktok-from-scratch-a-system-design-deep-dive", "title": "Designing TikTok from Scratch — A System Design Deep Dive", "summary": "A system design deep dive into TikTok's architecture reveals a platform handling over 1 billion monthly active users, 34 million daily video uploads, and roughly 31 Tbps of average egress bandwidth, with a target P99 feed latency of 167ms. The platform's infrastructure is divided into four domains—ingestion, serving, recommendation, and social graph—with approximately 70% of video traffic served directly from edge nodes across 150+ cities using Anycast routing. Key technical components include chunked multi-part uploads with SHA-256 deduplication, a two-tower neural network for recommendation, and asynchronous communication via Kafka for non-critical paths.", "body_md": "Who is this for? Mid-to-senior engineers preparing for system design interviews, or anyone curious how a short-video platform at billion-user scale actually works under the hood.\n\n| Metric | Number |\n|---|---|\n| Monthly active users | 1B+ |\n| Videos uploaded per day | ~34 million |\n| Target feed latency (P99) | ~167ms |\n| Average egress bandwidth | ~31 Tbps |\n\nBefore drawing a single box, nail down what the system must do — and what it doesn't need to do perfectly on day one.\n\n**Functional requirements:**\n\n**Non-functional requirements:**\n\nThe system splits into four major domains: **ingestion** (upload pipeline), **serving** (read path), **recommendation** (ML feed), and **social graph**.\n\n```\n┌─────────────────────────────────────────────────┐\n│              Mobile / Web Clients                │\n└─────────────────────┬───────────────────────────┘\n                      │\n┌─────────────────────▼───────────────────────────┐\n│         Global CDN / Edge PoPs                   │\n│   Video delivery, static assets, geo-routing    │\n└─────────────────────┬───────────────────────────┘\n                      
│\n┌─────────────────────▼───────────────────────────┐\n│       API Gateway + Load Balancer                │\n│   Auth, rate limiting, routing, TLS termination │\n└────────┬────────────┴────────────────┬──────────┘\n         │                             │\n   ┌─────▼──────┐  ┌──────────────┐  ┌▼────────────────┐\n   │  Upload    │  │ Feed Service │  │  Social Graph   │\n   │  Service   │  │(pre-compute  │  │    Service      │\n   │            │  │ + real-time) │  │                 │\n   └─────┬──────┘  └──────┬───────┘  └┬────────────────┘\n         │                │            │\n   ┌─────▼──────┐  ┌──────▼───────┐  ┌▼────────────────┐\n   │ Transcode  │  │Recommendation│  │  Notification   │\n   │  Workers   │  │   Engine     │  │    Service      │\n   └─────┬──────┘  └──────┬───────┘  └┬────────────────┘\n         │                │            │\n   ┌─────▼──────┐  ┌──────▼───────┐  ┌▼────────────────┐\n   │  Object    │  │ Feature Store│  │  Search Service │\n   │  Storage   │  │(Redis+Cassie)│  │ (Elasticsearch) │\n   └─────┬──────┘  └──────┬───────┘  └┬────────────────┘\n         │                │            │\n┌────────▼────────────────▼────────────▼──────────────┐\n│              Async Message Bus (Kafka)               │\n└──────────┬──────────────┬──────────────┬────────────┘\n           │              │              │\n    ┌──────▼─────┐ ┌──────▼────┐ ┌──────▼──────┐\n    │MySQL/Vitess│ │   Redis   │ │  Cassandra  │\n    │(user data, │ │ (counters,│ │ (timelines, │\n    │ metadata)  │ │  cache)   │ │  history)   │\n    └────────────┘ └───────────┘ └─────────────┘\n```\n\n*All services communicate asynchronously via Kafka for non-critical paths.*\n\nTikTok's secret weapon. **~70% of video traffic** is served directly from edge nodes in 150+ cities, bypassing origin entirely. It uses Anycast routing to send users to the nearest PoP. 
Manifest files (playlist URLs) are invalidated within seconds of a video going viral.\n\nChunked multi-part upload (5 MB chunks) tolerates flaky mobile connections. Workers deduplicate by `SHA-256` hash before writing. Transcode jobs run on GPU fleets; outputs include `360p`, `720p`, `1080p`, and HEVC variants. Thumbnails and stills are extracted for ML feature generation.\n\nA **two-tower neural network**:\n\nThe dot product of the two towers' embeddings gives a relevance score. The model runs online for top-k retrieval, then a ranker applies real-time signals (trending, friend activity) before the feed is assembled.\n\nThis is where TikTok differs from Twitter/Instagram:\n\nThe feed service merges both lists, injects ML-recommended videos, and applies diversity rules to avoid repetition. The final feed is cached in Redis with a `300s` TTL.\n\nAll write events (upload complete, like, follow, watch-complete) are published to Kafka topics. Downstream consumers include:\n\nTopics are partitioned by `user_id` for ordered processing per user. This decouples services and allows independent scaling.\n\n| Store | Use Case | Why |\n|---|---|---|\n| MySQL / Vitess | User profiles, video metadata, social graph | ACID, sharded by `user_id` |\n| Redis Cluster | Counters (likes, views), session tokens, feed cache | Sub-millisecond reads |\n| Cassandra | Watch history, timelines, notification logs | Wide-row reads, high write throughput |\n\nThe classic dilemma in social feed systems. TikTok uses a **hybrid approach** (the \"celebrity problem\" split):\n\n**Fan-out on write** (for regular users):\n\n**Fan-out on read** (for accounts with millions of followers):\n\nLike/view counts can lag by a few seconds — nobody notices. But user authentication tokens and billing events require **strong consistency**. 
TikTok segments these into separate storage tiers with different consistency guarantees, accepting complexity for throughput on hot paths.\n\nLikes and comments use **WebSocket push** for real-time delivery. Less critical notifications (weekly summaries, suggested follows) use a **pull-based batch pipeline** that runs every few hours — no need to maintain a persistent connection for a weekly digest email.\n\nAssumptions: 1B MAU, 500M DAU, avg user watches 45 min/day, avg video = 30 sec ~= 8 MB (720p). 34M uploads/day ~= 400 uploads/sec on average.\n\n**Storage:**\n\n```\n34M uploads/day x 8 MB x 3 resolutions = ~816 TB/day of new video\nWith 3x replication over 5 years = ~4.5 EB total raw storage\n```\n\n**Feed reads:**\n\n```\n500M DAU x 20 feed refreshes/day / 86,400 sec = ~115,000 feed reads/sec\nWith 95% Redis cache hit rate -> recommendation backend sees ~5,750 rps\n```\n\n**Bandwidth:**\n\n```\n500M users x 45 min x 60 sec/min x 2 Mbps (720p) / 86,400 sec = ~31 Tbps average egress\n```\n\nThis is why TikTok operates its own backbone in many regions and has deep-peering agreements with major ISPs.\n\nMost social platforms optimize for social graph traversal — *show me what people I follow posted*. TikTok inverted this: **the algorithm is the product**. The architecture is built around a recommendation pipeline that must be both blazing-fast and constantly learning from watch signals.\n\nThree things stand out:\n\n**Aggressive edge caching** — they push video delivery as close to the user as physically possible. The CDN is not a performance optimization; it is the entire delivery strategy.\n\n**Real-time ML feedback loops** — a video's trajectory is decided in the first 30 minutes based on completion rate signals. 
A new creator can go viral without any followers.\n\n**Microservice isolation** — upload, serving, recommendation, and social graph are independently deployable and scalable, preventing any single bottleneck from cascading.\n\nIf you're using this for a system design interview:", "url": "https://wpnews.pro/news/designing-tiktok-from-scratch-a-system-design-deep-dive", "canonical_source": "https://dev.to/danikeya/designing-tiktok-from-scratch-a-system-design-deep-dive-57j8", "published_at": "2026-05-25 22:24:05+00:00", "updated_at": "2026-05-25 23:03:27.882547+00:00", "lang": "en", "topics": ["machine-learning", "ai-infrastructure", "ai-products", "ai-tools", "mlops"], "entities": ["TikTok"], "alternates": {"html": "https://wpnews.pro/news/designing-tiktok-from-scratch-a-system-design-deep-dive", "markdown": "https://wpnews.pro/news/designing-tiktok-from-scratch-a-system-design-deep-dive.md", "text": "https://wpnews.pro/news/designing-tiktok-from-scratch-a-system-design-deep-dive.txt", "jsonld": "https://wpnews.pro/news/designing-tiktok-from-scratch-a-system-design-deep-dive.jsonld"}}