Search Autocomplete Systems — Complete Guide

A developer's guide explains how to build a search autocomplete system using a Trie data structure with top-K caching to meet strict latency requirements. The system must handle billions of queries per day with sub-100ms response times by precomputing and storing the most relevant completions at each Trie node.

An autocomplete system predicts and suggests query completions as a user types, character by character. The goal is to reduce typing effort, surface popular or relevant queries, and improve UX through speed. Real-world examples: Functional requirements: Given a prefix typed by a user, return the top-K typically 5–10 most relevant completions within strict latency SLA under 100ms end-to-end . Non-functional requirements: 99.99% availability, horizontal scalability to billions of queries/day, eventual consistency for frequency updates stale-by-minutes is acceptable , sub-50ms p99 response time at the API layer, multi-region support. The hardest constraint is the latency budget . At Google scale, every character keystroke fires a request. A user typing "machine learning" fires 16 requests. With millions of concurrent users, this means tens of millions of requests per second with a hard sub-100ms SLA.--- A Trie prefix tree is a tree data structure where each node represents a single character. A path from root to any node spells out a prefix. A path to a marked node spells a complete word. Why Trie for autocomplete? Lookup by prefix is O L where L is prefix length — independent of vocabulary size. With a hash map you'd need to scan all keys. With a sorted array you'd binary search but can't retrieve all matching extensions efficiently. The Trie's structure literally mirrors the problem: navigating down the tree IS the prefix search. Node structure: TrieNode { children: Map<char, TrieNode // up to 26 for a-z, more for Unicode isEndOfWord: bool frequency: int // how many times this word was searched topK: List<{word, freq} // cached top-K at this node } Operations: insert word, freq — traverse/create nodes character by character, mark isEndOfWord=true , set frequency. O L . search prefix — traverse to the prefix node. If not found, return . Otherwise collect all words in the subtree rooted there. O L + N where N = subtree size. delete word — traverse to the word's end node, unmark isEndOfWord . Optionally prune leaf nodes upward. O L . Complexity summary: | Operation | Time | Space | |---|---|---| | Insert | O L | O L × alphabet | | Search by prefix | O L + N | O 1 extra | | Delete | O L | O 1 | | Full Trie | — | O N × L × alphabet | Advantages: Fast prefix lookup, natural prefix grouping, supports wildcard and fuzzy variants. Disadvantages: High memory — each node holds up to 26+ child pointers. For 100M terms with avg length 10, naive Trie needs tens of GB of RAM. This forces compression and sharding. When a user types a prefix q : q . isEndOfWord=true descendants.The naive version of step 3 has O subtree size cost. For popular single-character prefixes like "a", the subtree could contain millions of words — completely unacceptable. This is why we need Top-K caching at each node. Top-K means: for any given prefix, return only the K most relevant completions. The key insight is to precompute and cache top-K at every Trie node rather than traversing the full subtree at query time. Ranking strategies: score = freq × e^ -λ × days ago . Captures trending topics. Storing Top-K in Trie nodes: Each node stores a list of up to K {word, score} pairs representing the best completions reachable from that node. When a new word is inserted or a frequency updated, propagate changes upward: each ancestor node merges the new candidate into its top-K list. Min Heap vs Max Heap: Use a Min Heap of size K to maintain top-K efficiently. When processing a new candidate: This gives O log K per insertion. Final top-K extraction is O K log K . For Max Heap : better when you need to extract elements in descending order one at a time streaming output . But for building top-K over a large set, the Min Heap approach is more memory-efficient.With Top-K cached at every node, a query for prefix q becomes O L — traverse L nodes, read the precomputed top-K list. No subtree scan needed. The Trie alone is still insufficient. Even with O L lookup, the Trie lives in one process's memory. At Google scale, you need distributed caching to handle millions of QPS without hammering Trie servers. Cache-per-prefix architecture: For each prefix string e.g., "a", "ap", "app" , cache the top-K result list in Redis. Cache key = the prefix string. Cache value = JSON array of top-K suggestions. cache "a" → "apple", "amazon", "airbnb", "android", "api" cache "ap" → "apple", "api", "app store", "apex", "apply" cache "app" → "app store", "apple music", "apple id", "application", "applebee's" markdown Cache key design: Use a consistent lowercase-normalized prefix as the key. For Redis: autocomplete:v2:{prefix} — version-prefix allows atomic cache invalidation on Trie rebuild. Cache invalidation strategies: autocomplete:v2: to autocomplete:v3: . Old keys expire via TTL. Zero-downtime transition. Cache hit rates in practice: The prefix "t" gets millions of hits/hour — stays in L1 memory of Redis. "the quick brown" — rare, likely a cache miss. This power-law distribution means ~20% of prefixes handle 80% of traffic. Cache them aggressively; let rare prefixes fall through to the Trie service. Cache stampede problem: When a popular cache key expires, thousands of concurrent requests all miss and hammer the Trie simultaneously. Solutions: 1 probabilistic early expiry — refresh the key before it expires with small probability; 2 mutex/lock on cache miss — first thread computes, others wait; 3 background refresh — a worker proactively refreshes top-N prefixes before TTL expires. The flow: search-events . INCR or a dedicated counter service like Druid . Real-time vs batch tradeoff: | Real-time updates | Batch updates | | |---|---|---| | Frequency freshness | Seconds | Hours | | Trie write cost | High every search modifies Trie | Low rebuild once per cycle | | Complexity | Very high distributed locks, race conditions | Moderate | | Production choice | Only for trending/signals layer | Yes, for core Trie | At Google scale, real-time Trie mutation is impractical — millions of writes/second with propagation to ancestor nodes creates massive contention. The industry standard is batch rebuilds with a real-time trending overlay. Incremental update strategy: Rather than rebuilding from scratch hourly, maintain a delta log. Apply only changed frequencies. For a word whose frequency changed from 500 to 600, update the affected nodes and re-sort top-K lists upward. Cost: O L × K log K per updated word. The core principle: never update the live Trie on every search . Instead, batch-process search logs periodically and rebuild the Trie or apply deltas offline, then hot-swap the new version atomically. ETL Pipeline: Blue-Green deployment of Trie versions: At any time, two Trie instances exist — "blue" live and "green" rebuilding . Once green passes validation, a routing config update atomically shifts traffic. Blue becomes the rollback target. If green shows anomalies regressions, coverage drops , one config change rolls back in seconds. Zero downtime, zero query loss. Production cadence: Most systems run hourly rebuilds during low-traffic windows, with a "trending override" layer that injects real-time signals last 15 minutes of search spikes on top of the hourly Trie. Assumptions: 1B searches/day = ~11,500 QPS average. With peak factor of 3× = ~35,000 QPS peak. 100M unique terms. p99 latency < 100ms. Capacity math at scale: Sharding strategy: Shard by first 2 characters of prefix. "aa"–"am" → shard 1, "an"–"az" → shard 2, etc. Each shard handles a fraction of the prefix space. A consistent hashing ring see Section 9 handles rebalancing when nodes are added. High availability: Each Trie shard has 3 replicas leader + 2 followers . Reads hit any replica. Writes new Trie deployment go to leader, sync to followers. If leader dies, a follower is promoted within seconds via Raft consensus. Disaster recovery: Cross-region replication of Trie snapshots to S3/GCS. RPO = last hourly snapshot. RTO = ~5 minutes to restore from snapshot into a fresh cluster. A single Trie for 100M terms with top-K lists is 50GB+ and handles millions of QPS — impossible to run on one machine with acceptable latency and reliability. Sharding by prefix range-based : Shard 0: a–f prefixes starting with a,b,c,d,e,f Shard 1: g–m Shard 2: n–s Shard 3: t–z + digits Simple, predictable routing. A request for prefix "apple" always goes to shard 0. Problem: Hot prefixes. "a" is searched far more than "x". Shard 0 gets 10× the traffic of shard 3 — this is the "hot shard" problem. Sharding by hash: Hash the prefix string to determine shard. shard = hash "apple" % N . Distributes load evenly. Problem: Two-character prefixes "ap" and "app" land on different shards, breaking prefix locality. Every request requires a broadcast to all shards and result merging — expensive. Consistent hashing: Place shards on a virtual ring. Map prefix to a point on the ring; the shard clockwise from that point owns it. Adding/removing shards only remaps a fraction of keys. Supports virtual nodes for load balancing. Best for prefix-range sharding where you want smooth rebalancing. Production recommendation: Shard by first 2 characters using a weighted partition scheme give "a-" more shards than "x-" proportional to expected traffic . Replicate each shard 3×. Use consistent hashing for shard assignment to enable elastic scaling. | Data Structure | Memory | Lookup | Best For | Weakness | |---|---|---|---|---| | Standard Trie | High 26 pointers/node | O L | Teaching, small vocab | Memory bloat | | Compressed Trie Radix | Moderate | O L | Production prefix search | Complex to implement | | Patricia Trie | Low | O L | IP routing, binary keys | Complex insert | | Ternary Search Tree | Moderate | O L + log N | Spell check, ordered ops | Slower than Trie | | FST Finite State Transducer | Very low shared suffixes | O L | Lucene, production NLP | Immutable once built | | Elasticsearch/Lucene | High inverted index | O log N | Full-text search + autocomplete | Overkill for pure prefix | FST Finite State Transducer is what Lucene/Elasticsearch uses internally. It compresses both shared prefixes AND shared suffixes into a directed acyclic graph. Memory usage is 10–100× smaller than a standard Trie for large vocabularies. The tradeoff: building an FST is expensive, and the structure is typically immutable — you rebuild on update. Radix Tree is the practical production choice for custom implementations. It compresses chains of single-child nodes the path "a→p→p" in standard Trie becomes one node "app" . Memory savings ~10× for English vocabulary. Lookups are the same O L . Elasticsearch brings typo-tolerance edit distance matching , language analysis, scoring with TF-IDF or BM25, and horizontal scaling out of the box. The cost is significantly higher latency 10–50ms and memory footprint. For pure autocomplete no full-text , it's often over-engineered. For search that blends autocomplete with full-text ranking GitHub's code search, Slack's message search , it's the right choice. Hot prefixes: The prefix "the" gets astronomically more traffic than "xyz". The CDN edge layer handles this by caching top-200K prefixes globally, reducing origin traffic by 95%+. For the remaining hot prefixes that miss CDN, the Redis layer absorbs the load. Cache stampede: When a cached popular prefix expires, thousands of concurrent threads miss simultaneously. Solution: probabilistic early expiry. With probability p = max 0, remaining TTL - beta × log rand / TTL , proactively refresh. This spreads the refresh work over the TTL window rather than concentrating it at expiry. Memory consumption: Raw Tries use too much RAM. Solutions: 1 serialize Tries to compact binary format FST-style , 2 store only top-K completions per node discard the full word list , 3 shard aggressively to distribute memory across many nodes. Typo tolerance: Pure Trie cannot handle "appel" → "apple". Solutions: 1 generate edit-distance-1 candidates client-side and query each expensive , 2 use a separate spell-correction layer SymSpell, BK-trees before the Trie lookup, 3 use Elasticsearch with fuzzy query for typo-tolerant autocomplete. International languages: UTF-8 prefixes mean up to 65,000+ possible "characters" per node. Solutions: 1 store children as a HashMap not a fixed 26-element array , 2 use Unicode normalization NFC/NFD to collapse accented variants, 3 use language-specific tokenization Chinese/Japanese use character-level Tries since there are no spaces . Personalization: Cannot store personal results in the global Trie. Architecture: 1 server returns top-K global suggestions plus top-K personal suggestions from a lightweight personal history store user's last N searches in a sorted set ; 2 client-side merging re-ranks combined list by a blend of global and personal scores. How this appears in FAANG interviews: Typically given as "Design Google Search Autocomplete" or "Design a type-ahead feature for Twitter." You have 45 minutes. Interviewers want to see that you progress from Trie basics to scale concerns to production tradeoffs. Expected progression: Common follow-up questions: Common mistakes candidates make: Senior/Architect-level expectations: You should bring up: cache stampede prevention, blue-green Trie deployment, FST vs Trie tradeoffs, the real-time trending layer architecture, and the distinction between global and personalized suggestions. Let's implement four layers: basic Trie, Trie with frequency, Trie with Top-K, and cache integration.PHP isn't available in this environment, but the code is complete and correct. Let me copy all files to the output directory so you can download and run them.Run locally with php demo.php — requires PHP 8.1+. The demo will show cache hit vs miss timing, top-K reranking after frequency updates, and cache version bumping. Here's the mental model to carry into any interview: The system has two completely separate paths — a read path that must be blindingly fast, and a write path that can be slow and batch-oriented. Read path: User keystroke → CDN first line of defense, handles 60–70% of traffic → Load Balancer → Stateless API Server normalizes, debounces → Redis Cache lookup 95%+ hit rate → on miss, Trie Service in-memory, sharded by prefix, replicated . The Trie node itself returns pre-computed top-K in O L time. Total: under 100ms. Write path: Every completed search → Kafka → Spark frequency aggregation hourly window → Trie builder inserts with top-K propagation → Validation → Blue-Green deployment swap + Redis version bump. Runs hourly. Trending signals can be overlaid in near-real-time as a separate lightweight layer. The three insights that separate senior answers: