Building Typo-Tolerant Multi-Language Video Search with OpenSearch and PHP

At TopVideoHub, a video aggregator pulling trending content from nine Asia-Pacific regions every four hours, engineers migrated from SQLite FTS5 to OpenSearch to implement typo-tolerant search across Japanese, Korean, Mandarin, Vietnamese, Thai, and English titles. The original setup collapsed on misspellings like "blakpink" for "blackpink" or "Aimyon" for "あいみょん," while OpenSearch's multi-field mapping with ICU, kuromoji, nori, and smartcn analyzers now handles fuzziness and language-aware tokenization on a single-node deployment.

When you run a video aggregator that pulls trending content from nine Asia-Pacific regions every four hours, search becomes the place where users either find what they want or leave for YouTube directly. At TopVideoHub https://topvideohub.com we aggregate clips across Japanese, Korean, Mandarin, Vietnamese, Thai, and English titles in the same index, and our original SQLite FTS5 setup with a CJK tokenizer worked well for exact substring matches but collapsed the moment a user typed blakpink instead of blackpink , or searched for Aimyon when our database stored あいみょん . This post walks through the migration we did from SQLite FTS5 to OpenSearch for typo-tolerant search, while keeping FTS5 as a fallback. Running on a budget LiteSpeed host means we could not just throw a five-node Elasticsearch cluster at the problem — we had to be careful about heap pressure, indexing latency, and how Cloudflare caches the search responses. SQLite FTS5 with the unicode61 tokenizer plus a CJK bigram tokenizer handles tokenization across scripts well enough. Our migration looked like this: CREATE VIRTUAL TABLE videos fts USING fts5 title, channel name, description, tokenize = 'unicode61 remove diacritics 2', content = 'videos', content rowid = 'id' ; This gave us fast prefix and substring matching. The problem is that FTS5 does not implement fuzziness in the Damerau-Levenshtein sense. If a user searched for blakpink , FTS5 returned zero results because the trigram split bla , lak , akp , kpi , pin , ink shared almost no overlap with the indexed blackpink tokens. We tried compensating with edge n-grams stored as a denormalized column, then with manual misspelling synonym dictionaries. Both approaches worked for narrow cases but exploded index size and required hand-maintained mappings for every new artist or trending phrase. For CJK content, n-grams of length 1-2 produced massive recall but garbage precision — searching for 新 returned 40,000 plus videos because the character appears in roughly every fourth Chinese title. The decision point was clear. We needed an engine that supports proper fuzziness with edit distance, language-aware analyzers, and the ability to weight matches across multiple fields. OpenSearch fit the bill, especially because we could run a single-node deployment on a small VPS we already had provisioned for blog ingestion. Self-hosted, no managed-service bill, and the OpenSearch 2.x line ships with the analyzers we needed out of the box. The first real engineering decision was the index mapping. CJK languages do not use whitespace as a token boundary, so the standard analyzer is useless for Japanese, Chinese, and Korean. OpenSearch ships with the ICU analysis plugin, and there are well-maintained kuromoji Japanese , nori Korean , and smartcn Chinese plugins. Rather than maintain one analyzer per language and a router on the application side, we used the multi-fields pattern. The same source text is indexed under several analyzers and we search them with multi match . The query engine picks the best match per shard, and we score-weight them on the application side. Here is the mapping we settled on after about two weeks of tuning: { "settings": { "number of shards": 1, "number of replicas": 0, "analysis": { "analyzer": { "title standard": { "type": "custom", "tokenizer": "icu tokenizer", "filter": "icu folding", "lowercase" }, "title edge ngram": { "type": "custom", "tokenizer": "icu tokenizer", "filter": "icu folding", "lowercase", "edge ngram filter" }, "title cjk bigram": { "type": "custom", "tokenizer": "icu tokenizer", "filter": "cjk bigram", "lowercase" } }, "filter": { "edge ngram filter": { "type": "edge ngram", "min gram": 2, "max gram": 15 } } } }, "mappings": { "properties": { "video id": { "type": "keyword" }, "title": { "type": "text", "analyzer": "title standard", "fields": { "edge": { "type": "text", "analyzer": "title edge ngram", "search analyzer": "title standard" }, "cjk": { "type": "text", "analyzer": "title cjk bigram" }, "keyword": { "type": "keyword", "ignore above": 256 } } }, "channel name": { "type": "text", "analyzer": "title standard" }, "region": { "type": "keyword" }, "category id": { "type": "integer" }, "published at": { "type": "date" }, "view count": { "type": "long" }, "duration seconds": { "type": "integer" } } } } A few design choices worth explaining: number of replicas to zero because a single-node deployment cannot replicate, and the default of one leaves the cluster in yellow status forever and breaks health-check scripts. icu folding handles diacritic stripping for Vietnamese hóa becomes hoa and width normalization for fullwidth Latin ＨＥＬＬＯ becomes hello . cjk bigram filter splits CJK characters into overlapping pairs, so 日本語 indexes as 日本 , 本語 . This is the standard approach when you do not want to ship and version a heavy morphological dictionary. search analyzer differs from analyzer on the edge subfield. We index with edge n-grams but search without them. Otherwise every search term would also explode into n-grams on the query side, and the relevance scoring would be unusable.Our existing fetch cron pulls from the YouTube Data API every 2-7 hours depending on the site, normalizes the payload, and writes to SQLite. The OpenSearch indexing hooks into the same path as a fire-and-forget bulk step at the end of each fetch cycle. If OpenSearch is unavailable, we log and move on — SQLite remains the source of truth. php <?php declare strict types=1 ; namespace App\Search; final class OpenSearchIndexer { private const ENDPOINT = 'http://127.0.0.1:9200'; private const INDEX = 'videos v3'; private const BATCH SIZE = 200; public function construct private readonly \PDO $db, private readonly \App\Logger $log, {} public function indexBatch array $videoIds : int { if $videoIds === { return 0; } $placeholders = implode ',', array fill 0, count $videoIds , '?' ; $stmt = $this- db- prepare "SELECT id, video id, title, channel name, region, category id, published at, view count, duration seconds FROM videos WHERE id IN $placeholders " ; $stmt- execute $videoIds ; $bulk = ''; $count = 0; while $row = $stmt- fetch \PDO::FETCH ASSOC { $bulk .= json encode 'index' = ' index' = self::INDEX, ' id' = $row 'video id' , , , JSON THROW ON ERROR . "\n"; $bulk .= json encode 'video id' = $row 'video id' , 'title' = $row 'title' , 'channel name' = $row 'channel name' , 'region' = $row 'region' , 'category id' = int $row 'category id' , 'published at' = $row 'published at' , 'view count' = int $row 'view count' , 'duration seconds' = int $row 'duration seconds' , , JSON THROW ON ERROR | JSON UNESCAPED UNICODE . "\n"; $count++; } $response = $this- postBulk $bulk ; if $response === null { $this- log- warn 'opensearch.bulk.failed', 'count' = $count ; return 0; } if empty $response 'errors' { $this- logBulkErrors $response 'items' ; } return $count; } private function postBulk string $payload : ?array { $ch = curl init self::ENDPOINT . '/ bulk' ; curl setopt array $ch, CURLOPT POST = true, CURLOPT POSTFIELDS = $payload, CURLOPT HTTPHEADER = 'Content-Type: application/x-ndjson' , CURLOPT RETURNTRANSFER = true, CURLOPT TIMEOUT = 8, CURLOPT CONNECTTIMEOUT = 2, ; $body = curl exec $ch ; $code = curl getinfo $ch, CURLINFO HTTP CODE ; curl close $ch ; if $code == 200 || $body === false { return null; } return json decode $body, true, 512, JSON THROW ON ERROR ; } private function logBulkErrors array $items : void { $failures = ; foreach $items as $item { $op = $item 'index' ?? $item 'create' ?? null; if $op === null || $op 'status' ?? 200 < 400 { continue; } $failures = 'id' = $op ' id' ?? null, 'status' = $op 'status' , 'error' = $op 'error' 'reason' ?? 'unknown', ; } if $failures == { $this- log- warn 'opensearch.bulk.items failed', $failures ; } } } The eight-second timeout matters. PHP-FPM under LiteSpeed has a 180-second ceiling, but if OpenSearch is overloaded we want to surface failures fast, not hold the worker. We batch in 200-document chunks because smaller batches give us better backpressure when JVM heap is tight, and the bulk endpoint copes well at this size. One thing we learned the hard way: do not use the external versioning mode unless you have a real monotonic version. We tried using published at as the version field, but YouTube backfills publishedAt sometimes when a video gets re-uploaded under the same ID, and the new value was occasionally smaller than the old one. Documents stopped updating silently. Default internal versioning is the safe choice for a feed-style index. The interesting part is the query side. A user types something — possibly with typos, possibly in CJK, possibly mixed BTS dynamite live . We want fuzzy matching for Latin terms, exact bigram matching for CJK, and edge n-gram matching for prefix-as-you-type behavior. All three need to score together so the best result wins. We use a single multi match with best fields type, plus a fuzziness parameter that only kicks in for Latin scripts. php <?php declare strict types=1 ; namespace App\Search; final class OpenSearchQuery { private const ENDPOINT = 'http://127.0.0.1:9200'; private const INDEX = 'videos v3'; public function search string $term, int $limit = 24, ?string $region = null : array { $term = trim $term ; if $term === '' { return ; } $isCjk = preg match '/ \x{3000}-\x{9FFF}\x{AC00}-\x{D7AF} /u', $term === 1; $fuzziness = $isCjk ? '0' : 'AUTO:4,7'; $query = 'size' = $limit, 'query' = 'function score' = 'query' = 'bool' = 'should' = 'multi match' = 'query' = $term, 'fields' = 'title^3', 'channel name^1.5' , 'type' = 'best fields', 'fuzziness' = $fuzziness, 'prefix length' = 1, 'max expansions' = 30, , , 'match' = 'title.edge' = 'query' = $term, 'boost' = 2.0, , , , 'match' = 'title.cjk' = 'query' = $term, 'boost' = $isCjk ? 4.0 : 0.5, , , , , 'minimum should match' = 1, 'filter' = $this- buildFilters $region , , , 'functions' = 'gauss' = 'published at' = 'origin' = 'now', 'scale' = '30d', 'decay' = 0.5, , , , 'field value factor' = 'field' = 'view count', 'modifier' = 'log1p', 'factor' = 0.2, 'missing' = 0, , , , 'score mode' = 'sum', 'boost mode' = 'multiply', , , ; return $this- execute $query ; } private function buildFilters ?string $region : array { $filters = 'range' = 'duration seconds' = 'gte' = 30, 'lte' = 1800 , ; if $region == null { $filters = 'term' = 'region' = $region ; } return $filters; } private function execute array $query : array { $ch = curl init self::ENDPOINT . '/' . self::INDEX . '/ search' ; curl setopt array $ch, CURLOPT POST = true, CURLOPT POSTFIELDS = json encode $query, JSON UNESCAPED UNICODE | JSON THROW ON ERROR , CURLOPT HTTPHEADER = 'Content-Type: application/json' , CURLOPT RETURNTRANSFER = true, CURLOPT TIMEOUT = 2, CURLOPT CONNECTTIMEOUT = 1, ; $body = curl exec $ch ; $code = curl getinfo $ch, CURLINFO HTTP CODE ; curl close $ch ; if $code == 200 || $body === false { throw new SearchUnavailable 'OpenSearch returned ' . $code ; } $decoded = json decode $body, true, 512, JSON THROW ON ERROR ; return array map fn array $hit = 'id' = $hit ' id' , 'score' = $hit ' score' + $hit ' source' , $decoded 'hits' 'hits' ?? ; } } The key decisions here: fuzziness: AUTO:4,7 means terms of length 1-3 match exactly, length 4-6 allow one edit, length 7 and up allow two edits. For CJK we disable fuzziness entirely because edit distance on logographic characters returns nonsense — 日本 and 本日 are one transposition apart but mean different things. prefix length: 1 forces the first character to match, which dramatically reduces the candidate set for short queries that would otherwise expand into thousands of fuzzy matches. max expansions: 30 caps how many terms each fuzzy query can expand to, protecting us from a pathological single-letter search melting the heap. function score decays results by publish date with a 30-day Gaussian half-life and boosts by view count log1p keeps mega-viral videos from completely dominating . k-pop from accidentally matching every Korean clip through romanization noise. duration seconds excludes Shorts under 30s and full episodes over 30min , which is a TopVideoHub-specific UX choice.OpenSearch processes crash. The JVM does long GC pauses. The VPS reboots. In a search UX, going from fuzzy results to no results because the daemon died for 90 seconds is unacceptable. We wrapped the OpenSearch query in a circuit breaker that falls back to SQLite FTS5 on repeated failure. The pattern is straightforward, but state has to be process-local because we are on PHP — no shared memory unless we want to involve APCu or Redis. A tiny JSON file in /tmp is enough because LiteSpeed pins the PHP worker pool to a single host. php <?php declare strict types=1 ; namespace App\Search; final class SearchService { private const CIRCUIT FILE = '/tmp/opensearch circuit.json'; private const FAILURE THRESHOLD = 3; private const COOLDOWN SECONDS = 60; public function construct private readonly OpenSearchQuery $primary, private readonly Fts5Search $fallback, private readonly \App\Logger $log, {} public function search string $term, int $limit = 24, ?string $region = null : array { if $this- circuitOpen { return $this- fallback- search $term, $limit, $region ; } try { $results = $this- primary- search $term, $limit, $region ; $this- recordSuccess ; return $results; } catch SearchUnavailable $e { $this- recordFailure ; $this- log- warn 'search.fallback', 'error' = $e- getMessage ; return $this- fallback- search $term, $limit, $region ; } } private function circuitOpen : bool { $state = $this- readState ; if $state 'failures' < self::FAILURE THRESHOLD { return false; } return time - $state 'last failure' < self::COOLDOWN SECONDS; } private function recordSuccess : void { $this- writeState 'failures' = 0, 'last failure' = 0 ; } private function recordFailure : void { $state = $this- readState ; $this- writeState 'failures' = $state 'failures' + 1, 'last failure' = time , ; } private function readState : array { if is file self::CIRCUIT FILE { return 'failures' = 0, 'last failure' = 0 ; } $raw = @file get contents self::CIRCUIT FILE ; if $raw === false { return 'failures' = 0, 'last failure' = 0 ; } return json decode $raw, true ?: 'failures' = 0, 'last failure' = 0 ; } private function writeState array $state : void { @file put contents self::CIRCUIT FILE, json encode $state , LOCK EX ; } } After three consecutive failures we stop hitting OpenSearch for 60 seconds and route everything to FTS5. After the cooldown we let one request through; if it succeeds, the counter resets. The fallback FTS5 path returns visibly fewer results for typo-heavy queries, but it always returns something, which is the important property. The search endpoint gets hammered — roughly 8% of pageviews. We could not let every search query reach origin. But search results are user-typed input, so caching the wrong response is a real risk if Vary is wrong. The approach was three layers: Cache-Control: public, max-age=600, stale-while-revalidate=1800 on search responses with a non-empty query string. <IfModule LiteSpeed block adds CacheLookup public on for /search , so LSCache kicks in before PHP even boots.Cloudflare Free does not cache HTML by default. We added a Cache Rule for URI Path starts with "/search" to set Cache Level to Cache Everything with TTL 600. Combined with Vary: Cookie scoped to only the region cookie, this works without leaking sessions across users. The gotcha was personalization: logged-in users see watch-later markers on result cards, and those would poison the cache. We solved this by rendering the user-specific overlay client-side from a tiny /api/watchlater.json endpoint that the cache layer skips. Nothing about the cached HTML body is user-specific. After two weeks of running OpenSearch alongside FTS5 with traffic split 90/10 to OpenSearch, the results were: The fuzzy queries that drove the most improvement, in rough order of impact: blakpink , newgens , itzy variants . hoa minzy matching Hòa Minzy . k pop matching k-pop , bts v matching BTS-V . aimyon matching あいみょん , which is a separate post.OpenSearch was the right tool for the typo-tolerance problem, but the migration was as much about the surrounding pieces as the engine itself. The circuit breaker that protects PHP workers from a sick OpenSearch process, the LSCache plus Cloudflare cache layer that absorbs most of the search traffic before queries even hit Lucene, and the per-field analyzer strategy for CJK content are what made it production-viable on a budget LiteSpeed host. SQLite FTS5 still has a job as the fallback and as the source of truth for exact-substring features like channel-scoped search, but for free-form user queries with typos, edit-distance fuzziness is non-negotiable. If you are running a similar multi-language aggregator and your search box still goes straight to FTS5 or LIKE queries, the migration is doable in a couple of weeks with one engineer. Start with the index mapping, build the indexing pipeline, then layer the circuit breaker before cutting traffic over.