Building Typo-Tolerant Multi-Language Video Search with OpenSearch and PHP

At TopVideoHub, a video aggregator pulling trending content from nine Asia-Pacific regions every four hours, engineers migrated from SQLite FTS5 to OpenSearch to implement typo-tolerant search across Japanese, Korean, Mandarin, Vietnamese, Thai, and English titles. The original setup collapsed on misspellings like "blakpink" for "blackpink" and on romanized queries like "Aimyon" for "あいみょん", while OpenSearch's multi-field mapping with ICU, kuromoji, nori, and smartcn analyzers now handles fuzziness and language-aware tokenization on a single-node deployment.

When you run a video aggregator that pulls trending content from nine Asia-Pacific regions every four hours, search becomes the place where users either find what they want or leave for YouTube directly. At [TopVideoHub](https://topvideohub.com) we aggregate clips with Japanese, Korean, Mandarin, Vietnamese, Thai, and English titles in the same index. Our original SQLite FTS5 setup with a CJK tokenizer worked well for exact substring matches but collapsed the moment a user typed `blakpink` instead of `blackpink`, or searched for `Aimyon` when our database stored `あいみょん`.

This post walks through our migration from SQLite FTS5 to OpenSearch for typo-tolerant search, while keeping FTS5 as a fallback. Running on a budget LiteSpeed host meant we could not just throw a five-node Elasticsearch cluster at the problem: we had to be careful about heap pressure, indexing latency, and how Cloudflare caches the search responses.

SQLite FTS5 with the `unicode61` tokenizer plus a CJK bigram tokenizer handles tokenization across scripts well enough. Our original schema looked like this:

```sql
CREATE VIRTUAL TABLE videos_fts USING fts5(
    title,
    channel_name,
    description,
    tokenize = 'unicode61 remove_diacritics 2',
    content = 'videos',
    content_rowid = 'id'
);
```

This gave us fast prefix and substring matching. The problem is that FTS5 does not implement fuzziness in the Damerau-Levenshtein sense.
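To make "fuzziness in the Damerau-Levenshtein sense" concrete, here is a minimal sketch of the restricted variant (optimal string alignment), which counts insertions, deletions, substitutions, and swaps of adjacent characters. The function name `osaDistance` is a hypothetical helper for illustration, not part of our codebase, and the sketch assumes PHP 7.4+ with the mbstring extension for `mb_str_split`.

```php
<?php
declare(strict_types=1);

/**
 * Optimal string alignment distance (restricted Damerau-Levenshtein).
 * Hypothetical helper, shown only to illustrate what "edit distance" measures.
 */
function osaDistance(string $a, string $b): int
{
    // Split on Unicode code points so a CJK character counts as one unit.
    $s = mb_str_split($a, 1, 'UTF-8');
    $t = mb_str_split($b, 1, 'UTF-8');
    $n = count($s);
    $m = count($t);

    // $d[$i][$j] = distance between the first $i chars of $a and the first $j of $b.
    $d = [];
    for ($i = 0; $i <= $n; $i++) {
        $d[$i] = [$i];
    }
    for ($j = 0; $j <= $m; $j++) {
        $d[0][$j] = $j;
    }

    for ($i = 1; $i <= $n; $i++) {
        for ($j = 1; $j <= $m; $j++) {
            $cost = ($s[$i - 1] === $t[$j - 1]) ? 0 : 1;
            $d[$i][$j] = min(
                $d[$i - 1][$j] + 1,         // deletion
                $d[$i][$j - 1] + 1,         // insertion
                $d[$i - 1][$j - 1] + $cost  // substitution or match
            );
            // Swap of two adjacent characters counts as a single edit.
            if ($i > 1 && $j > 1
                && $s[$i - 1] === $t[$j - 2]
                && $s[$i - 2] === $t[$j - 1]) {
                $d[$i][$j] = min($d[$i][$j], $d[$i - 2][$j - 2] + 1);
            }
        }
    }

    return $d[$n][$m];
}

echo osaDistance('blakpink', 'blackpink'), PHP_EOL;  // 1: one missing "c"
echo osaDistance('blcakpink', 'blackpink'), PHP_EOL; // 1: "ca" swapped for "ac"
```

A query that is one edit away from an indexed term is exactly the kind of near miss a fuzzy engine is expected to recover.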
If a user searched for `blakpink`, FTS5 returned zero results: the trigram split (`bla`, `lak`, `akp`, `kpi`, `pin`, `ink`) contains `lak` and `akp`, which never occur in the indexed `blackpink` tokens, and FTS5 only matches when every query token is present.

We tried compensating with edge n-grams stored as a denormalized column, then with manual misspelling synonym dictionaries. Both approaches worked for narrow cases but exploded index size and required hand-maintained mappings for every new artist or trending phrase. For CJK content, n-grams of length 1-2 produced massive recall but garbage precision: searching for `新` returned more than 40,000 videos because the character appears in roughly every fourth Chinese title.

The decision point was clear. We needed an engine that supports proper fuzziness with edit distance, language-aware analyzers, and the ability to weight matches across multiple fields. OpenSearch fit the bill, especially because we could run a single-node deployment on a small VPS we already had provisioned for blog ingestion. It is self-hosted, with no managed-service bill, and the analyzers we needed are available for the OpenSearch 2.x line as official plugins.

The first real engineering decision was the index mapping. Chinese and Japanese do not put whitespace between words, and Korean attaches particles to word stems, so the standard analyzer is a poor fit for all three. OpenSearch offers an ICU analysis plugin, and there are well-maintained kuromoji (Japanese), nori (Korean), and smartcn (Chinese) plugins.

Rather than maintain one analyzer per language and a router on the application side, we used the multi-fields pattern. The same source text is indexed under several analyzers and we search them with `multi_match`. The query engine scores each document by its best-matching field, and we apply per-field score weights on the application side.
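As a rough sketch of how such a query could be assembled in PHP, the function below builds a `multi_match` request body across the multi-fields. The function name `buildSearchBody`, the boost values, and the `prefix_length` are illustrative assumptions rather than our production settings; the field names assume subfields called `title.edge` and `title.cjk` plus a `channel_name` field.

```php
<?php
declare(strict_types=1);

/**
 * Build a multi_match request body that searches one title under several analyzers.
 * Hypothetical helper: boosts and prefix_length are assumed values for illustration.
 */
function buildSearchBody(string $userQuery, int $size = 20): array
{
    return [
        'size'  => $size,
        'query' => [
            'multi_match' => [
                'query' => $userQuery,
                // best_fields scores each document by its best-matching field.
                'type'  => 'best_fields',
                // "^N" is the per-field boost, set here on the application side.
                'fields' => [
                    'title^3',
                    'title.edge^2',
                    'title.cjk^2',
                    'channel_name',
                ],
                // AUTO scales the allowed edit distance with term length.
                'fuzziness'     => 'AUTO',
                // Require the first character to match exactly (assumed setting).
                'prefix_length' => 1,
            ],
        ],
    ];
}

// Print the JSON body that would be sent to the index's _search endpoint.
echo json_encode(
    buildSearchBody('blakpink'),
    JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE
), PHP_EOL;
```

With `fuzziness` set to `AUTO`, terms of three to five characters tolerate one edit and longer terms tolerate two, so an eight-character query like `blakpink` can still reach `blackpink`.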
Here is the mapping we settled on after about two weeks of tuning:

```json
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "title_standard": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": ["icu_folding", "lowercase"]
        },
        "title_edge_ngram": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": ["icu_folding", "lowercase", "edge_ngram_filter"]
        },
        "title_cjk_bigram": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": ["cjk_bigram", "lowercase"]
        }
      },
      "filter": {
        "edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "video_id": { "type": "keyword" },
      "title": {
        "type": "text",
        "analyzer": "title_standard",
        "fields": {
          "edge": {
            "type": "text",
            "analyzer": "title_edge_ngram",
            "search_analyzer": "title_standard"
          },
          "cjk": { "type": "text", "analyzer": "title_cjk_bigram" },
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },
      "channel_name": { "type": "text", "analyzer": "title_standard" },
      "region": { "type": "keyword" },
      "category_id": { "type": "integer" },
      "published_at": { "type": "date" },
      "view_count": { "type": "long" },
      "duration_seconds": { "type": "integer" }
    }
  }
}
```

A few design choices worth explaining:

- `number_of_replicas` is set to zero because a single-node deployment has no second node to hold replica shards, and the default of one leaves the cluster in yellow status forever and breaks health-check scripts.
- `icu_folding` handles diacritic stripping for Vietnamese (`hóa` becomes `hoa`) and width normalization for fullwidth Latin (`ＨＥＬＬＯ` becomes `hello`).
- The `cjk_bigram` filter splits CJK characters into overlapping pairs, so `日本語` indexes as `日本`, `本語`. This is the standard approach when you do not want to ship and version a heavy morphological dictionary.
- `search_analyzer` differs from `analyzer` on the `edge` subfield. We index with edge n-grams but search without them.
Otherwise every search term would also explode into n-grams on the query side, and the relevance scoring would be unusable.

Our existing fetch cron pulls from the YouTube Data API every 2-7 hours depending on the site, normalizes the payload, and writes to SQLite. The OpenSearch indexing hooks into the same path as a fire-and-forget bulk step at the end of each fetch cycle. If OpenSearch is unavailable, we log and move on; SQLite remains the source of truth.

php