{"slug": "building-typo-tolerant-multi-language-video-search-with-opensearch-and-php", "title": "Building Typo-Tolerant Multi-Language Video Search with OpenSearch and PHP", "summary": "At TopVideoHub, a video aggregator pulling trending content from nine Asia-Pacific regions every four hours, engineers migrated from SQLite FTS5 to OpenSearch to implement typo-tolerant search across Japanese, Korean, Mandarin, Vietnamese, Thai, and English titles. The original setup collapsed on misspellings like \"blakpink\" for \"blackpink\" or \"Aimyon\" for \"あいみょん,\" while OpenSearch's multi-field mapping with ICU, kuromoji, nori, and smartcn analyzers now handles fuzziness and language-aware tokenization on a single-node deployment.", "body_md": "When you run a video aggregator that pulls trending content from nine Asia-Pacific regions every four hours, search becomes the place where users either find what they want or leave for YouTube directly. At [TopVideoHub](https://topvideohub.com) we aggregate clips across Japanese, Korean, Mandarin, Vietnamese, Thai, and English titles in the same index, and our original SQLite FTS5 setup with a CJK tokenizer worked well for exact substring matches but collapsed the moment a user typed `blakpink` instead of `blackpink`, or searched for `Aimyon` when our database stored `あいみょん`.\n\nThis post walks through our migration from SQLite FTS5 to OpenSearch for typo-tolerant search, while keeping FTS5 as a fallback. Running on a budget LiteSpeed host meant we could not just throw a five-node Elasticsearch cluster at the problem; we had to be careful about heap pressure, indexing latency, and how Cloudflare caches the search responses.\n\nSQLite FTS5 with the `unicode61` tokenizer plus a CJK bigram tokenizer handles tokenization across scripts well enough. 
Our migration looked like this:\n\n```\nCREATE VIRTUAL TABLE videos_fts USING fts5(\n  title,\n  channel_name,\n  description,\n  tokenize = 'unicode61 remove_diacritics 2',\n  content = 'videos',\n  content_rowid = 'id'\n);\n```\n\nThis gave us fast prefix and substring matching. The problem is that FTS5 does not implement fuzziness in the Damerau-Levenshtein sense. If a user searched for `blakpink`, FTS5 returned zero results because the query's trigram split (`bla`, `lak`, `akp`, `kpi`, `pin`, `ink`) includes `lak` and `akp`, which never occur in the indexed `blackpink` tokens.\n\nWe tried compensating with edge n-grams stored as a denormalized column, then with manual misspelling synonym dictionaries. Both approaches worked for narrow cases but exploded index size and required hand-maintained mappings for every new artist or trending phrase. For CJK content, n-grams of length 1-2 produced massive recall but garbage precision: searching for `新` returned more than 40,000 videos because the character appears in roughly every fourth Chinese title.\n\nThe decision point was clear. We needed an engine that supports proper fuzziness with edit distance, language-aware analyzers, and the ability to weight matches across multiple fields. OpenSearch fit the bill, especially because we could run a single-node deployment on a small VPS we already had provisioned for blog ingestion. Self-hosted, no managed-service bill, and the OpenSearch 2.x line ships with the analyzers we needed out of the box.\n\nThe first real engineering decision was the index mapping. CJK languages do not use whitespace as a token boundary, so the standard analyzer is useless for Japanese, Chinese, and Korean. OpenSearch ships with the ICU analysis plugin, and there are well-maintained `kuromoji` (Japanese), `nori` (Korean), and `smartcn` (Chinese) plugins.\n\nRather than maintain one analyzer per language and a router on the application side, we used the multi-fields pattern. 
The same source text is indexed under several analyzers and we search them with `multi_match`. The query engine picks the best match per shard, and we score-weight them on the application side.\n\nHere is the mapping we settled on after about two weeks of tuning:\n\n```\n{\n  \"settings\": {\n    \"number_of_shards\": 1,\n    \"number_of_replicas\": 0,\n    \"analysis\": {\n      \"analyzer\": {\n        \"title_standard\": {\n          \"type\": \"custom\",\n          \"tokenizer\": \"icu_tokenizer\",\n          \"filter\": [\"icu_folding\", \"lowercase\"]\n        },\n        \"title_edge_ngram\": {\n          \"type\": \"custom\",\n          \"tokenizer\": \"icu_tokenizer\",\n          \"filter\": [\"icu_folding\", \"lowercase\", \"edge_ngram_filter\"]\n        },\n        \"title_cjk_bigram\": {\n          \"type\": \"custom\",\n          \"tokenizer\": \"icu_tokenizer\",\n          \"filter\": [\"cjk_bigram\", \"lowercase\"]\n        }\n      },\n      \"filter\": {\n        \"edge_ngram_filter\": {\n          \"type\": \"edge_ngram\",\n          \"min_gram\": 2,\n          \"max_gram\": 15\n        }\n      }\n    }\n  },\n  \"mappings\": {\n    \"properties\": {\n      \"video_id\": { \"type\": \"keyword\" },\n      \"title\": {\n        \"type\": \"text\",\n        \"analyzer\": \"title_standard\",\n        \"fields\": {\n          \"edge\": {\n            \"type\": \"text\",\n            \"analyzer\": \"title_edge_ngram\",\n            \"search_analyzer\": \"title_standard\"\n          },\n          \"cjk\": {\n            \"type\": \"text\",\n            \"analyzer\": \"title_cjk_bigram\"\n          },\n          \"keyword\": {\n            \"type\": \"keyword\",\n            \"ignore_above\": 256\n          }\n        }\n      },\n      \"channel_name\": { \"type\": \"text\", \"analyzer\": \"title_standard\" },\n      \"region\":       { \"type\": \"keyword\" },\n      \"category_id\":  { \"type\": \"integer\" },\n      \"published_at\": { \"type\": \"date\" },\n      \"view_count\":   { \"type\": \"long\" },\n      \"duration_seconds\": { \"type\": \"integer\" }\n    }\n  }\n}\n```\n\nA few design choices worth explaining:\n\n- We set `number_of_replicas` to zero because a single-node deployment cannot replicate, and the default of one leaves the cluster in yellow status forever and breaks health-check scripts.\n- `icu_folding` handles diacritic stripping for Vietnamese (`hóa` becomes `hoa`) and width normalization for fullwidth Latin (`ＨＥＬＬＯ` becomes `hello`).\n- The `cjk_bigram` filter splits CJK characters into overlapping pairs, so `日本語` indexes as `日本`, `本語`. This is the standard approach when you do not want to ship and version a heavy morphological dictionary.\n- `search_analyzer` differs from `analyzer` on the `edge` subfield. We index with edge n-grams but search without them. Otherwise every search term would also explode into n-grams on the query side, and the relevance scoring would be unusable.\n\nOur existing fetch cron pulls from the YouTube Data API every 2-7 hours depending on the site, normalizes the payload, and writes to SQLite. The OpenSearch indexing hooks into the same path as a fire-and-forget bulk step at the end of each fetch cycle. 
If OpenSearch is unavailable, we log and move on — SQLite remains the source of truth.\n\n``` php\n<?php\n\ndeclare(strict_types=1);\n\nnamespace App\\Search;\n\nfinal class OpenSearchIndexer\n{\n    private const ENDPOINT = 'http://127.0.0.1:9200';\n    private const INDEX = 'videos_v3';\n    private const BATCH_SIZE = 200;\n\n    public function __construct(\n        private readonly \\PDO $db,\n        private readonly \\App\\Logger $log,\n    ) {}\n\n    public function indexBatch(array $videoIds): int\n    {\n        if ($videoIds === []) {\n            return 0;\n        }\n\n        $placeholders = implode(',', array_fill(0, count($videoIds), '?'));\n        $stmt = $this->db->prepare(\n            \"SELECT id, video_id, title, channel_name, region,\n                    category_id, published_at, view_count, duration_seconds\n             FROM videos WHERE id IN ($placeholders)\"\n        );\n        $stmt->execute($videoIds);\n\n        $bulk = '';\n        $count = 0;\n        while ($row = $stmt->fetch(\\PDO::FETCH_ASSOC)) {\n            $bulk .= json_encode([\n                'index' => [\n                    '_index' => self::INDEX,\n                    '_id'    => $row['video_id'],\n                ],\n            ], JSON_THROW_ON_ERROR) . \"\\n\";\n\n            $bulk .= json_encode([\n                'video_id'         => $row['video_id'],\n                'title'            => $row['title'],\n                'channel_name'     => $row['channel_name'],\n                'region'           => $row['region'],\n                'category_id'      => (int)$row['category_id'],\n                'published_at'     => $row['published_at'],\n                'view_count'       => (int)$row['view_count'],\n                'duration_seconds' => (int)$row['duration_seconds'],\n            ], JSON_THROW_ON_ERROR | JSON_UNESCAPED_UNICODE) . 
\"\\n\";\n\n            $count++;\n        }\n\n        $response = $this->postBulk($bulk);\n        if ($response === null) {\n            $this->log->warn('opensearch.bulk.failed', ['count' => $count]);\n            return 0;\n        }\n\n        if (!empty($response['errors'])) {\n            $this->logBulkErrors($response['items']);\n        }\n\n        return $count;\n    }\n\n    private function postBulk(string $payload): ?array\n    {\n        $ch = curl_init(self::ENDPOINT . '/_bulk');\n        curl_setopt_array($ch, [\n            CURLOPT_POST           => true,\n            CURLOPT_POSTFIELDS     => $payload,\n            CURLOPT_HTTPHEADER     => ['Content-Type: application/x-ndjson'],\n            CURLOPT_RETURNTRANSFER => true,\n            CURLOPT_TIMEOUT        => 8,\n            CURLOPT_CONNECTTIMEOUT => 2,\n        ]);\n        $body = curl_exec($ch);\n        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);\n        curl_close($ch);\n\n        if ($code !== 200 || $body === false) {\n            return null;\n        }\n\n        return json_decode($body, true, 512, JSON_THROW_ON_ERROR);\n    }\n\n    private function logBulkErrors(array $items): void\n    {\n        $failures = [];\n        foreach ($items as $item) {\n            $op = $item['index'] ?? $item['create'] ?? null;\n            if ($op === null || ($op['status'] ?? 200) < 400) {\n                continue;\n            }\n            $failures[] = [\n                'id'     => $op['_id'] ?? null,\n                'status' => $op['status'],\n                'error'  => $op['error']['reason'] ?? 'unknown',\n            ];\n        }\n        if ($failures !== []) {\n            $this->log->warn('opensearch.bulk.items_failed', $failures);\n        }\n    }\n}\n```\n\nThe eight-second timeout matters. PHP-FPM under LiteSpeed has a 180-second ceiling, but if OpenSearch is overloaded we want to surface failures fast, not hold the worker. 
We batch in 200-document chunks because smaller batches give us better backpressure when JVM heap is tight, and the bulk endpoint copes well at this size.\n\nOne thing we learned the hard way: do not use the `external` versioning mode unless you have a real monotonic version. We tried using `published_at` as the version field, but YouTube backfills `publishedAt` sometimes when a video gets re-uploaded under the same ID, and the new value was occasionally smaller than the old one. Documents stopped updating silently. Default internal versioning is the safe choice for a feed-style index.\n\nThe interesting part is the query side. A user types something, possibly with typos, possibly in CJK, possibly mixed (`BTS dynamite live`). We want fuzzy matching for Latin terms, exact bigram matching for CJK, and edge n-gram matching for prefix-as-you-type behavior. All three need to score together so the best result wins.\n\nWe use a single `multi_match` with `best_fields` type, plus a `fuzziness` parameter that only kicks in for Latin scripts.\n\n``` php\n<?php\n\ndeclare(strict_types=1);\n\nnamespace App\\Search;\n\nfinal class OpenSearchQuery\n{\n    private const ENDPOINT = 'http://127.0.0.1:9200';\n    private const INDEX = 'videos_v3';\n\n    public function search(string $term, int $limit = 24, ?string $region = null): array\n    {\n        $term = trim($term);\n        if ($term === '') {\n            return [];\n        }\n\n        $isCjk = preg_match('/[\\x{3000}-\\x{9FFF}\\x{AC00}-\\x{D7AF}]/u', $term) === 1;\n        $fuzziness = $isCjk ? 
'0' : 'AUTO:4,7';\n\n        $query = [\n            'size' => $limit,\n            'query' => [\n                'function_score' => [\n                    'query' => [\n                        'bool' => [\n                            'should' => [\n                                [\n                                    'multi_match' => [\n                                        'query'          => $term,\n                                        'fields'         => ['title^3', 'channel_name^1.5'],\n                                        'type'           => 'best_fields',\n                                        'fuzziness'      => $fuzziness,\n                                        'prefix_length'  => 1,\n                                        'max_expansions' => 30,\n                                    ],\n                                ],\n                                [\n                                    'match' => [\n                                        'title.edge' => [\n                                            'query' => $term,\n                                            'boost' => 2.0,\n                                        ],\n                                    ],\n                                ],\n                                [\n                                    'match' => [\n                                        'title.cjk' => [\n                                            'query' => $term,\n                                            'boost' => $isCjk ? 
4.0 : 0.5,\n                                        ],\n                                    ],\n                                ],\n                            ],\n                            'minimum_should_match' => 1,\n                            'filter' => $this->buildFilters($region),\n                        ],\n                    ],\n                    'functions' => [\n                        [\n                            'gauss' => [\n                                'published_at' => [\n                                    'origin' => 'now',\n                                    'scale'  => '30d',\n                                    'decay'  => 0.5,\n                                ],\n                            ],\n                        ],\n                        [\n                            'field_value_factor' => [\n                                'field'    => 'view_count',\n                                'modifier' => 'log1p',\n                                'factor'   => 0.2,\n                                'missing'  => 0,\n                            ],\n                        ],\n                    ],\n                    'score_mode' => 'sum',\n                    'boost_mode' => 'multiply',\n                ],\n            ],\n        ];\n\n        return $this->execute($query);\n    }\n\n    private function buildFilters(?string $region): array\n    {\n        $filters = [\n            ['range' => ['duration_seconds' => ['gte' => 30, 'lte' => 1800]]],\n        ];\n        if ($region !== null) {\n            $filters[] = ['term' => ['region' => $region]];\n        }\n        return $filters;\n    }\n\n    private function execute(array $query): array\n    {\n        $ch = curl_init(self::ENDPOINT . '/' . self::INDEX . 
'/_search');\n        curl_setopt_array($ch, [\n            CURLOPT_POST           => true,\n            CURLOPT_POSTFIELDS     => json_encode($query, JSON_UNESCAPED_UNICODE | JSON_THROW_ON_ERROR),\n            CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],\n            CURLOPT_RETURNTRANSFER => true,\n            CURLOPT_TIMEOUT        => 2,\n            CURLOPT_CONNECTTIMEOUT => 1,\n        ]);\n        $body = curl_exec($ch);\n        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);\n        curl_close($ch);\n\n        if ($code !== 200 || $body === false) {\n            throw new SearchUnavailable('OpenSearch returned ' . $code);\n        }\n\n        $decoded = json_decode($body, true, 512, JSON_THROW_ON_ERROR);\n        return array_map(\n            fn(array $hit) => ['id' => $hit['_id'], 'score' => $hit['_score']] + $hit['_source'],\n            $decoded['hits']['hits'] ?? []\n        );\n    }\n}\n```\n\nThe key decisions here:\n\n- `fuzziness: AUTO:4,7` means terms of length 1-3 match exactly, length 4-6 allow one edit, and length 7 and up allow two edits. For CJK we disable fuzziness entirely because edit distance on logographic characters returns nonsense: `日本` and `本日` are one transposition apart but mean different things.\n- `prefix_length: 1` forces the first character to match, which dramatically reduces the candidate set for short queries that would otherwise expand into thousands of fuzzy matches.\n- `max_expansions: 30` caps how many terms each fuzzy query can expand to, protecting us from a pathological single-letter search melting the heap.\n- `function_score` decays results by publish date with a 30-day Gaussian half-life and boosts by view count (`log1p` keeps mega-viral videos from completely dominating).\n- `k-pop` from accidentally matching every Korean clip through romanization noise.\n- The range filter on `duration_seconds` excludes Shorts (under 30s) and full episodes (over 30min), which is a TopVideoHub-specific UX choice.\n\nOpenSearch processes crash. The JVM does long GC pauses. The VPS reboots. In a search UX, going from fuzzy results to no results because the daemon died for 90 seconds is unacceptable. We wrapped the OpenSearch query in a circuit breaker that falls back to SQLite FTS5 on repeated failure.\n\nThe pattern is straightforward, but the breaker state has to live outside the PHP process, because workers share no memory unless we involve APCu or Redis. 
A tiny JSON file in `/tmp` is enough because LiteSpeed pins the PHP worker pool to a single host.\n\n``` php\n<?php\n\ndeclare(strict_types=1);\n\nnamespace App\\Search;\n\nfinal class SearchService\n{\n    private const CIRCUIT_FILE = '/tmp/opensearch_circuit.json';\n    private const FAILURE_THRESHOLD = 3;\n    private const COOLDOWN_SECONDS = 60;\n\n    public function __construct(\n        private readonly OpenSearchQuery $primary,\n        private readonly Fts5Search $fallback,\n        private readonly \\App\\Logger $log,\n    ) {}\n\n    public function search(string $term, int $limit = 24, ?string $region = null): array\n    {\n        if ($this->circuitOpen()) {\n            return $this->fallback->search($term, $limit, $region);\n        }\n\n        try {\n            $results = $this->primary->search($term, $limit, $region);\n            $this->recordSuccess();\n            return $results;\n        } catch (SearchUnavailable $e) {\n            $this->recordFailure();\n            $this->log->warn('search.fallback', ['error' => $e->getMessage()]);\n            return $this->fallback->search($term, $limit, $region);\n        }\n    }\n\n    private function circuitOpen(): bool\n    {\n        $state = $this->readState();\n        if ($state['failures'] < self::FAILURE_THRESHOLD) {\n            return false;\n        }\n        return (time() - $state['last_failure']) < self::COOLDOWN_SECONDS;\n    }\n\n    private function recordSuccess(): void\n    {\n        $this->writeState(['failures' => 0, 'last_failure' => 0]);\n    }\n\n    private function recordFailure(): void\n    {\n        $state = $this->readState();\n        $this->writeState([\n            'failures'     => $state['failures'] + 1,\n            'last_failure' => time(),\n        ]);\n    }\n\n    private function readState(): array\n    {\n        if (!is_file(self::CIRCUIT_FILE)) {\n            return ['failures' => 0, 'last_failure' => 0];\n        }\n        $raw = 
@file_get_contents(self::CIRCUIT_FILE);\n        if ($raw === false) {\n            return ['failures' => 0, 'last_failure' => 0];\n        }\n        return json_decode($raw, true) ?: ['failures' => 0, 'last_failure' => 0];\n    }\n\n    private function writeState(array $state): void\n    {\n        @file_put_contents(self::CIRCUIT_FILE, json_encode($state), LOCK_EX);\n    }\n}\n```\n\nAfter three consecutive failures we stop hitting OpenSearch for 60 seconds and route everything to FTS5. After the cooldown we send requests to OpenSearch again; the first success resets the counter. The fallback FTS5 path returns visibly fewer results for typo-heavy queries, but it always returns something, which is the important property.\n\nThe search endpoint gets hammered, at roughly 8% of pageviews. We could not let every search query reach origin. But search results are user-typed input, so caching the wrong response is a real risk if `Vary` is wrong.\n\nThe approach was three layers:\n\n- We set `Cache-Control: public, max-age=600, stale-while-revalidate=1800` on search responses with a non-empty query string.\n- An `<IfModule LiteSpeed>` block adds `CacheLookup public on` for `/search`, so LSCache kicks in before PHP even boots.\n\nCloudflare Free does not cache HTML by default. We added a Cache Rule for `URI Path starts with \"/search\"` to set Cache Level to Cache Everything with TTL 600. Combined with `Vary: Cookie` scoped to only the region cookie, this works without leaking sessions across users.\n\nThe gotcha was personalization: logged-in users see watch-later markers on result cards, and those would poison the cache. We solved this by rendering the user-specific overlay client-side from a tiny `/api/watchlater.json` endpoint that the cache layer skips. 
Nothing about the cached HTML body is user-specific.\n\nAfter two weeks of running OpenSearch alongside FTS5 with traffic split 90/10 to OpenSearch, the results were:\n\nThe fuzzy queries that drove the most improvement, in rough order of impact:\n\n1. `blakpink`, `newgens`, `itzy` variants.\n2. `hoa minzy` matching `Hòa Minzy`.\n3. `k pop` matching `k-pop`, `bts v` matching `BTS-V`.\n4. `aimyon` matching `あいみょん`, which is a separate post.\n\nOpenSearch was the right tool for the typo-tolerance problem, but the migration was as much about the surrounding pieces as the engine itself. The circuit breaker that protects PHP workers from a sick OpenSearch process, the LSCache plus Cloudflare cache layer that absorbs most of the search traffic before queries even hit Lucene, and the per-field analyzer strategy for CJK content are what made it production-viable on a budget LiteSpeed host. SQLite FTS5 still has a job as the fallback and as the source of truth for exact-substring features like channel-scoped search, but for free-form user queries with typos, edit-distance fuzziness is non-negotiable. If you are running a similar multi-language aggregator and your search box still goes straight to FTS5 or LIKE queries, the migration is doable in a couple of weeks with one engineer. 
Start with the index mapping, build the indexing pipeline, then layer the circuit breaker before cutting traffic over.", "url": "https://wpnews.pro/news/building-typo-tolerant-multi-language-video-search-with-opensearch-and-php", "canonical_source": "https://dev.to/ahmet_gedik778845/building-typo-tolerant-multi-language-video-search-with-opensearch-and-php-3dme", "published_at": "2026-06-03 09:00:01+00:00", "updated_at": "2026-06-03 09:13:39.574046+00:00", "lang": "en", "topics": ["ai-products", "ai-tools", "ai-infrastructure"], "entities": ["OpenSearch", "TopVideoHub", "SQLite", "Cloudflare", "LiteSpeed"], "alternates": {"html": "https://wpnews.pro/news/building-typo-tolerant-multi-language-video-search-with-opensearch-and-php", "markdown": "https://wpnews.pro/news/building-typo-tolerant-multi-language-video-search-with-opensearch-and-php.md", "text": "https://wpnews.pro/news/building-typo-tolerant-multi-language-video-search-with-opensearch-and-php.txt", "jsonld": "https://wpnews.pro/news/building-typo-tolerant-multi-language-video-search-with-opensearch-and-php.jsonld"}}