Building Typo-Tolerant Multi-Language Video Search with OpenSearch and PHP

wpnews.pro

When you run a video aggregator that pulls trending content from nine Asia-Pacific regions every four hours, search becomes the place where users either find what they want or leave for YouTube directly. At TopVideoHub we aggregate clips across Japanese, Korean, Mandarin, Vietnamese, Thai, and English titles in the same index, and our original SQLite FTS5 setup with a CJK tokenizer worked well for exact substring matches but collapsed the moment a user typed blakpink

instead of blackpink

, or searched for Aimyon

when our database stored あいみょん

.

This post walks through the migration we did from SQLite FTS5 to OpenSearch for typo-tolerant search, while keeping FTS5 as a fallback. Running on a budget LiteSpeed host means we could not just throw a five-node Elasticsearch cluster at the problem — we had to be careful about heap pressure, indexing latency, and how Cloudflare caches the search responses.

SQLite FTS5 with the unicode61

tokenizer plus a CJK bigram tokenizer handles tokenization across scripts well enough. Our migration looked like this:

CREATE VIRTUAL TABLE videos_fts USING fts5(
  title,
  channel_name,
  description,
  tokenize = 'unicode61 remove_diacritics 2',
  content = 'videos',
  content_rowid = 'id'
);

This gave us fast prefix and substring matching. The problem is that FTS5 does not implement fuzziness in the Damerau-Levenshtein sense. If a user searched for blakpink

, FTS5 returned zero results because the trigram split (bla

, lak

, akp

, kpi

, pin

, ink

) shared almost no overlap with the indexed blackpink

tokens.

We tried compensating with edge n-grams stored as a denormalized column, then with manual misspelling synonym dictionaries. Both approaches worked for narrow cases but exploded index size and required hand-maintained mappings for every new artist or trending phrase. For CJK content, n-grams of length 1-2 produced massive recall but garbage precision — searching for 新

returned 40,000 plus videos because the character appears in roughly every fourth Chinese title.

The decision point was clear. We needed an engine that supports proper fuzziness with edit distance, language-aware analyzers, and the ability to weight matches across multiple fields. OpenSearch fit the bill, especially because we could run a single-node deployment on a small VPS we already had provisioned for blog ingestion. Self-hosted, no managed-service bill, and the OpenSearch 2.x line ships with the analyzers we needed out of the box.

The first real engineering decision was the index mapping. CJK languages do not use whitespace as a token boundary, so the standard analyzer is useless for Japanese, Chinese, and Korean. OpenSearch ships with the ICU analysis plugin, and there are well-maintained kuromoji

(Japanese), nori

(Korean), and smartcn

(Chinese) plugins.

Rather than maintain one analyzer per language and a router on the application side, we used the multi-fields pattern. The same source text is indexed under several analyzers and we search them with multi_match

. The query engine picks the best match per shard, and we score-weight them on the application side.

Here is the mapping we settled on after about two weeks of tuning:

{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "analyzer": {
        "title_standard": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": ["icu_folding", "lowercase"]
        },
        "title_edge_ngram": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": ["icu_folding", "lowercase", "edge_ngram_filter"]
        },
        "title_cjk_bigram": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": ["cjk_bigram", "lowercase"]
        }
      },
      "filter": {
        "edge_ngram_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "video_id": { "type": "keyword" },
      "title": {
        "type": "text",
        "analyzer": "title_standard",
        "fields": {
          "edge": {
            "type": "text",
            "analyzer": "title_edge_ngram",
            "search_analyzer": "title_standard"
          },
          "cjk": {
            "type": "text",
            "analyzer": "title_cjk_bigram"
          },
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "channel_name": { "type": "text", "analyzer": "title_standard" },
      "region":       { "type": "keyword" },
      "category_id":  { "type": "integer" },
      "published_at": { "type": "date" },
      "view_count":   { "type": "long" },
      "duration_seconds": { "type": "integer" }
    }
  }
}

A few design choices worth explaining:

number_of_replicas

to zero because a single-node deployment cannot replicate, and the default of one leaves the cluster in yellow status forever and breaks health-check scripts.icu_folding

handles diacritic stripping for Vietnamese (hóa

becomes hoa

) and width normalization for fullwidth Latin (ＨＥＬＬＯ

becomes hello

).cjk_bigram

filter splits CJK characters into overlapping pairs, so 日本語

indexes as 日本

, 本語

. This is the standard approach when you do not want to ship and version a heavy morphological dictionary.search_analyzer

differs from analyzer

on the edge

subfield. We index with edge n-grams but search without them. Otherwise every search term would also explode into n-grams on the query side, and the relevance scoring would be unusable.Our existing fetch cron pulls from the YouTube Data API every 2-7 hours depending on the site, normalizes the payload, and writes to SQLite. The OpenSearch indexing hooks into the same path as a fire-and-forget bulk step at the end of each fetch cycle. If OpenSearch is unavailable, we log and move on — SQLite remains the source of truth.

<?php

declare(strict_types=1);

namespace App\Search;

final class OpenSearchIndexer
{
    private const ENDPOINT = 'http://127.0.0.1:9200';
    private const INDEX = 'videos_v3';
    private const BATCH_SIZE = 200;

    public function __construct(
        private readonly \PDO $db,
        private readonly \App\Logger $log,
    ) {}

    public function indexBatch(array $videoIds): int
    {
        if ($videoIds === []) {
            return 0;
        }

        $placeholders = implode(',', array_fill(0, count($videoIds), '?'));
        $stmt = $this->db->prepare(
            "SELECT id, video_id, title, channel_name, region,
                    category_id, published_at, view_count, duration_seconds
             FROM videos WHERE id IN ($placeholders)"
        );
        $stmt->execute($videoIds);

        $bulk = '';
        $count = 0;
        while ($row = $stmt->fetch(\PDO::FETCH_ASSOC)) {
            $bulk .= json_encode([
                'index' => [
                    '_index' => self::INDEX,
                    '_id'    => $row['video_id'],
                ],
            ], JSON_THROW_ON_ERROR) . "\n";

            $bulk .= json_encode([
                'video_id'         => $row['video_id'],
                'title'            => $row['title'],
                'channel_name'     => $row['channel_name'],
                'region'           => $row['region'],
                'category_id'      => (int)$row['category_id'],
                'published_at'     => $row['published_at'],
                'view_count'       => (int)$row['view_count'],
                'duration_seconds' => (int)$row['duration_seconds'],
            ], JSON_THROW_ON_ERROR | JSON_UNESCAPED_UNICODE) . "\n";

            $count++;
        }

        $response = $this->postBulk($bulk);
        if ($response === null) {
            $this->log->warn('opensearch.bulk.failed', ['count' => $count]);
            return 0;
        }

        if (!empty($response['errors'])) {
            $this->logBulkErrors($response['items']);
        }

        return $count;
    }

    private function postBulk(string $payload): ?array
    {
        $ch = curl_init(self::ENDPOINT . '/_bulk');
        curl_setopt_array($ch, [
            CURLOPT_POST           => true,
            CURLOPT_POSTFIELDS     => $payload,
            CURLOPT_HTTPHEADER     => ['Content-Type: application/x-ndjson'],
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_TIMEOUT        => 8,
            CURLOPT_CONNECTTIMEOUT => 2,
        ]);
        $body = curl_exec($ch);
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        if ($code !== 200 || $body === false) {
            return null;
        }

        return json_decode($body, true, 512, JSON_THROW_ON_ERROR);
    }

    private function logBulkErrors(array $items): void
    {
        $failures = [];
        foreach ($items as $item) {
            $op = $item['index'] ?? $item['create'] ?? null;
            if ($op === null || ($op['status'] ?? 200) < 400) {
                continue;
            }
            $failures[] = [
                'id'     => $op['_id'] ?? null,
                'status' => $op['status'],
                'error'  => $op['error']['reason'] ?? 'unknown',
            ];
        }
        if ($failures !== []) {
            $this->log->warn('opensearch.bulk.items_failed', $failures);
        }
    }
}

The eight-second timeout matters. PHP-FPM under LiteSpeed has a 180-second ceiling, but if OpenSearch is overloaded we want to surface failures fast, not hold the worker. We batch in 200-document chunks because smaller batches give us better backpressure when JVM heap is tight, and the bulk endpoint copes well at this size.

One thing we learned the hard way: do not use the external

versioning mode unless you have a real monotonic version. We tried using published_at

as the version field, but YouTube backfills publishedAt

sometimes when a video gets re-uploaded under the same ID, and the new value was occasionally smaller than the old one. Documents stopped updating silently. Default internal versioning is the safe choice for a feed-style index.

The interesting part is the query side. A user types something — possibly with typos, possibly in CJK, possibly mixed (BTS dynamite live

). We want fuzzy matching for Latin terms, exact bigram matching for CJK, and edge n-gram matching for prefix-as-you-type behavior. All three need to score together so the best result wins.

We use a single multi_match

with best_fields

type, plus a fuzziness

parameter that only kicks in for Latin scripts.

<?php

declare(strict_types=1);

namespace App\Search;

final class OpenSearchQuery
{
    private const ENDPOINT = 'http://127.0.0.1:9200';
    private const INDEX = 'videos_v3';

    public function search(string $term, int $limit = 24, ?string $region = null): array
    {
        $term = trim($term);
        if ($term === '') {
            return [];
        }

        $isCjk = preg_match('/[\x{3000}-\x{9FFF}\x{AC00}-\x{D7AF}]/u', $term) === 1;
        $fuzziness = $isCjk ? '0' : 'AUTO:4,7';

        $query = [
            'size' => $limit,
            'query' => [
                'function_score' => [
                    'query' => [
                        'bool' => [
                            'should' => [
                                [
                                    'multi_match' => [
                                        'query'          => $term,
                                        'fields'         => ['title^3', 'channel_name^1.5'],
                                        'type'           => 'best_fields',
                                        'fuzziness'      => $fuzziness,
                                        'prefix_length'  => 1,
                                        'max_expansions' => 30,
                                    ],
                                ],
                                [
                                    'match' => [
                                        'title.edge' => [
                                            'query' => $term,
                                            'boost' => 2.0,
                                        ],
                                    ],
                                ],
                                [
                                    'match' => [
                                        'title.cjk' => [
                                            'query' => $term,
                                            'boost' => $isCjk ? 4.0 : 0.5,
                                        ],
                                    ],
                                ],
                            ],
                            'minimum_should_match' => 1,
                            'filter' => $this->buildFilters($region),
                        ],
                    ],
                    'functions' => [
                        [
                            'gauss' => [
                                'published_at' => [
                                    'origin' => 'now',
                                    'scale'  => '30d',
                                    'decay'  => 0.5,
                                ],
                            ],
                        ],
                        [
                            'field_value_factor' => [
                                'field'    => 'view_count',
                                'modifier' => 'log1p',
                                'factor'   => 0.2,
                                'missing'  => 0,
                            ],
                        ],
                    ],
                    'score_mode' => 'sum',
                    'boost_mode' => 'multiply',
                ],
            ],
        ];

        return $this->execute($query);
    }

    private function buildFilters(?string $region): array
    {
        $filters = [
            ['range' => ['duration_seconds' => ['gte' => 30, 'lte' => 1800]]],
        ];
        if ($region !== null) {
            $filters[] = ['term' => ['region' => $region]];
        }
        return $filters;
    }

    private function execute(array $query): array
    {
        $ch = curl_init(self::ENDPOINT . '/' . self::INDEX . '/_search');
        curl_setopt_array($ch, [
            CURLOPT_POST           => true,
            CURLOPT_POSTFIELDS     => json_encode($query, JSON_UNESCAPED_UNICODE | JSON_THROW_ON_ERROR),
            CURLOPT_HTTPHEADER     => ['Content-Type: application/json'],
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_TIMEOUT        => 2,
            CURLOPT_CONNECTTIMEOUT => 1,
        ]);
        $body = curl_exec($ch);
        $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        curl_close($ch);

        if ($code !== 200 || $body === false) {
            throw new SearchUnavailable('OpenSearch returned ' . $code);
        }

        $decoded = json_decode($body, true, 512, JSON_THROW_ON_ERROR);
        return array_map(
            fn(array $hit) => ['id' => $hit['_id'], 'score' => $hit['_score']] + $hit['_source'],
            $decoded['hits']['hits'] ?? []
        );
    }
}

The key decisions here:

fuzziness: AUTO:4,7

means terms of length 1-3 match exactly, length 4-6 allow one edit, length 7 and up allow two edits. For CJK we disable fuzziness entirely because edit distance on logographic characters returns nonsense — 日本

and 本日

are one transposition apart but mean different things.prefix_length: 1

forces the first character to match, which dramatically reduces the candidate set for short queries that would otherwise expand into thousands of fuzzy matches.max_expansions: 30

caps how many terms each fuzzy query can expand to, protecting us from a pathological single-letter search melting the heap.function_score

decays results by publish date with a 30-day Gaussian half-life and boosts by view count (log1p

keeps mega-viral videos from completely dominating).k-pop

from accidentally matching every Korean clip through romanization noise.duration_seconds

excludes Shorts (under 30s) and full episodes (over 30min), which is a TopVideoHub-specific UX choice.OpenSearch processes crash. The JVM does long GC s. The VPS reboots. In a search UX, going from fuzzy results to no results because the daemon died for 90 seconds is unacceptable. We wrapped the OpenSearch query in a circuit breaker that falls back to SQLite FTS5 on repeated failure.

The pattern is straightforward, but state has to be process-local because we are on PHP — no shared memory unless we want to involve APCu or Redis. A tiny JSON file in /tmp

is enough because LiteSpeed pins the PHP worker pool to a single host.

<?php

declare(strict_types=1);

namespace App\Search;

final class SearchService
{
    private const CIRCUIT_FILE = '/tmp/opensearch_circuit.json';
    private const FAILURE_THRESHOLD = 3;
    private const COOLDOWN_SECONDS = 60;

    public function __construct(
        private readonly OpenSearchQuery $primary,
        private readonly Fts5Search $fallback,
        private readonly \App\Logger $log,
    ) {}

    public function search(string $term, int $limit = 24, ?string $region = null): array
    {
        if ($this->circuitOpen()) {
            return $this->fallback->search($term, $limit, $region);
        }

        try {
            $results = $this->primary->search($term, $limit, $region);
            $this->recordSuccess();
            return $results;
        } catch (SearchUnavailable $e) {
            $this->recordFailure();
            $this->log->warn('search.fallback', ['error' => $e->getMessage()]);
            return $this->fallback->search($term, $limit, $region);
        }
    }

    private function circuitOpen(): bool
    {
        $state = $this->readState();
        if ($state['failures'] < self::FAILURE_THRESHOLD) {
            return false;
        }
        return (time() - $state['last_failure']) < self::COOLDOWN_SECONDS;
    }

    private function recordSuccess(): void
    {
        $this->writeState(['failures' => 0, 'last_failure' => 0]);
    }

    private function recordFailure(): void
    {
        $state = $this->readState();
        $this->writeState([
            'failures'     => $state['failures'] + 1,
            'last_failure' => time(),
        ]);
    }

    private function readState(): array
    {
        if (!is_file(self::CIRCUIT_FILE)) {
            return ['failures' => 0, 'last_failure' => 0];
        }
        $raw = @file_get_contents(self::CIRCUIT_FILE);
        if ($raw === false) {
            return ['failures' => 0, 'last_failure' => 0];
        }
        return json_decode($raw, true) ?: ['failures' => 0, 'last_failure' => 0];
    }

    private function writeState(array $state): void
    {
        @file_put_contents(self::CIRCUIT_FILE, json_encode($state), LOCK_EX);
    }
}

After three consecutive failures we stop hitting OpenSearch for 60 seconds and route everything to FTS5. After the cooldown we let one request through; if it succeeds, the counter resets. The fallback FTS5 path returns visibly fewer results for typo-heavy queries, but it always returns something, which is the important property.

The search endpoint gets hammered — roughly 8% of pageviews. We could not let every search query reach origin. But search results are user-typed input, so caching the wrong response is a real risk if Vary

is wrong.

The approach was three layers:

Cache-Control: public, max-age=600, stale-while-revalidate=1800

on search responses with a non-empty query string.<IfModule LiteSpeed>

block adds CacheLookup public on

for /search

, so LSCache kicks in before PHP even boots.Cloudflare Free does not cache HTML by default. We added a Cache Rule for URI Path starts with "/search"

to set Cache Level to Cache Everything with TTL 600. Combined with Vary: Cookie

scoped to only the region cookie, this works without leaking sessions across users.

The gotcha was personalization: logged-in users see watch-later markers on result cards, and those would poison the cache. We solved this by rendering the user-specific overlay client-side from a tiny /api/watchlater.json

endpoint that the cache layer skips. Nothing about the cached HTML body is user-specific.

After two weeks of running OpenSearch alongside FTS5 with traffic split 90/10 to OpenSearch, the results were:

The fuzzy queries that drove the most improvement, in rough order of impact:

blakpink

, newgens

, itzy

variants).hoa minzy

matching Hòa Minzy

).k pop

matching k-pop

, bts v

matching BTS-V

).aimyon

matching あいみょん

), which is a separate post.OpenSearch was the right tool for the typo-tolerance problem, but the migration was as much about the surrounding pieces as the engine itself. The circuit breaker that protects PHP workers from a sick OpenSearch process, the LSCache plus Cloudflare cache layer that absorbs most of the search traffic before queries even hit Lucene, and the per-field analyzer strategy for CJK content are what made it production-viable on a budget LiteSpeed host. SQLite FTS5 still has a job as the fallback and as the source of truth for exact-substring features like channel-scoped search, but for free-form user queries with typos, edit-distance fuzziness is non-negotiable. If you are running a similar multi-language aggregator and your search box still goes straight to FTS5 or LIKE queries, the migration is doable in a couple of weeks with one engineer. Start with the index mapping, build the indexing pipeline, then layer the circuit breaker before cutting traffic over.

source & further reading

dev.to — original article Is your agent's grep tool a shell command? Build a Pear Protocol DeFi trading agent with MCP Who Fixes Broken Lovable and Bolt Apps? Your Real Options Compared

Building Typo-Tolerant Multi-Language Video Search with OpenSearch and PHP

Run your AI side-project on zahid.host