{"slug": "sweeping-i18n-leaks-with-four-parallel-ai-agents-from-300-candidates-down-to-60", "title": "Sweeping i18n leaks with four parallel AI agents — from 300 candidates down to 60 real bugs", "summary": "A developer used four parallel AI investigation agents and AST-based false-positive filtering to detect hardcoded Japanese text in a bilingual codebase, reducing 300 candidates to 60 real bugs. The cleanup uncovered that English-paying users had been receiving Japanese emails from the Stripe webhook for months. The approach combined parallel AI agents for different code areas with a Python script using the AST module to filter out false positives from language branches.", "body_md": "For any app past a certain size that's gone bilingual, the question **\"how much hardcoded Japanese is still hiding in our repo?\"** never quite goes away. A naive grep for `[ぁ-んァ-ヶ一-龯]` returns thousands of hits, and the vast majority are inside translation tables, already-branched code, or comments. The real leaks are buried.\n\nFor one cleanup pass we attacked this with **four parallel AI investigation agents plus AST-based false-positive filtering**. The result: **~300 candidates detected → ~60 real leaks → cleaned up across five rounds**. This post walks through the flow and the most interesting bug it uncovered — paying English users had been getting Japanese email from the Stripe webhook for months.\n\nA repository-wide grep returns thousands of hits, but the contents fall into four bins: **translation tables / already branched by `lang == 'en'` / comments and docstrings / real leaks**. The first three are harmless. Only the last shows Japanese to English users. 
The trouble is that grep can't separate them, and the volume is too high for a human to triage one by one.\n\nThe approach: **launch AI investigation agents in parallel with each one assigned a different surface area**.\n\n```\n[Agent 1] templates/*.html + lang/*.json    — data-i18n attribute gaps\n[Agent 2] server/wpmm-license/*.php          — license API\n[Agent 3] server/wpmm-web/*.php               — landing-page API\n[Agent 4] core/*.py + tools/*.py             — desktop app code\n```\n\nEach agent gets the same prompt template — \"enumerate user-facing JP hardcodes, decide as best you can whether each is already branched\" — and runs **independently**. Parallelism keeps wall-clock time below a single-agent run, and having four perspectives on the same kind of problem improves coverage.\n\nThe merged report came in around **300 candidates**. Still noisy.\n\nHidden in those 300 were heavy false-positive clusters:\n\n| Location | Count | Why it's a false positive |\n|---|---|---|\n| `templates/tos.html` | 63 | `tosJa` / `tosEn` blocks both exist; `switchLang` toggles them |\n| `core/report_generator.py` | 141 | All inside `if lang == 'en'` branches or `_JA` / `_EN` variant maps |\n\nGoing through 200 items by hand wasn't realistic. 
Instead, we wrote a Python script using the `ast` module to **mechanically decide \"does this function have a lang branch around the JP literal?\"** A sketch:\n\n```python\nimport ast\n\ndef has_lang_branch(func_node):\n    \"\"\"Does this function use `lang` in a conditional?\"\"\"\n    for node in ast.walk(func_node):\n        if isinstance(node, ast.If):\n            for sub in ast.walk(node.test):\n                if isinstance(sub, ast.Name) and sub.id == 'lang':\n                    return True\n    return False\n\ndef has_jp_literal(func_node):\n    \"\"\"Any Constant string node containing Japanese characters?\"\"\"\n    for node in ast.walk(func_node):\n        if isinstance(node, ast.Constant) and isinstance(node.value, str):\n            if any('぀' <= c <= '鿿' for c in node.value):\n                return True\n    return False\n\n# `functions` is assumed to hold the ast.FunctionDef nodes collected from the parsed module\n# A real leak = has JP literal AND no lang branch\nreal_leaks = [f for f in functions\n              if has_jp_literal(f) and not has_lang_branch(f)]\n```\n\nRunning this against the 141 in `report_generator.py` gave **essentially zero real leaks** (the one residual hit was a docstring false positive). The 63 in `tos.html` were also fully cleared by checking DOM structure + the presence of `switchLang`.\n\nNet: about **60 real leaks**, finally a tractable pile.\n\nInside those 60 was the largest single impact: **all four Stripe-webhook emails (purchase complete, renewal, payment failed, plan change) were hardcoded to Japanese**. Paying English-language users had been getting purchase confirmations, failure notices, everything in Japanese. The kind of bug that quietly persists forever unless you go looking for it.\n\nThe fix was a one-function language inference from the Stripe event:\n\n```\n/** Infer display language from Stripe event currency. */\nfunction lang_from_currency(string $currency): string {\n    $en_currencies = ['usd'];\n    return in_array(strtolower($currency), $en_currencies, true) ? 
'en' : 'ja';\n}\n```\n\nThe resulting `$lang` then gets passed into `send_license_email` / `send_payment_failed_email` / `send_plan_changed_email` / `send_renewal_email`, branching the subject and body, and **switching between `mb_language('uni')` and `mb_language('Japanese')`** so English subjects are UTF-8 Base64 encoded instead of ISO-2022-JP. Subject encoding is small but real: `mb_language('Japanese')` was MIME-encoding English subjects in ISO-2022-JP, which can raise spam scores on Gmail and Outlook.\n\nOn the license API side, we consolidated all language detection into one helper:\n\n```\n// server/wpmm-license/lib/i18n_helpers.php\nfunction resolve_request_lang(?array $body = null): string {\n    if (isset($body['language']) && in_array($body['language'], ['ja','en'], true)) {\n        return $body['language'];\n    }\n    // Accept-Language fallback\n    if (preg_match('/^en\\b/i', $_SERVER['HTTP_ACCEPT_LANGUAGE'] ?? '')) {\n        return 'en';\n    }\n    return 'ja';\n}\n```\n\n`validate.php` / `release_machine.php` / `webhook.php` / `verify_email.php` now all `require_once` this and call `resolve_request_lang()` instead of rolling their own. An English plan-name table (`PLAN_NAMES_EN`) lives in the same file, so `plan_name($code, $lang)` becomes the single point of truth.\n\nThe remaining real leaks were similar in shape: `core/license.py`, `core/key_perms.py`, the desktop launchers (`_launcher.sh` / `.ps1`), and the landing-page APIs (`checkout.php` / `chat.php` / `rate.php`). All got the same treatment — extract a small helper, branch on language, route everything through one entry point.\n\nThree principles worth keeping from this round:\n\n`lang_from_currency`, `resolve_request_lang`, `plan_name`) makes adding a new API naturally route through the same path. 
The \"uh, I forgot to branch\" failure mode becomes structurally harder.\n\nThe fear \"how much Japanese is still hardcoded in our repo?\" doesn't fully go away — but with a parallel-agents + AST pipeline in your toolkit, you can at least quantify it on demand instead of carrying it as a vague anxiety.", "url": "https://wpnews.pro/news/sweeping-i18n-leaks-with-four-parallel-ai-agents-from-300-candidates-down-to-60", "canonical_source": "https://dev.to/susumun/sweeping-i18n-leaks-with-four-parallel-ai-agents-from-300-candidates-down-to-60-real-bugs-4mgc", "published_at": "2026-06-29 00:58:53+00:00", "updated_at": "2026-06-29 01:27:25.204116+00:00", "lang": "en", "topics": ["artificial-intelligence", "developer-tools", "natural-language-processing"], "entities": ["Stripe", "AST"], "alternates": {"html": "https://wpnews.pro/news/sweeping-i18n-leaks-with-four-parallel-ai-agents-from-300-candidates-down-to-60", "markdown": "https://wpnews.pro/news/sweeping-i18n-leaks-with-four-parallel-ai-agents-from-300-candidates-down-to-60.md", "text": "https://wpnews.pro/news/sweeping-i18n-leaks-with-four-parallel-ai-agents-from-300-candidates-down-to-60.txt", "jsonld": "https://wpnews.pro/news/sweeping-i18n-leaks-with-four-parallel-ai-agents-from-300-candidates-down-to-60.jsonld"}}