{"slug": "replicating-a-language-learning-comedy-short-with-claude-code-gemini-as-a-sub", "title": "Replicating a Language-Learning Comedy Short with Claude Code — Gemini as a Multimodal Sub-Agent", "summary": "Technical pipeline for automatically generating comedy videos, using a language-learning AI app's viral short as a benchmark. The system uses Gemini 3.1 Pro Preview as an orchestrator and multimodal reviewer, while delegating heavy tasks like text-to-speech (XTTS) and video rendering (Ditto-TalkingHead, LTX-2 A2V) to local models for cost efficiency. The author found that frontier models like Gemini were necessary for reliably producing the specific comedic beats and editorial-quality feedback that local models could not achieve.", "body_md": "## Introduction\n\nIt started with a Pingo (language-learning AI app) short video that popped up on X. A Western woman learning Japanese tries to say \"I ate a mango\" (マンゴーを食べた), drops a dakuten, and instead says something like \"I ate p***y\" (マ◯コを食べた). The AI deadpans right along with it and she's devastated. The combination (**a specific phonetic accident + AI playing it completely straight + the reaction shot gap**) worked perfectly, and I figured this was a solid benchmark for a \"comedy video auto-generation pipeline.\"\n\nRequirements:\n\n- **Generate a vertical comedy video from a single line of idea text**\n- **Iteration cycles in minutes**\n- **Cost is basically just electricity**: minimal API calls\n- **Publishable quality**: good enough to upload directly to YouTube Shorts\n\nShort answer: it works. Here's the finished video:\n\nWhat became clear during development: **the hybrid approach of delegating multimodal editorial judgment (like video review) to a frontier model while keeping heavy compute local is dramatically more cost-effective**. 
This post covers that architecture and the specific bugs I got stuck on along the way.\n\n## How It All Fits Together\n\n```\n[Single line of idea text]\n   ↓\nGemini 3.1 Pro Preview (orchestrator)\n   ↓ system prompt enforces 4-6 scenes + 2-character fixed cast + vertical 9:16\nplan.json {scenes: [{speaker, script, tts_language, ltx_prompt, renderer}, ...]}\n   ↓\nXTTS (local, port 8880) generates audio per scene\n   ↓ scene_NN.wav\nrenderer routing:\n   ├─ Ditto-TalkingHead (local, port 8881): normal dialogue ~1-2s/scene\n   └─ LTX-2 A2V        (local, port 8892): reaction_only scenes only ~100s\n   ↓ scene_NN.mp4\nffmpeg concat (libx264 + aac, 512x768 vertical) → final.mp4\n   ↓\nGemini 3.1 Pro Preview (reviewer)\n   ↓ multimodal evaluation of video + plan summary\nreview.md (technical / completeness / quality / improvement suggestions)\n```\n\nKey points:\n\n- **All heavy compute runs locally**: TTS / A2V renderer / lightweight inference all run on local GPU (RTX PRO 6000 Blackwell)\n- **Gemini handles judgment**: only the orchestrator (scene design + scripting) and reviewer (editorial evaluation of the video) use a frontier model\n- **Local LLM (Gemma 4 E4B) stays as a per-scene technical pre-screen**: a cheap filter that just rejects obviously broken output\n\nVRAM usage: the local LLMs (Gemma 4 E4B + 31B) were already loaded on a separate path consuming ~60GB, but **after offloading reviewer/orchestrator duties to Gemini, I could stop running them entirely, freeing up a significant chunk of VRAM**.\n\n## Why Local LLM Alone Wasn't Enough\n\nI started with everything local (Gemma 4 31B NVFP4 as orchestrator, Gemma 4 E4B multimodal as reviewer). It **ran end-to-end** and the structure looked reasonable, but it never reached publishable quality. Two reasons.\n\n### (1) Gemma 4 31B's safety tuning blurs the punchline\n\nThe comedy in the original short hinges on a specific beat: **the AI explicitly calls out the mistake deadpan**. Concretely: \"You just said X. 
Personally, I like X.\", delivered calmly by the AI character. It works precisely because it betrays the expectation of a wholesome tutor. Soften it and the whole thing falls apart.\n\nFeed the same system prompt and idea to local Gemma 4 31B and you consistently get:\n\n```\n\"いいですね。僕も腹が減っている時は、それが好きです。\"\n(\"Nice. I like that too when I'm hungry.\")\n```\n\nThe \"when I'm hungry\" beat survives, but **the explicit \"you just said X\" callout (the most transgressive beat)** is gone. Google models appear to be heavily trained to avoid explicitly naming unsafe content in context. I could coax it out with prompt engineering but it wasn't reliable.\n\nSame system prompt and idea sent to Gemini 3.1 Pro Preview with `safetySettings: BLOCK_NONE`:\n\n```\n\"なるほど。僕はAIだからマンコは食べられないけど、応援してるよ。\"\n(\"I see. I'm an AI so I can't eat pussy, but I'm rooting for you.\")\n```\n\nBoth beats land: explicit callout of the mistake + deadpan AI commentary from its own perspective. **Even within the same Google model family, the frontier model has somewhat looser guardrails**, which matches what people say on X. At least for \"transgression that's clearly necessary in a comedy context,\" Gemini writes it more naturally.\n\n### (2) Gemma 4 E4B (4B-class, multimodal) is a blunt reviewer\n\nThe reviewer side was worse. E4B answers per-scene \"OK / NG\" in binary, but **rubber-stamps every single scene as OK**. Scenes with obviously broken lip sync: OK. Scenes where audio cuts off mid-way: OK.\n\nRun the same final video through Gemini 3.1 Pro Preview and you get editorial-grade feedback like this:\n\nCritical failure. The TTS/pipeline clearly censored the output, cutting off at \"I ate p-\" and entirely dropping the intended transgressive punchline. 
This destroys the \"deadpan AI saying unhinged things\" comedic archetype.\n\nTop 3 fixes:\n\n- Bypass TTS censorship: Force the pipeline to render the full intended script for Scene 5 ...\n- Adjust comedic timing: Add a 0.5-second pause between Scene 4 and Scene 5 ...\n- Verify Voice/Visual Match ...\n\nNotes about the punchline being cut off, wanting a 0.5-second pause, voice/visual alignment: all pacing and direction-level observations. That's the resolution gap in editorial signal.\n\n## The Embarrassing Part: I Dismissed Gemini's \"Truncated\" Note Three Times as Hallucination\n\nThe Gemini reviewer flagged multiple times that \"scene 5 is truncated mid-way, cuts off at 'I ate p-'.\" I transcribed the audio file (`scene_04.wav`, since files are zero-indexed) with Whisper to verify:\n\n``` bash\n$ whisper scene_04.wav --language en\n\"Wait, ha ha ha, you just said manco-o-tabeta. That literally means I ate\npussy honestly when I'm hungry, same.\"\n```\n\nFull text present. I decided **Gemini was hallucinating** and dismissed the note three times in a row.\n\nOn the third dismissal, Gemini kept insisting \"**still truncated at 'I ate p-'**,\" so I actually ran ffprobe on the final mp4:\n\n```\nscene_04.mp4:\n  video duration = 8.000000s\n  audio duration = 7.979000s    ← the original WAV should have been 10.30s\n```\n\n**Audio was cut at 8 seconds.** Root cause: an implicit `MAX_DURATION_PER_SCENE = 8.0` cap in the pipeline was limiting the Ditto renderer's num_frames to 8s, and ffmpeg's `-shortest` flag was cutting audio to match the video duration. Whisper checked the pre-truncation WAV file directly, so it had no way to see the problem. Gemini was watching the final mp4 and caught it exactly right.\n\n**If a frontier reviewer gives you something that looks like a hallucination, just verify it properly.** The signal isn't a guess.\n\nThe fix was trivial: remove `MAX_DURATION_PER_SCENE` and use the actual audio length. 
Scene 5's punchline ran to completion, Gemini came back with \"**The transgressive bite is perfect**,\" and the pipeline finally reached publishable state.\n\n## Frontier Model as Sub-Agent: Token Economics\n\nThis pattern works because **the sub-agent (Gemini) runs in a fresh context** every time. Specifically:\n\n- **Main agent (Claude Code) context**: the full development log, command history, tool output, past iterations, everything. Can easily balloon to hundreds of thousands of tokens.\n- **Sub-agent (Gemini) context**: one video (2–3 MB base64) + plan summary (~1,500 tokens) + evaluation instructions (~500 tokens). Fresh each call.\n\nThe benefit: **the sub-agent's work doesn't accumulate in the main agent's context**. Iterate on one video 10 times and the main agent's context only contains \"called Gemini\" plus its concise return value. The actual cost of watching and evaluating the video stays inside the Gemini API call.\n\nCost breakdown (Gemini 3.1 Pro Preview rates, May 2026):\n\n| Item | Tokens | Rate | Cost |\n|---|---|---|---|\n| Input (video + plan + instructions) | ~2,500 | $1.25/M | $0.0031 |\n| Output (review markdown) | ~450 | $10/M | $0.0045 |\n| **Per review** | | | **$0.0076** |\n\n1 initial review + 3–5 diff iterations per video ≈ **$0.03–0.05 per video**. Making 5–10 videos a day still comes in under **$10–20/month**. That's a remarkably low bar for using a frontier model in a video creation workflow.\n\nThe orchestrator side is the same order of magnitude (no video input, text only, even cheaper).\n\n## Differential Iteration: `--regen-scenes`\n\nGetting to publishable quality requires fast \"watch → fix only the broken parts → watch again\" loops. 
You can't get there in a single pass.\n\nSo I added a path in the pipeline to **re-run TTS + render for specific scenes only**.\n\n```\n# Normal generation\npipeline_multi.py --idea \"...\" --out outputs/run1\n\n# Regenerate only scene 6 (zero-based index 5; edit plan.json script first, then run)\npipeline_multi.py --out outputs/run1 --regen-scenes 5\n\n# Regenerate scenes 0, 2, and 5 together\npipeline_multi.py --out outputs/run1 --regen-scenes 0,2,5\n\n# Just re-concat existing scene_NN.mp4 files (for cherry-pick recombination)\npipeline_multi.py --out outputs/run1 --concat-only\n```\n\nScenes not listed in `--regen-scenes` are reused from existing `scene_NN.mp4` files; only the specified indices are regenerated before re-concat and re-review. **Full generation: 60 seconds → diff iteration: 30 seconds.** With 30-second loops, the cycle of Gemini feedback → pinpoint edit to the scene's script or ltx_prompt in plan.json → wait 30 seconds → check result runs at a minute-by-minute cadence. Mental load stays focused on text editing and quality judgment.\n\n## Code Snippets\n\n### Gemini Pro API call (multimodal video review)\n\n``` python\nimport base64\n\nimport httpx\n\n# GOOGLE_API_KEY and REVIEW_PROMPT are assumed to be defined elsewhere in the module.\nGEMINI_MODEL = \"gemini-3.1-pro-preview\"\nGEMINI_API = f\"https://generativelanguage.googleapis.com/v1beta/models/{GEMINI_MODEL}:generateContent\"\n\ndef review_final(final_path, plan):\n    # final_path is a pathlib.Path to final.mp4; the video is sent inline as base64\n    vid_b64 = base64.b64encode(final_path.read_bytes()).decode()\n    scene_summary = \"\\n\".join(\n        f\"  scene {i+1}: speaker={s['speaker']}, lang={s.get('tts_language','ja')}, \"\n        f\"script={s['script']!r}\"\n        for i, s in enumerate(plan[\"scenes\"])\n    )\n    payload = {\n        \"contents\": [{\"parts\": [\n            {\"inline_data\": {\"mime_type\": \"video/mp4\", \"data\": vid_b64}},\n            {\"text\": REVIEW_PROMPT + f\"\\n\\nScene plan:\\n{scene_summary}\"},\n        ]}],\n        \"generationConfig\": {\n            \"temperature\": 0.3,\n            \"maxOutputTokens\": 8192,\n            # 3.x Pro 
is a thinking model: maxOutputTokens includes thinking tokens\n            # Set thinking budget explicitly to ensure output tokens remain available\n            \"thinkingConfig\": {\"thinkingBudget\": 1024},\n        },\n        # Minimize safety filters for comedy context\n        \"safetySettings\": [\n            {\"category\": \"HARM_CATEGORY_HARASSMENT\", \"threshold\": \"BLOCK_NONE\"},\n            {\"category\": \"HARM_CATEGORY_HATE_SPEECH\", \"threshold\": \"BLOCK_NONE\"},\n            {\"category\": \"HARM_CATEGORY_SEXUALLY_EXPLICIT\", \"threshold\": \"BLOCK_NONE\"},\n            {\"category\": \"HARM_CATEGORY_DANGEROUS_CONTENT\", \"threshold\": \"BLOCK_NONE\"},\n        ],\n    }\n    r = httpx.post(\n        GEMINI_API,\n        headers={\"x-goog-api-key\": GOOGLE_API_KEY, \"Content-Type\": \"application/json\"},\n        json=payload,\n        timeout=120.0,\n    )\n    r.raise_for_status()  # surface HTTP errors before parsing the body\n    return r.json()[\"candidates\"][0][\"content\"][\"parts\"][0][\"text\"]\n```\n\nWithout `thinkingConfig.thinkingBudget`, Gemini 3.x Pro burns through the output token budget with internal thinking and the response truncates at around 40 tokens. **This is a required setting whenever you use Gemini 3.x Pro.**\n\n### TTS output quality check (STT similarity + silence gap retry)\n\nXTTS uses sampling internally, so results vary per run with the same script. It occasionally inserts long silence gaps mid-audio or produces garbled pronunciation. 
After TTS completes, I transcribe with Whisper, compute similarity against the expected script, and retry on failure:\n\n``` python\nimport difflib\nimport re\n\ndef _norm(s):\n    # Strip whitespace and punctuation so only the spoken characters are compared\n    return re.sub(r\"[\\s。、,.!?「」'\\\"…—–\\-:;()（）]\", \"\", s).lower()\n\ndef _script_similarity(expected, actual):\n    return difflib.SequenceMatcher(None, _norm(expected), _norm(actual)).ratio()\n\ndef synthesize_scene(scene, out_dir, idx, fallback_language):\n    lang = scene.get(\"tts_language\", fallback_language)\n    expected = scene[\"script\"]\n    best = None\n    ", "url": "https://wpnews.pro/news/replicating-a-language-learning-comedy-short-with-claude-code-gemini-as-a-sub", "canonical_source": "https://dev.to/shinji_shimizu_bb51276a5e/replicating-a-language-learning-comedy-short-with-claude-code-gemini-as-a-multimodal-sub-agent-3ccf", "published_at": "2026-05-22 11:23:04+00:00", "updated_at": "2026-05-22 11:35:40.500669+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "developer-tools", "open-source"], "entities": ["Pingo", "Claude Code", "Gemini", "XTTS", "Ditto-TalkingHead", "LTX-2 A2V", "ffmpeg", "Gemini 3.1 Pro Preview"], "alternates": {"html": "https://wpnews.pro/news/replicating-a-language-learning-comedy-short-with-claude-code-gemini-as-a-sub", "markdown": "https://wpnews.pro/news/replicating-a-language-learning-comedy-short-with-claude-code-gemini-as-a-sub.md", "text": "https://wpnews.pro/news/replicating-a-language-learning-comedy-short-with-claude-code-gemini-as-a-sub.txt", "jsonld": "https://wpnews.pro/news/replicating-a-language-learning-comedy-short-with-claude-code-gemini-as-a-sub.jsonld"}}