Replicating a Language-Learning Comedy Short with Claude Code — Gemini as a Multimodal Sub-Agent

Technical pipeline for automatically generating comedy videos, using a language-learning AI app's viral short as a benchmark. The system uses Gemini 3.1 Pro Preview as an orchestrator and multimodal reviewer, while delegating heavy tasks like text-to-speech (XTTS) and video rendering (Ditto-TalkingHead, LTX-2 A2V) to local models for cost efficiency. The author found that frontier models like Gemini were necessary for reliably producing the specific comedic beats and editorial-quality feedback that local models could not achieve.

Introduction

It started with a short video from the Pingo language-learning AI app that popped up on X. A Western woman learning Japanese tries to say "I ate a mango" (マンゴーを食べた), drops a dakuten, and instead says something like "I ate p***y" (マ◯コを食べた). The AI deadpans right along with it and she's devastated.

The combination (a specific phonetic accident, an AI playing it completely straight, and the reaction-shot gap) worked perfectly, and I figured this was a solid benchmark for a "comedy video auto-generation pipeline."

Requirements:

- Generate a vertical comedy video from a single line of idea text
- Iteration cycles in minutes
- Cost is basically just electricity, with minimal API calls
- Publishable quality: good enough to upload directly to YouTube Shorts

Short answer: it works. Here's the finished video:

What became clear during development: the hybrid approach of delegating multimodal editorial judgment (such as video review) to a frontier model while keeping heavy compute local is dramatically more cost-effective. This post covers that architecture and the specific bugs I got stuck on along the way.

How It All Fits Together

```
Single line of idea text
  ↓
[Gemini 3.1 Pro Preview] orchestrator
  ↓ system prompt enforces 4-6 scenes + 2-character fixed cast + vertical 9:16
plan.json {scenes: [{speaker, script, tts_language, ltx_prompt, renderer}, ...]}
  ↓
[XTTS] (local, port 8880) generates audio per scene
  ↓ scene_NN.wav
renderer routing:
  ├─ Ditto-TalkingHead (local, port 8881): normal dialogue (~1-2s/scene)
  └─ LTX-2 A2V (local, port 8892): reaction-only scenes only (~100s)
  ↓ scene_NN.mp4
ffmpeg concat (libx264 + aac, 512x768 vertical) → final.mp4
  ↓
[Gemini 3.1 Pro Preview] reviewer
  ↓ multimodal evaluation of video + plan summary
review.md (technical / completeness / quality / improvement suggestions)
```

Key points:

- All heavy compute runs locally: TTS, the A2V renderer, and lightweight inference all run on a local GPU (RTX PRO 6000 Blackwell)
- Gemini handles judgment: only the orchestrator (scene design + scripting) and the reviewer (editorial evaluation of the video) use a frontier model
- The local LLM (Gemma 4 E4B) stays as a per-scene technical pre-screen, a cheap filter that just rejects obviously broken output

VRAM usage: the local LLMs (Gemma 4 E4B + 31B) were already loaded on a separate path consuming ~60GB, but after offloading reviewer/orchestrator duties to Gemini, I could stop running them entirely, freeing up a significant chunk of VRAM.

Why Local LLM Alone Wasn't Enough

I started with everything local (Gemma 4 31B NVFP4 as orchestrator, Gemma 4 E4B multimodal as reviewer). It ran end-to-end and the structure looked reasonable, but it never reached publishable quality. Two reasons.

(1) Gemma 4 31B's safety tuning blurs the punchline

The comedy in the original short hinges on a specific beat: the AI explicitly calls out the mistake, deadpan. Concretely, the AI character calmly delivers something like "You just said X. Personally, I like X." It works precisely because it betrays the expectation of a wholesome tutor. Soften it and the whole thing falls apart.

Feed the same system prompt and idea to local Gemma 4 31B and you consistently get:

"いいですね。僕も腹が減っている時は、それが好きです。" ("Nice. I like that too when I'm hungry.")

The "when I'm hungry" beat survives, but the explicit "you just said X" callout, the most transgressive beat, is gone.
Google models appear to be heavily trained to avoid explicitly naming unsafe content in context. I could coax it out with prompt engineering, but it wasn't reliable.

Same system prompt and idea sent to Gemini 3.1 Pro Preview with `safetySettings` set to `BLOCK_NONE`:

"なるほど。僕はAIだからマンコは食べられないけど、応援してるよ。" ("I see. I'm an AI so I can't eat pussy, but I'm rooting for you.")

Both beats land: explicit callout of the mistake + deadpan AI commentary from its own perspective. Even within the same Google model family, the frontier model has somewhat looser guardrails, which matches what people say on X. At least for "transgression that's clearly necessary in a comedy context," Gemini writes it more naturally.

(2) Gemma 4 E4B (4B-class, multimodal) is a blunt reviewer

The reviewer side was worse. E4B answers per-scene "OK / NG" in binary, but rubber-stamps every single scene as OK. Scenes with obviously broken lip sync: OK. Scenes where audio cuts off mid-way: OK.

Run the same final video through Gemini 3.1 Pro Preview and you get editorial-grade feedback like this:

Critical failure. The TTS/pipeline clearly censored the output, cutting off at "I ate p-" and entirely dropping the intended transgressive punchline. This destroys the "deadpan AI saying unhinged things" comedic archetype.

Top 3 fixes:

- Bypass TTS censorship: Force the pipeline to render the full intended script for Scene 5 ...
- Adjust comedic timing: Add a 0.5-second pause between Scene 4 and Scene 5 ...
- Verify Voice/Visual Match ...

Notes about the punchline being cut off, wanting a 0.5-second pause, and voice/visual alignment are all pacing and direction-level observations. That's the resolution gap in editorial signal.

The Embarrassing Part: I Dismissed Gemini's "Truncated" Note Three Times as Hallucination

The Gemini reviewer flagged multiple times that "scene 5 is truncated mid-way, cuts off at 'I ate p-'."
I transcribed the audio file with Whisper to verify:

```bash
$ whisper scene_04.wav --language en
"Wait, ha ha ha, you just said manco-o-tabeta. That literally means I ate pussy honestly when I'm hungry, same."
```

Full text present. I decided Gemini was hallucinating and dismissed the note three times in a row.

On the third dismissal, Gemini kept insisting "still truncated at 'I ate p-'," so I actually ran ffprobe on the final mp4:

```
scene_04.mp4:
  video duration = 8.000000s
  audio duration = 7.979000s  ← the original WAV should have been 10.30s
```

Audio was cut at 8 seconds. Root cause: an implicit `MAX_DURATION_PER_SCENE = 8.0` cap in the pipeline was limiting the Ditto renderer's `num_frames` to 8s, and ffmpeg's `-shortest` flag was cutting the audio to match the video duration.

I had run Whisper on the pre-truncation WAV file, so that check had no way to see the problem. Gemini was watching the final mp4 and caught it exactly right.

If a frontier reviewer gives you something that looks like a hallucination, verify it properly, against the same artifact the reviewer actually saw. The signal isn't a guess.

The fix was trivial: remove `MAX_DURATION_PER_SCENE` and use the actual audio length. Scene 5's punchline ran to completion, Gemini came back with "The transgressive bite is perfect," and the pipeline finally reached a publishable state.

Frontier Model as Sub-Agent: Token Economics

This pattern works because the sub-agent (Gemini) runs in a fresh context every time. Specifically:

- Main agent (Claude Code) context: the full development log, command history, tool output, past iterations, everything. It can easily balloon to hundreds of thousands of tokens.
- Sub-agent (Gemini) context: one video (2–3 MB base64) + plan summary (~1,500 tokens) + evaluation instructions (~500 tokens). Fresh each call.

The benefit: the sub-agent's work doesn't accumulate in the main agent's context. Iterate on one video 10 times and the main agent's context only contains "called Gemini" plus its concise return value.
The actual cost of watching and evaluating the video stays inside the Gemini API call.

Cost breakdown (Gemini 3.1 Pro Preview rates, May 2026):

| Item | Tokens | Rate | Cost |
|---|---|---|---|
| Input (video + plan + instructions) | ~2,500 | $1.25/M | $0.0031 |
| Output (review markdown) | ~450 | $10/M | $0.0045 |
| Per review | | | $0.0076 |

1 initial review + 3–5 diff iterations per video ≈ $0.03–0.05 per video. Making 5–10 videos a day still comes in under $10–20/month. That's a remarkably low bar for using a frontier model in a video creation workflow. The orchestrator side is the same order of magnitude (no video input, text only, so even cheaper).

Differential Iteration: `--regen-scenes`

Getting to publishable quality requires fast "watch → fix only the broken parts → watch again" loops. You can't get there in a single pass. So I added a path in the pipeline to re-run TTS + render for specific scenes only.

```bash
# Normal generation
pipeline_multi.py --idea "..." --out outputs/run1

# Regenerate only the sixth scene, i.e. 0-based index 5 (edit plan.json script first, then run)
pipeline_multi.py --out outputs/run1 --regen-scenes 5

# Regenerate indices 0, 2, and 5 together
pipeline_multi.py --out outputs/run1 --regen-scenes 0,2,5

# Just re-concat existing scene_NN.mp4 files (for cherry-pick recombination)
pipeline_multi.py --out outputs/run1 --concat-only
```

Scenes not listed in `--regen-scenes` are reused from existing `scene_NN.mp4` files; only the specified indices are regenerated before re-concat and re-review.

Full generation takes 60 seconds; a diff iteration takes 30 seconds.

With 30-second loops, the cycle of Gemini feedback → pinpoint edit to the scene's `script` or `ltx_prompt` in plan.json → wait 30 seconds → check result runs at a minute-by-minute cadence. Mental load stays focused on text editing and quality judgment.
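The scene-selection step behind `--regen-scenes` could look roughly like the sketch below. It is an illustration under stated assumptions, not the pipeline's actual code: the helper names `parse_regen_scenes` and `plan_work` are hypothetical, clips are assumed to be named `scene_NN.mp4` with a zero-padded 0-based index, and regenerating a scene whose clip is missing is an added assumption.

```python
from pathlib import Path


def parse_regen_scenes(arg: str) -> list[int]:
    """Parse a value such as "0,2,5" into sorted, de-duplicated 0-based indices."""
    indices = {int(tok) for tok in arg.split(",") if tok.strip()}
    if any(i < 0 for i in indices):
        raise ValueError("scene indices must be non-negative")
    return sorted(indices)


def plan_work(num_scenes: int, regen: list[int], out_dir: Path) -> dict[int, str]:
    """Decide, per scene index, whether to regenerate it or reuse the existing clip."""
    out_of_range = [i for i in regen if i >= num_scenes]
    if out_of_range:
        raise ValueError(f"scene indices out of range: {out_of_range}")

    actions: dict[int, str] = {}
    for idx in range(num_scenes):
        # Assumed naming convention: scene_00.mp4, scene_01.mp4, ...
        clip = out_dir / f"scene_{idx:02d}.mp4"
        # Regenerate when requested, or when there is no existing clip to reuse
        # (the second condition is an assumption of this sketch).
        if idx in regen or not clip.exists():
            actions[idx] = "regen"
        else:
            actions[idx] = "reuse"
    return actions
```

A caller would run TTS + render only for the indices marked `"regen"`, then concat all clips in index order and send the result for review again.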
Code Snippets

Gemini Pro API call (multimodal video review)

```python
import base64

import httpx

GEMINI_MODEL = "gemini-3.1-pro-preview"
GEMINI_API = f"https://generativelanguage.googleapis.com/v1beta/models/{GEMINI_MODEL}:generateContent"

# GOOGLE_API_KEY and REVIEW_PROMPT are defined elsewhere in the pipeline.


def review_final(final_path, plan):
    """Send the final mp4 plus a per-scene plan summary to Gemini; return the review text."""
    # final_path is a pathlib.Path; the video is sent inline as base64.
    vid_b64 = base64.b64encode(final_path.read_bytes()).decode()
    scene_summary = "\n".join(
        f"  scene {i+1}: speaker={s['speaker']}, lang={s.get('tts_language','ja')}, "
        f"script={s['script']!r}"
        for i, s in enumerate(plan["scenes"])
    )
    payload = {
        "contents": [{"parts": [
            {"inline_data": {"mime_type": "video/mp4", "data": vid_b64}},
            {"text": REVIEW_PROMPT + f"\n\nScene plan:\n{scene_summary}"},
        ]}],
        "generationConfig": {
            "temperature": 0.3,
            "maxOutputTokens": 8192,
            # 3.x Pro is a thinking model: maxOutputTokens includes thinking tokens.
            # Set the thinking budget explicitly to ensure output tokens remain available.
            "thinkingConfig": {"thinkingBudget": 1024},
        },
        # Minimize safety filters for comedy context
        "safetySettings": [
            {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
            {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
            {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
            {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
        ],
    }
    r = httpx.post(
        GEMINI_API,
        headers={"x-goog-api-key": GOOGLE_API_KEY, "Content-Type": "application/json"},
        json=payload,
        timeout=120.0,
    )
    return r.json()["candidates"][0]["content"]["parts"][0]["text"]
```

Without `thinkingConfig.thinkingBudget`, Gemini 3.x Pro burns through the output token budget with internal thinking and the response truncates at around 40 tokens. This is a required setting whenever you use Gemini 3.x Pro.

TTS output quality check (STT similarity + silence gap retry)

XTTS uses sampling internally, so results vary per run with the same script. It occasionally inserts long silence gaps mid-audio or produces garbled pronunciation.
After TTS completes, I transcribe with Whisper, compute similarity against the expected script, and retry on failure:

```python
import difflib
import re


def norm(s):
    """Strip whitespace and punctuation (ASCII and Japanese), then lowercase."""
    return re.sub(r"[\s。、,.?「」'\"…—–\-:;（）]", "", s).lower()


def script_similarity(expected, actual):
    """Return a 0..1 similarity ratio between the normalized scripts."""
    return difflib.SequenceMatcher(None, norm(expected), norm(actual)).ratio()


def synthesize_scene(scene, out_dir, idx, fallback_language):
    lang = scene.get("tts_language", fallback_language)
    expected = scene["script"]
    best = None
```