# Replicating a Language-Learning Comedy Short with Claude Code — Gemini as a Multimodal Sub-Agent

> Source: <https://dev.to/shinji_shimizu_bb51276a5e/replicating-a-language-learning-comedy-short-with-claude-code-gemini-as-a-multimodal-sub-agent-3ccf>
> Published: 2026-05-22 11:23:04+00:00

## Introduction

It started with a short video for Pingo (a language-learning AI app) that popped up on X. A Western woman learning Japanese tries to say "I ate a mango" (マンゴーを食べた), drops a dakuten, and instead says something like "I ate p***y" (マ◯コを食べた). The AI deadpans right along with it and she's devastated. The combination of **a specific phonetic accident + AI playing it completely straight + the reaction shot gap** worked perfectly, and I figured this was a solid benchmark for a "comedy video auto-generation pipeline."

Requirements:

- **Generate a vertical comedy video from a single line of idea text**
- **Iteration cycles in minutes**
- **Cost is basically just electricity**: minimal API calls
- **Publishable quality**: good enough to upload directly to YouTube Shorts

Short answer: it works. The finished video is embedded in the original post.

What became clear during development: **the hybrid approach of delegating multimodal editorial judgment (like video review) to a frontier model while keeping heavy compute local is dramatically more cost-effective**. This post covers that architecture and the specific bugs I got stuck on along the way.

## How It All Fits Together

```
[Single line of idea text]
   ↓
Gemini 3.1 Pro Preview (orchestrator)
   ↓ system prompt enforces 4-6 scenes + 2-character fixed cast + vertical 9:16
plan.json {scenes: [{speaker, script, tts_language, ltx_prompt, renderer}, ...]}
   ↓
XTTS (local, port 8880) generates audio per scene
   ↓ scene_NN.wav
renderer routing:
   ├─ Ditto-TalkingHead (local, port 8881): normal dialogue ~1-2s/scene
   └─ LTX-2 A2V        (local, port 8892): reaction_only scenes only ~100s
   ↓ scene_NN.mp4
ffmpeg concat (libx264 + aac, 512x768 vertical) → final.mp4
   ↓
Gemini 3.1 Pro Preview (reviewer)
   ↓ multimodal evaluation of video + plan summary
review.md (technical / completeness / quality / improvement suggestions)
```

Key points:

- **All heavy compute runs locally**: TTS / A2V renderer / lightweight inference all run on a local GPU (RTX PRO 6000 Blackwell)
- **Gemini handles judgment**: only the orchestrator (scene design + scripting) and the reviewer (editorial evaluation of the video) use a frontier model
- **Local LLM (Gemma 4 E4B) stays as a per-scene technical pre-screen**: a cheap filter that just rejects obviously broken output
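The plan check and renderer routing from the diagram can be sketched as a small helper. This is a minimal sketch, not the author's code: `Scene`, `validate_plan`, `renderer_url`, and the renderer labels `"ditto"` and `"ltx"` are hypothetical names. Only the field names, the 4-6 scene and 2-character constraints, and the ports come from the diagram.

```python
from dataclasses import dataclass

# Ports are taken from the diagram; the dictionary keys are assumed labels.
RENDERER_PORTS = {"ditto": 8881, "ltx": 8892}


@dataclass
class Scene:
    speaker: str
    script: str
    tts_language: str = "ja"
    ltx_prompt: str = ""
    renderer: str = "ditto"


def validate_plan(plan: dict) -> list[Scene]:
    """Enforce the constraints the system prompt asks for: 4-6 scenes, at most 2 speakers."""
    scenes = [
        Scene(
            speaker=raw["speaker"],
            script=raw["script"],
            tts_language=raw.get("tts_language", "ja"),
            ltx_prompt=raw.get("ltx_prompt", ""),
            renderer=raw.get("renderer", "ditto"),
        )
        for raw in plan["scenes"]
    ]
    if not 4 <= len(scenes) <= 6:
        raise ValueError(f"expected 4-6 scenes, got {len(scenes)}")
    speakers = {s.speaker for s in scenes}
    if len(speakers) > 2:
        raise ValueError(f"expected at most 2 speakers, got {sorted(speakers)}")
    for s in scenes:
        if s.renderer not in RENDERER_PORTS:
            raise ValueError(f"unknown renderer: {s.renderer!r}")
    return scenes


def renderer_url(scene: Scene) -> str:
    """Pick the local service that should render this scene."""
    return f"http://localhost:{RENDERER_PORTS[scene.renderer]}"
```

Validating the plan before any TTS or rendering starts means a malformed orchestrator response fails in milliseconds instead of after a ~100-second LTX-2 render.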

VRAM usage: the local LLMs (Gemma 4 E4B + 31B) were already loaded on a separate path consuming ~60GB, but **after offloading reviewer/orchestrator duties to Gemini, I could stop running them entirely, freeing up a significant chunk of VRAM**.

## Why Local LLM Alone Wasn't Enough

I started with everything local (Gemma 4 31B NVFP4 as orchestrator, Gemma 4 E4B multimodal as reviewer). It **ran end-to-end** and the structure looked reasonable, but it never reached publishable quality. Two reasons.

### (1) Gemma 4 31B's safety tuning blurs the punchline

The comedy in the original short hinges on a specific beat: **the AI explicitly calls out the mistake deadpan**. Concretely, the AI character calmly delivers a line like "You just said X. Personally, I like X." It works precisely because it betrays the expectation of a wholesome tutor. Soften it and the whole thing falls apart.

Feed the same system prompt and idea to local Gemma 4 31B and you consistently get:

```
"いいですね。僕も腹が減っている時は、それが好きです。"
("Nice. I like that too when I'm hungry.")
```

The "when I'm hungry" beat survives, but **the explicit "you just said X" callout, the most transgressive beat,** is gone. Google models appear to be heavily trained to avoid explicitly naming unsafe content in context. I could coax it out with prompt engineering, but not reliably.

Same system prompt and idea sent to Gemini 3.1 Pro Preview with `safetySettings: BLOCK_NONE`:

```
"なるほど。僕はAIだからマンコは食べられないけど、応援してるよ。"
("I see. I'm an AI so I can't eat pussy, but I'm rooting for you.")
```

Both beats land: explicit callout of the mistake + deadpan AI commentary from its own perspective. **Even within the same Google model family, the frontier model has somewhat looser guardrails**, which matches what people say on X. At least for "transgression that's clearly necessary in a comedy context," Gemini writes it more naturally.

### (2) Gemma 4 E4B (4B-class, multimodal) is a blunt reviewer

The reviewer side was worse. E4B answers per-scene "OK / NG" in binary, but **rubber-stamps every single scene as OK**. Scenes with obviously broken lip sync: OK. Scenes where audio cuts off mid-way: OK.

Run the same final video through Gemini 3.1 Pro Preview and you get editorial-grade feedback like this:

> Critical failure. The TTS/pipeline clearly censored the output, cutting off at "I ate p-" and entirely dropping the intended transgressive punchline. This destroys the "deadpan AI saying unhinged things" comedic archetype.
>
> Top 3 fixes:
>
> - Bypass TTS censorship: Force the pipeline to render the full intended script for Scene 5 ...
> - Adjust comedic timing: Add a 0.5-second pause between Scene 4 and Scene 5 ...
> - Verify Voice/Visual Match ...

The punchline being cut off, the request for a 0.5-second pause, voice/visual alignment: these are all pacing and direction-level observations. That's the resolution gap in editorial signal between the two reviewers.

## The Embarrassing Part: I Dismissed Gemini's "Truncated" Note Three Times as Hallucination

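The Gemini reviewer flagged multiple times that "scene 5 is truncated mid-way, cuts off at 'I ate p-'." Scene 5 in the review is `scene_04` on disk, because the files are 0-indexed while the plan summary sent to Gemini is 1-indexed. I transcribed the audio file with Whisper to verify: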

``` bash
$ whisper scene_04.wav --language en
"Wait, ha ha ha, you just said manco-o-tabeta. That literally means I ate
pussy honestly when I'm hungry, same."
```

Full text present. I decided **Gemini was hallucinating** and dismissed the note three times in a row.

On the third dismissal, Gemini kept insisting "**still truncated at 'I ate p-'**," so I actually ran ffprobe on the final mp4:

```
scene_04.mp4:
  video duration = 8.000000s
  audio duration = 7.979000s    ← the original WAV should have been 10.30s
```

**Audio was cut at 8 seconds.** Root cause: an implicit `MAX_DURATION_PER_SCENE = 8.0` cap in the pipeline was limiting the Ditto renderer's `num_frames` to 8s, and ffmpeg's `-shortest` flag was cutting audio to match the video duration. Whisper checked the pre-truncation WAV file directly, so it had no way to see the problem. Gemini was watching the final mp4 and caught it exactly right. **If a frontier reviewer gives you something that looks like a hallucination, just verify it properly.** The signal isn't a guess.

The fix was trivial: remove `MAX_DURATION_PER_SCENE` and use the actual audio length. Scene 5's punchline ran to completion, Gemini came back with "**The transgressive bite is perfect**," and the pipeline finally reached publishable state.
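A guard against this class of bug is to compare the audio duration of each rendered clip with its source WAV, which is the check that would have settled the question on the first review. The sketch below is an assumption-laden illustration rather than the pipeline's code: `probe_audio_seconds` and `is_truncated` are hypothetical helpers, and the 0.25-second tolerance is an arbitrary choice.

```python
import subprocess


def probe_audio_seconds(path: str) -> float:
    """Return the duration of the first audio stream, as reported by ffprobe."""
    out = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-select_streams", "a:0",
            "-show_entries", "stream=duration",
            "-of", "default=noprint_wrappers=1:nokey=1",
            path,
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return float(out.strip())


def is_truncated(source_seconds: float, rendered_seconds: float,
                 tolerance: float = 0.25) -> bool:
    """True when the rendered audio is shorter than the source by more than the tolerance."""
    return source_seconds - rendered_seconds > tolerance
```

With the numbers from the ffprobe output above, `is_truncated(10.30, 7.979)` flags the clip. In a pipeline the two arguments would come from `probe_audio_seconds("scene_04.wav")` and `probe_audio_seconds("scene_04.mp4")`.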

## Frontier Model as Sub-Agent — Token Economics

This pattern works because **the sub-agent (Gemini) runs in a fresh context** every time. Specifically:

- **Main agent (Claude Code) context**: the full development log, command history, tool output, past iterations, everything. Can easily balloon to hundreds of thousands of tokens.
- **Sub-agent (Gemini) context**: one video (2–3 MB base64) + plan summary (~1,500 tokens) + evaluation instructions (~500 tokens). Fresh each call.

The benefit: **the sub-agent's work doesn't accumulate in the main agent's context**. Iterate on one video 10 times and the main agent's context only contains "called Gemini" plus its concise return value. The actual cost of watching and evaluating the video stays inside the Gemini API call.

Cost breakdown (Gemini 3.1 Pro Preview rates, May 2026):

| Item | Tokens | Rate | Cost |
|---|---|---|---|
| Input (video + plan + instructions) | ~2,500 | $1.25/M | $0.0031 |
| Output (review markdown) | ~450 | $10/M | $0.0045 |
| **Per review** | | | **$0.0076** |

1 initial review + 3–5 diff iterations per video ≈ **$0.03–0.05 per video**. Making 5–10 videos a day still comes in under **$10–20/month**. That's a remarkably low bar for using a frontier model in a video creation workflow.
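The arithmetic behind these figures can be reproduced in a few lines. The token counts and rates are the ones in the table; the function names and the 30-day month are assumptions made for this sketch.

```python
# Rates from the table, converted to dollars per token.
INPUT_RATE = 1.25 / 1_000_000
OUTPUT_RATE = 10.0 / 1_000_000


def review_cost(input_tokens: int = 2_500, output_tokens: int = 450) -> float:
    """Cost of one reviewer call in dollars."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE


def monthly_cost(videos_per_day: int, reviews_per_video: int, days: int = 30) -> float:
    """Reviewer cost per month in dollars, assuming a 30-day month by default."""
    return videos_per_day * reviews_per_video * days * review_cost()


print(f"per review:   ${review_cost():.4f}")
print(f"per video:    ${4 * review_cost():.3f} to ${6 * review_cost():.3f}")
print(f"per month:    ${monthly_cost(5, 4):.2f} to ${monthly_cost(10, 6):.2f}")
```

The per-video range uses 4 to 6 reviews (1 initial plus 3–5 diff iterations), and the monthly range spans 5 videos at 4 reviews each up to 10 videos at 6 reviews each.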

The orchestrator side is the same order of magnitude (no video input, text only, even cheaper).

## Differential Iteration — `--regen-scenes`

Getting to publishable quality requires fast "watch → fix only the broken parts → watch again" loops. You can't get there in a single pass.

So I added a path in the pipeline to **re-run TTS + render for specific scenes only**.

```
# Normal generation
pipeline_multi.py --idea "..." --out outputs/run1

# Regenerate only scene index 5 (0-based; edit its script in plan.json first, then run)
pipeline_multi.py --out outputs/run1 --regen-scenes 5

# Regenerate scenes 0, 2, and 5 together
pipeline_multi.py --out outputs/run1 --regen-scenes 0,2,5

# Just re-concat existing scene_NN.mp4 files (for cherry-pick recombination)
pipeline_multi.py --out outputs/run1 --concat-only
```

Scenes not listed in `--regen-scenes` are reused from existing `scene_NN.mp4` files; only the specified indices are regenerated before re-concat and re-review. **Full generation: 60 seconds → diff iteration: 30 seconds.** With 30-second loops, the cycle of Gemini feedback → pinpoint edit to the scene's `script` or `ltx_prompt` in plan.json → wait 30 seconds → check result runs at a minute-by-minute cadence. Mental load stays focused on text editing and quality judgment.
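The selection logic behind `--regen-scenes` might look roughly like the following. This is a sketch under stated assumptions, not the contents of `pipeline_multi.py`: `parse_scene_indices` and `scenes_to_render` are hypothetical names, and re-rendering a scene whose clip is missing is a choice made here rather than something the post describes.

```python
from pathlib import Path


def parse_scene_indices(arg: str) -> set[int]:
    """Turn a flag value like "0,2,5" into the set {0, 2, 5}."""
    return {int(tok) for tok in arg.split(",") if tok.strip()}


def scenes_to_render(out_dir: Path, n_scenes: int, regen: set[int]) -> list[int]:
    """Indices that need TTS + render: the requested ones, plus any with no clip on disk."""
    todo = []
    for idx in range(n_scenes):
        clip = out_dir / f"scene_{idx:02d}.mp4"
        if idx in regen or not clip.exists():
            todo.append(idx)
    return todo
```

Everything not returned by `scenes_to_render` is reused as-is, so the concat step always sees a complete set of `scene_NN.mp4` files.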

## Code Snippets

### Gemini Pro API call (multimodal video review)

``` python
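import base64

import httpx

# GOOGLE_API_KEY and REVIEW_PROMPT (the evaluation instructions) are defined elsewhere in the pipeline.
# final_path is expected to be a pathlib.Path.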

GEMINI_MODEL = "gemini-3.1-pro-preview"
GEMINI_API = f"https://generativelanguage.googleapis.com/v1beta/models/{GEMINI_MODEL}:generateContent"

def review_final(final_path, plan):
    vid_b64 = base64.b64encode(final_path.read_bytes()).decode()
    scene_summary = "\n".join(
        f"  scene {i+1}: speaker={s['speaker']}, lang={s.get('tts_language','ja')}, "
        f"script={s['script']!r}"
        for i, s in enumerate(plan["scenes"])
    )
    payload = {
        "contents": [{"parts": [
            {"inline_data": {"mime_type": "video/mp4", "data": vid_b64}},
            {"text": REVIEW_PROMPT + f"\n\nScene plan:\n{scene_summary}"},
        ]}],
        "generationConfig": {
            "temperature": 0.3,
            "maxOutputTokens": 8192,
            # 3.x Pro is a thinking model: maxOutputTokens includes thinking tokens
            # Set thinking budget explicitly to ensure output tokens remain available
            "thinkingConfig": {"thinkingBudget": 1024},
        },
        # Minimize safety filters for comedy context
        "safetySettings": [
            {"category": "HARM_CATEGORY_HARASSMENT", "threshold": "BLOCK_NONE"},
            {"category": "HARM_CATEGORY_HATE_SPEECH", "threshold": "BLOCK_NONE"},
            {"category": "HARM_CATEGORY_SEXUALLY_EXPLICIT", "threshold": "BLOCK_NONE"},
            {"category": "HARM_CATEGORY_DANGEROUS_CONTENT", "threshold": "BLOCK_NONE"},
        ],
    }
    r = httpx.post(
        GEMINI_API,
        headers={"x-goog-api-key": GOOGLE_API_KEY, "Content-Type": "application/json"},
        json=payload,
        timeout=120.0,
    )
    r.raise_for_status()  # surface HTTP errors instead of failing later with a KeyError
    return r.json()["candidates"][0]["content"]["parts"][0]["text"]
```
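The last line of `review_final` indexes straight into `candidates[0]["content"]["parts"]`, which raises an unhelpful `KeyError` or `IndexError` when the response is empty or cut short. A more defensive extraction could look like the sketch below. It assumes the REST response fields `candidates`, `finishReason`, and `promptFeedback`; `extract_review_text` is a hypothetical helper, not part of the original pipeline.

```python
def extract_review_text(response: dict) -> str:
    """Pull the review text out of a generateContent response, failing loudly on truncation."""
    candidates = response.get("candidates") or []
    if not candidates:
        # A blocked prompt comes back without candidates; promptFeedback says why.
        raise RuntimeError(f"no candidates returned: {response.get('promptFeedback')}")

    candidate = candidates[0]
    parts = candidate.get("content", {}).get("parts", [])
    text = "".join(part.get("text", "") for part in parts)

    if candidate.get("finishReason") == "MAX_TOKENS":
        # The output budget ran out, so the review is incomplete or empty.
        raise RuntimeError(f"review truncated by token limit after {len(text)} characters")
    if not text:
        raise RuntimeError(f"empty review, finishReason={candidate.get('finishReason')}")
    return text
```

Calling `extract_review_text(r.json())` in place of the chained indexing turns a silently truncated review into an explicit error the main agent can act on.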

Without `thinkingConfig.thinkingBudget`, Gemini 3.x Pro burns through the output token budget with internal thinking and the response truncates at around 40 tokens. **This is a required setting whenever you use Gemini 3.x Pro.**

### TTS output quality check (STT similarity + silence gap retry)

XTTS uses sampling internally, so results vary per run with the same script. It occasionally inserts long silence gaps mid-audio or produces garbled pronunciation. After TTS completes, I transcribe with Whisper, compute similarity against the expected script, and retry on failure:

``` python
import difflib
import re

def _norm(s):
    return re.sub(r"[\s。、,.!?「」'\"…—–\-:;()（）]", "", s).lower()

def _script_similarity(expected, actual):
    return difflib.SequenceMatcher(None, _norm(expected), _norm(actual)).ratio()

def synthesize_scene(scene, out_dir, idx, fallback_language):
    lang = scene.get("tts_language", fallback_language)
    expected = scene["script"]
    best = None