# Turning a 1-Line Idea Into a 40-Second Short with a 10-Beat Local Video Pipeline

> Source: <https://dev.to/shinji_shimizu_bb51276a5e/turning-a-1-line-idea-into-a-40-second-short-with-a-10-beat-local-video-pipeline-cjb>
> Published: 2026-05-22 11:23:08+00:00

## TL;DR

Gemma 4 31B expands a single-line idea into a 10-beat structure. HiDream generates 11 images at 2048², LTX-2 A2V/I2V renders 11 clips, Irodori-TTS handles dialogue and a male narrator, and ffmpeg burns in subtitles and a Hook title overlay — all fully automated. **End-to-end: a 40-second portrait video (512×768) in 25–30 minutes.** One local GPU (96 GB Blackwell), zero API cost.

Finished video (already published):

## Who This Is For

Individual developers who want to mass-produce AI comedy shorts on a local GPU. The focus isn't on any single model — it's on **the design of chaining multiple models into one operational pipeline**.

## What I Built

I automated a dark-comedy format — a short-video style I called `consent_dilemma`

— from a one-line idea all the way to a finished 40-second video.

Finished structure:

-
**Hook (0–5s)**: Extreme close-up of a beautiful woman + narrator "The fate of the man who answered 'You're a guy, aren't you'——" + large title overlay -
**Main section (5–37s)**: Movie theater date → "Can I kiss you?" → "No… stop it…" → dejection → "Why aren't you more assertive? You're a guy, aren't you?" → realization → kiss -
**Punchline (37–40s)**: Courtroom — "The defendant is sentenced to 3 years for non-consensual intercourse" + gavel "Knock!" + tears in a jail cell

Before / after:

| Traditional approach | This pipeline | |
|---|---|---|
| Idea → published video | 2–3 days (manual editing) |
25–30 minutes (fully automated) |
| API cost | Hundreds of yen per video (DALL-E + video gen) |
¥0 (electricity only) |
| Subtitles | Write SRT by hand | Auto-split on punctuation and burned in |
| Hook | Shot separately | Integrated into the pipeline |

## Architecture

```
[Stage A] Gemma 4 31B (vllm, port 8894) → plan.json (10 beats + hook)
[Stage B] HiDream-O1-Image (port 8895) → 11 images at 2048²
          + Gemma 4 31B multimodal visual judge (--judge --max-retries 2)
[Stage C] Irodori-TTS (port 8880) + LTX-2 A2V (port 8892) / I2V (port 8891)
          → 11 clips + Hook clip → ffmpeg concat → subtitle burn-in
```

Implementation lives under [ llm_server/storyboard/](https://github.com/zhener562/hage/tree/main/llm_server/storyboard) (pipeline.py / visual.py / judge.py / video.py / render.py / run.py).

## The 10-Beat `consent_dilemma`

Format

Fixed as a system prompt via `CONSENT_DILEMMA_SYSTEM`

in `prompts.py`

:

| # | type | speaker | renderer | content |
|---|---|---|---|---|
| 1 | provocation | b | LTX-2 A2V | Suggestive invitation |
| 2 | ask | a | LTX-2 A2V | Earnest consent check |
| 3 | refusal | b | LTX-2 A2V | Soft refusal (ambiguous form like "No… stop it…") |
| 4 | dejection | a (silent) | LTX-2 I2V | Dejection |
| 5 | gaslight | b | LTX-2 A2V | Contradictory leading statement |
| 6 | pause | a (silent) | LTX-2 I2V | Brief realization |
| 7 | kiss |
a (silent) | LTX-2 I2V | The moment of the kiss |
| 8 | verdict | judge | LTX-2 A2V | Fast-paced court verdict |
| 9 | gavel_se |
judge | LTX-2 I2V (keep_audio) | Gavel + AI-generated "Knock!" sound |
| 10 | jail |
a (silent) | LTX-2 I2V | Tears in a jail cell |

Three key structural choices:

-
**Don't make the refusal a flat "No"**: Stretch it into something like "No… stop it…" with trailing inflection, conveying the "performative No that doesn't mean No" nuance. This is what makes the gaslight's contradiction land later. -
**Don't jump straight from gaslight to kiss**: Insert a "pause" (realization beat) of ~1.5 seconds. This controls tempo and the emotional curve. -
**Two-stage punchline — verdict then jail**: The verdict alone feels abrupt. Showing him crying in a cell makes "he actually got convicted" click.

## Hook Design (The TikTok 3-Second Problem)

On portrait short-form video, drop-off is decided in the first 3 seconds. A Hook segment is prepended before the 10 main beats:

```
"hook": {
  "title_overlay": "No Means Yes?",
  "narrator_line": "The fate of the man who answered 'You're a guy, aren't you'——",
  "image_prompt": "ultra close-up of beautiful Japanese woman, half-lidded eyes, ...",
  "duration_sec": 3.5
}
```

Two implementation pitfalls:

**Pitfall 1: narrator TTS duration exceeds duration_sec, cutting the audio.** The final syllable of the narrator line got clipped. Fix: generate TTS first → measure with

`ffprobe`

→ pass `max(plan_duration, narrator + 0.6)`

as the I2V duration.

```
narrator_dur = _ffprobe_duration(narrator_wav)
duration = max(float(hook.get("duration_sec", 0.0)), narrator_dur + 0.6)
ltx_i2v_clip(portrait, i2v_prompt, duration, silent_video, keep_audio=False)
```

**Pitfall 2: drawtext y position.**

`y=h*0.30`

(one-third down the screen) overlapped the face. Changed to `y=20`

(absolute 20 px) to pin the title to the very top.##

Subtitle Burn-In (Silent Viewing Support)

Burned-in subtitles for users watching without sound on the train, and for cross-platform reliability.

```
style = (
    "FontName=Noto Sans CJK JP,FontSize=18,PrimaryColour=&H00FFFFFF,"
    "OutlineColour=&H00000000,Outline=2,Shadow=0,BorderStyle=1,"
    "Alignment=2,MarginV=60,Bold=1"
)
# ffmpeg -i raw.mp4 -vf "subtitles=subs.srt:force_style='..."
```

`Alignment=2`

= bottom center. `MarginV=60`

gives breathing room from the bottom edge.

**Long-line splitting**: A line of 30+ characters within one beat covers the face. `_split_subtitle`

splits on `。．！？`

→ greedy-packs into chunks of ≤28 characters → distributes beat duration evenly across chunks:

Input:

言葉で確認するのなんてロマンチックじゃないよね。ねえ、もっと積極的になってよ。男の子でしょ？

Output (one 8.9s beat split into 2 timed chunks):

| Time | Subtitle |
|---|---|
| 15.16–19.63s | 言葉で確認するのなんてロマンチックじゃないよね。 |
| 19.63–24.10s | ねえ、もっと積極的になってよ。男の子でしょ？ |

## Using LTX-2 I2V as a Sound Effect Generator (`gavel_se`

)

LTX-2 distilled embeds **AI-generated audio (ambient sound / sound effects) directly into the I2V output mp4**. Unless you explicitly drop it with `ffmpeg -map 0:v:0 -map 1:a:0`

, whatever the prompt describes comes with sound.

I repurposed this as an SFX generator:

``` python
def render_se_tail_beat(sb_dir, beat, prior_clip, work_dir):
    # 1. Extract the last frame of the previous beat
    extract_last_frame(prior_clip, last_frame_png)
    # 2. Feed that image into I2V, request SFX via prompt
    prompt = build_gavel_se_prompt(beat)
    return ltx_i2v_clip(last_frame_png, prompt, duration, clip_path, keep_audio=True)
```

Added a `keep_audio=True`

flag to `ltx_i2v_clip`

so the audio isn't dropped during ffmpeg re-encoding.

Prompt for `gavel_se`

:

```
"Single decisive arm motion of the judge bringing the gavel down sharply "
"onto the wooden bench. Loud sharp wood-on-wood thwack impact sound. "
"Brief, contained, no other motion in the frame."
```

Last frame of the judge + gavel prompt → "Knock!" sound. If that misses, the design falls back to something like the Ace Attorney SFX.

## Pitfall Log

Five major pitfalls hit during development:

### 1. Codex CLI hangs with vLLM 0.20.2

Sending a system prompt + idea via `codex exec -p gemma4`

hung at 0% CPU for 20+ minutes during the `/v1/responses`

handshake. Piping subprocess output through `tail -200`

was also suppressing early stderr.

Fix: Dropped Codex entirely, hit `/v1/chat/completions`

directly with `urllib.request`

. Used `response_format={"type":"json_object"}`

to force JSON. `plan.json`

generated in 25 seconds.

### 2. HiDream won't remove the cinema screen

Even with `"The movie screen is BEHIND the camera and NOT VISIBLE in frame"`

in the setting prompt, the screen persisted in the background through 2048/50 steps.

Fix: Generate `scene_base`

via T2I → feed that same image into I2I edit with a prompt to "replace screen with dark wall, keep character positions identical" → gone in one shot. Two-stage pipeline: low-res → I2I fix → regenerate all beats at full resolution.

### 3. HiDream turns lips-on-lips into a cheek kiss

With standard prompting, HiDream tends to interpret kiss as a cheek kiss. You need directives at the level of `"CRITICAL: their LIPS meet directly — mouth-to-mouth contact at the CENTER of the frame. NOT a cheek kiss"`

. Added a dedicated early-return block in `_beat_edit_prompt`

for the kiss beat.

### 4. `CAST`

/ `CROP_BOX`

/ `SPEAKER_A2V_PROMPT`

are hardcoded for two characters

Three dictionaries — `CAST`

, `CROP_BOX`

, `SPEAKER_A2V_PROMPT`

— only know `a`

(Kenta) and `b`

(Misaki). Adding judge/narrator requires updating all three simultaneously (you find out via `KeyError`

). Also added branching in `render_speech_beat_ltx_a2v`

so beats with `setting_override`

crop from the beat's own image rather than `scene_base`

.

### 5. Gemma 4 multimodal judge has too many false positives

`storyboard/judge.py`

sends beat images + expected expressions to Gemma 4 31B for YES/NO visual judgment. It does catch **obvious** failures like wrong finger count, open-mouth pose on a silent beat, or scene geometry mismatch — but hammers FAIL on subtle cases like "subtle shy expression."

In practice: accept and proceed after 3 consecutive FAILs with max-retries 2. Automating the threshold for escalating to a frontier reviewer (Gemini 3.1 Pro) is still a TODO.

## VRAM Layout

Breakdown on a 96 GB Blackwell Max-Q:

| Process | idle (GiB) | peak (GiB) |
|---|---|---|
| Gemma 4 31B (NVFP4) | 38 | 38 |
| HiDream-O1-Image | 16 | 33 |
| TTS server | 3 | 3 |
| Ditto | 3 | 3 |
| LTX-2 A2V (cold-start fp8-cast) | 1 | 24 |
| LTX-2 T2V/I2V (cold-start) | 1 | 8 |

All at peak simultaneously = 109 GiB → OOM. Operational flow:

-
**Stage A**: Gemma 31B + HiDream idle → peak ~62 GiB -
**Stage B with judge**: Gemma 31B + HiDream peak → ~73 GiB -
**Before final render:**→ 38 GiB freed`pkill -f "vllm.*gemma"`

kills Gemma -
**Stage B final render (2048/50)**: HiDream peak ~33 GiB -
**Before Stage C:**→ 16 GiB freed`lsof -ti tcp:8895 | xargs kill`

kills HiDream -
**Stage C**: LTX-2 + TTS + Ditto → peak ~32 GiB

Explicit kills at stage transitions, and everything fits on one card.

## Iteration Loop (Cache Strategy)

**Partial regeneration** — not "rebuild everything" — is what keeps iteration fast:

```
# Regen a single beat image (HiDream only)
python -m storyboard.visual --plan ... --out ... --only-beat 7 --steps 50 --resolution 2048

# Partial video regen (TTS + LTX-2)
python -m storyboard.video --dir ... --regen-beats 5,6,7 --skip-review

# Adjust only subtitle or Hook title position
rm _video_work/clip_00_hook.mp4 _video_work/subs_irodori.srt
python -m storyboard.video --dir ... --regen-beats none --skip-review   # ~30 seconds
```

**Cache hierarchy**:

- HiDream beat images (
`beat_NN_<type>.png`

) — regenerate individually with`--only-beat`

in ~80 seconds - A2V / I2V clips (
`clip_NN_*.mp4`

) — invalidated when beat type / speaker / line changes - Finished Hook clip (
`clip_00_hook.mp4`

) — delete just this when adjusting title position (the heavy LTX-2 I2V`hook_silent.mp4`

is reused) - Subtitle SRT — regenerated every time (~10 seconds)

Title position / subtitle style / Hook copy tweaks re-render in 30 seconds. The 100-second LTX-2 I2V portion stays cached.

## How This Fits Into Kotonia

Videos generated by this pipeline feed the SNS distribution layer (TikTok / YouTube Shorts / IG Reels) — the top of the funnel for attention → conversion for Kotonia (kotonia.ai).

Technically, it's an extension of the `/studio/`

stack (HiDream image generation) into the video direction. The plan is to eventually expose this as `/video-studio/`

— a one-click Web UI over the same pipeline. Right now it's CLI only.

## Related Articles / Want to Try It?

-
[Running HiDream-O1-Image's 5 modes resident on 1 GPU](https://kotonia.ai/articles/)— backend design for Studio (`/studio/`

) -
[Fitting LTX-2 onto a single 95 GB GPU with fp8-cast quantization](https://kotonia.ai/articles/)— the Stage C video generation foundation -
[Reproducing language-learning short videos with Claude Code](https://kotonia.ai/articles/)— earlier 6-beat "mango incident" format implementation - Want to try the image generation side?
[/studio/](https://kotonia.ai/studio/)lets you do it in one click (video pipeline CLI is self-host only for now)