{"slug": "turning-a-1-line-idea-into-a-40-second-short-with-a-10-beat-local-video-pipeline", "title": "Turning a 1-Line Idea Into a 40-Second Short with a 10-Beat Local Video Pipeline", "summary": "Fully automated, local GPU pipeline that transforms a single-line idea into a 40-second AI-generated short video in 25-30 minutes. The pipeline chains together multiple models, including Gemma 4 31B for structuring the narrative, HiDream for image generation, and LTX-2 for video rendering, with Irodori-TTS handling narration and ffmpeg for final assembly. The author focuses on the system design of chaining models together, sharing specific solutions to common pitfalls like synchronizing narration duration and positioning subtitles to avoid overlapping with the subject's face.", "body_md": "## TL;DR\n\nGemma 4 31B expands a single-line idea into a 10-beat structure. HiDream generates 11 images at 2048², LTX-2 A2V/I2V renders 11 clips, Irodori-TTS handles dialogue and a male narrator, and ffmpeg burns in subtitles and a Hook title overlay — all fully automated. **End-to-end: a 40-second portrait video (512×768) in 25–30 minutes.** One local GPU (96 GB Blackwell), zero API cost.\n\nFinished video (already published):\n\n## Who This Is For\n\nIndividual developers who want to mass-produce AI comedy shorts on a local GPU. The focus isn't on any single model — it's on **the design of chaining multiple models into one operational pipeline**.\n\n## What I Built\n\nI automated a dark-comedy format — a short-video style I called `consent_dilemma` — from a one-line idea all the way to a finished 40-second video.\n\nFinished structure:\n\n- **Hook (0–5s)**: Extreme close-up of a beautiful woman + narrator \"The fate of the man who answered 'You're a guy, aren't you'——\" + large title overlay\n- **Main section (5–37s)**: Movie theater date → \"Can I kiss you?\" → \"No… stop it…\" → dejection → \"Why aren't you more assertive? 
You're a guy, aren't you?\" → realization → kiss\n- **Punchline (37–40s)**: Courtroom — \"The defendant is sentenced to 3 years for non-consensual intercourse\" + gavel \"Knock!\" + tears in a jail cell\n\nBefore / after:\n\n| | Traditional approach | This pipeline |\n|---|---|---|\n| Idea → published video | 2–3 days (manual editing) | 25–30 minutes (fully automated) |\n| API cost | Hundreds of yen per video (DALL-E + video gen) | ¥0 (electricity only) |\n| Subtitles | Write SRT by hand | Auto-split on punctuation and burned in |\n| Hook | Shot separately | Integrated into the pipeline |\n\n## Architecture\n\n```\n[Stage A] Gemma 4 31B (vllm, port 8894) → plan.json (10 beats + hook)\n[Stage B] HiDream-O1-Image (port 8895) → 11 images at 2048²\n          + Gemma 4 31B multimodal visual judge (--judge --max-retries 2)\n[Stage C] Irodori-TTS (port 8880) + LTX-2 A2V (port 8892) / I2V (port 8891)\n          → 11 clips + Hook clip → ffmpeg concat → subtitle burn-in\n```\n\nImplementation lives under [llm_server/storyboard/](https://github.com/zhener562/hage/tree/main/llm_server/storyboard) (pipeline.py / visual.py / judge.py / video.py / render.py / run.py).\n\n## The 10-Beat `consent_dilemma` Format\n\nFixed as a system prompt via `CONSENT_DILEMMA_SYSTEM` in `prompts.py`:\n\n| # | type | speaker | renderer | content |\n|---|---|---|---|---|\n| 1 | provocation | b | LTX-2 A2V | Suggestive invitation |\n| 2 | ask | a | LTX-2 A2V | Earnest consent check |\n| 3 | refusal | b | LTX-2 A2V | Soft refusal (ambiguous form like \"No… stop it…\") |\n| 4 | dejection | a (silent) | LTX-2 I2V | Dejection |\n| 5 | gaslight | b | LTX-2 A2V | Contradictory leading statement |\n| 6 | pause | a (silent) | LTX-2 I2V | Brief realization |\n| 7 | kiss | a (silent) | LTX-2 I2V | The moment of the kiss |\n| 8 | verdict | judge | LTX-2 A2V | Fast-paced court verdict |\n| 9 | gavel_se | judge | LTX-2 I2V (keep_audio) | Gavel + AI-generated \"Knock!\" sound |\n| 10 | jail | a 
(silent) | LTX-2 I2V | Tears in a jail cell |\n\nThree key structural choices:\n\n- **Don't make the refusal a flat \"No\"**: Stretch it into something like \"No… stop it…\" with trailing inflection, conveying the \"performative No that doesn't mean No\" nuance. This is what makes the gaslight's contradiction land later.\n- **Don't jump straight from gaslight to kiss**: Insert a \"pause\" (realization beat) of ~1.5 seconds. This controls tempo and the emotional curve.\n- **Two-stage punchline — verdict then jail**: The verdict alone feels abrupt. Showing him crying in a cell makes \"he actually got convicted\" click.\n\n## Hook Design (The TikTok 3-Second Problem)\n\nOn portrait short-form video, drop-off is decided in the first 3 seconds. A Hook segment is prepended to the 10 main beats:\n\n```\n\"hook\": {\n  \"title_overlay\": \"No Means Yes?\",\n  \"narrator_line\": \"The fate of the man who answered 'You're a guy, aren't you'——\",\n  \"image_prompt\": \"ultra close-up of beautiful Japanese woman, half-lidded eyes, ...\",\n  \"duration_sec\": 3.5\n}\n```\n\nTwo implementation pitfalls:\n\n**Pitfall 1: narrator TTS duration exceeds duration_sec, cutting the audio.** The final syllable of the narrator line got clipped. Fix: generate TTS first → measure with `ffprobe` → pass `max(plan_duration, narrator + 0.6)` as the I2V duration.\n\n```\nnarrator_dur = _ffprobe_duration(narrator_wav)\nduration = max(float(hook.get(\"duration_sec\", 0.0)), narrator_dur + 0.6)\nltx_i2v_clip(portrait, i2v_prompt, duration, silent_video, keep_audio=False)\n```\n\n**Pitfall 2: drawtext y position.** `y=h*0.30` (30% of the way down the screen) overlapped the face. 
Changed to `y=20` (absolute 20 px) to pin the title to the very top.\n\n## Subtitle Burn-In (Silent Viewing Support)\n\nSubtitles are burned in for viewers watching without sound on the train, and for cross-platform reliability.\n\n```\nstyle = (\n    \"FontName=Noto Sans CJK JP,FontSize=18,PrimaryColour=&H00FFFFFF,\"\n    \"OutlineColour=&H00000000,Outline=2,Shadow=0,BorderStyle=1,\"\n    \"Alignment=2,MarginV=60,Bold=1\"\n)\n# ffmpeg -i raw.mp4 -vf \"subtitles=subs.srt:force_style='...\"\n```\n\n`Alignment=2` = bottom center. `MarginV=60` gives breathing room from the bottom edge.\n\n**Long-line splitting**: A line of 30+ characters within one beat covers the face. `_split_subtitle` splits on `。．！？` → greedy-packs into chunks of ≤28 characters → distributes beat duration evenly across chunks:\n\nInput:\n\n言葉で確認するのなんてロマンチックじゃないよね。ねえ、もっと積極的になってよ。男の子でしょ？\n\nOutput (one 8.9s beat split into 2 timed chunks):\n\n| Time | Subtitle |\n|---|---|\n| 15.16–19.63s | 言葉で確認するのなんてロマンチックじゃないよね。 |\n| 19.63–24.10s | ねえ、もっと積極的になってよ。男の子でしょ？ |\n\n## Using LTX-2 I2V as a Sound Effect Generator (`gavel_se`)\n\nLTX-2 distilled embeds **AI-generated audio (ambient sound / sound effects) directly into the I2V output mp4**. Unless you explicitly discard it, for example by mapping in a separate audio track with `ffmpeg -map 0:v:0 -map 1:a:0` (video from the first input, audio from the second), whatever the prompt describes comes with sound.\n\nI repurposed this as an SFX generator:\n\n```python\ndef render_se_tail_beat(sb_dir, beat, prior_clip, work_dir):\n    # 1. Extract the last frame of the previous beat\n    extract_last_frame(prior_clip, last_frame_png)\n    # 2. 
Feed that image into I2V, request SFX via prompt\n    prompt = build_gavel_se_prompt(beat)\n    return ltx_i2v_clip(last_frame_png, prompt, duration, clip_path, keep_audio=True)\n```\n\nAdded a `keep_audio=True` flag to `ltx_i2v_clip` so the audio isn't dropped during ffmpeg re-encoding.\n\nPrompt for `gavel_se`:\n\n```\n\"Single decisive arm motion of the judge bringing the gavel down sharply \"\n\"onto the wooden bench. Loud sharp wood-on-wood thwack impact sound. \"\n\"Brief, contained, no other motion in the frame.\"\n```\n\nLast frame of the judge + gavel prompt → \"Knock!\" sound. If that misses, the design falls back to something like the Ace Attorney SFX.\n\n## Pitfall Log\n\nFive major pitfalls hit during development:\n\n### 1. Codex CLI hangs with vLLM 0.20.2\n\nSending a system prompt + idea via `codex exec -p gemma4` hung at 0% CPU for 20+ minutes during the `/v1/responses` handshake. Piping subprocess output through `tail -200` was also suppressing early stderr.\n\nFix: Dropped Codex entirely and hit `/v1/chat/completions` directly with `urllib.request`. Used `response_format={\"type\":\"json_object\"}` to force JSON. `plan.json` generated in 25 seconds.\n\n### 2. HiDream won't remove the cinema screen\n\nEven with `\"The movie screen is BEHIND the camera and NOT VISIBLE in frame\"` in the setting prompt, the screen persisted in the background, even at 2048 px / 50 steps.\n\nFix: Generate `scene_base` via T2I → feed that same image into I2I edit with a prompt to \"replace screen with dark wall, keep character positions identical\" → gone in one shot. Two-stage pipeline: low-res → I2I fix → regenerate all beats at full resolution.\n\n### 3. HiDream turns lips-on-lips into a cheek kiss\n\nWith standard prompting, HiDream tends to render a kiss as a cheek kiss. You need directives at the level of `\"CRITICAL: their LIPS meet directly — mouth-to-mouth contact at the CENTER of the frame. NOT a cheek kiss\"`. 
Added a dedicated early-return block in `_beat_edit_prompt` for the kiss beat.\n\n### 4. `CAST` / `CROP_BOX` / `SPEAKER_A2V_PROMPT` are hardcoded for two characters\n\nThree dictionaries — `CAST`, `CROP_BOX`, `SPEAKER_A2V_PROMPT` — only know `a` (Kenta) and `b` (Misaki). Adding judge/narrator requires updating all three simultaneously (you find out via `KeyError`). Also added branching in `render_speech_beat_ltx_a2v` so beats with `setting_override` crop from the beat's own image rather than `scene_base`.\n\n### 5. Gemma 4 multimodal judge has too many false positives\n\n`storyboard/judge.py` sends beat images + expected expressions to Gemma 4 31B for YES/NO visual judgment. It does catch **obvious** failures like wrong finger count, open-mouth pose on a silent beat, or scene geometry mismatch — but it repeatedly returns FAIL on nuanced targets such as \"subtle shy expression.\"\n\nIn practice: with max-retries 2, a beat that FAILs 3 times in a row (the first attempt plus two retries) is accepted and the pipeline proceeds. Automating the threshold for escalating to a frontier reviewer (Gemini 3.1 Pro) is still a TODO.\n\n## VRAM Layout\n\nBreakdown on a 96 GB Blackwell Max-Q:\n\n| Process | idle (GiB) | peak (GiB) |\n|---|---|---|\n| Gemma 4 31B (NVFP4) | 38 | 38 |\n| HiDream-O1-Image | 16 | 33 |\n| TTS server | 3 | 3 |\n| Ditto | 3 | 3 |\n| LTX-2 A2V (cold-start fp8-cast) | 1 | 24 |\n| LTX-2 T2V/I2V (cold-start) | 1 | 8 |\n\nEverything at peak simultaneously adds up to 109 GiB, more than the card holds → OOM. 
Operational flow:\n\n- **Stage A**: Gemma 31B + HiDream idle → peak ~62 GiB\n- **Stage B with judge**: Gemma 31B + HiDream peak → ~73 GiB\n- **Before final render:** `pkill -f \"vllm.*gemma\"` kills Gemma → 38 GiB freed\n- **Stage B final render (2048/50)**: HiDream peak ~33 GiB\n- **Before Stage C:** `lsof -ti tcp:8895 | xargs kill` kills HiDream → 16 GiB freed\n- **Stage C**: LTX-2 + TTS + Ditto → peak ~32 GiB\n\nExplicit kills at stage transitions, and everything fits on one card.\n\n## Iteration Loop (Cache Strategy)\n\n**Partial regeneration** — not \"rebuild everything\" — is what keeps iteration fast:\n\n```\n# Regen a single beat image (HiDream only)\npython -m storyboard.visual --plan ... --out ... --only-beat 7 --steps 50 --resolution 2048\n\n# Partial video regen (TTS + LTX-2)\npython -m storyboard.video --dir ... --regen-beats 5,6,7 --skip-review\n\n# Adjust only subtitle or Hook title position\nrm _video_work/clip_00_hook.mp4 _video_work/subs_irodori.srt\npython -m storyboard.video --dir ... --regen-beats none --skip-review   # ~30 seconds\n```\n\n**Cache hierarchy**:\n\n- HiDream beat images (`beat_NN_<type>.png`) — regenerate individually with `--only-beat` in ~80 seconds\n- A2V / I2V clips (`clip_NN_*.mp4`) — invalidated when beat type / speaker / line changes\n- Finished Hook clip (`clip_00_hook.mp4`) — delete just this when adjusting title position (the heavy LTX-2 I2V `hook_silent.mp4` is reused)\n- Subtitle SRT — regenerated every time (~10 seconds)\n\nTitle position / subtitle style / Hook copy tweaks re-render in 30 seconds. The 100-second LTX-2 I2V portion stays cached.\n\n## How This Fits Into Kotonia\n\nVideos generated by this pipeline feed the SNS distribution layer (TikTok / YouTube Shorts / IG Reels) — the top of the funnel for attention → conversion for Kotonia (kotonia.ai).\n\nTechnically, it's an extension of the `/studio/` stack (HiDream image generation) into the video direction. 
The plan is to eventually expose this as `/video-studio/` — a one-click Web UI over the same pipeline. Right now it's CLI only.\n\n## Related Articles / Want to Try It?\n\n- [Running HiDream-O1-Image's 5 modes resident on 1 GPU](https://kotonia.ai/articles/) — backend design for Studio (`/studio/`)\n- [Fitting LTX-2 onto a single 95 GB GPU with fp8-cast quantization](https://kotonia.ai/articles/) — the Stage C video generation foundation\n- [Reproducing language-learning short videos with Claude Code](https://kotonia.ai/articles/) — earlier 6-beat \"mango incident\" format implementation\n- Want to try the image generation side? [/studio/](https://kotonia.ai/studio/) lets you do it in one click (video pipeline CLI is self-host only for now)", "url": "https://wpnews.pro/news/turning-a-1-line-idea-into-a-40-second-short-with-a-10-beat-local-video-pipeline", "canonical_source": "https://dev.to/shinji_shimizu_bb51276a5e/turning-a-1-line-idea-into-a-40-second-short-with-a-10-beat-local-video-pipeline-cjb", "published_at": "2026-05-22 11:23:08+00:00", "updated_at": "2026-05-22 11:34:00.810628+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "open-source", "developer-tools", "products"], "entities": ["Gemma 4 31B", "HiDream", "LTX-2", "Irodori-TTS", "ffmpeg", "vllm", "consent_dilemma"], "alternates": {"html": "https://wpnews.pro/news/turning-a-1-line-idea-into-a-40-second-short-with-a-10-beat-local-video-pipeline", "markdown": "https://wpnews.pro/news/turning-a-1-line-idea-into-a-40-second-short-with-a-10-beat-local-video-pipeline.md", "text": "https://wpnews.pro/news/turning-a-1-line-idea-into-a-40-second-short-with-a-10-beat-local-video-pipeline.txt", "jsonld": "https://wpnews.pro/news/turning-a-1-line-idea-into-a-40-second-short-with-a-10-beat-local-video-pipeline.jsonld"}}