{"slug": "i-built-an-ai-studio-that-turns-1-line-of-text-into-a-5-minute-short-drama", "title": "I Built an AI Studio That Turns 1 Line of Text Into a 5-Minute Short Drama", "summary": "A developer built Shortify AI, an open-source platform that converts a single line of text into a complete 1080p short drama episode. The pipeline uses AI for scriptwriting, storyboarding, multi-voice dubbing, and video compositing, compressing a process that normally takes 5-10 people and 1-2 weeks into a single person and 5 minutes.", "body_md": "**From idea to 1080p video — AI scriptwriting, storyboarding, multi-voice dubbing, and video compositing in one pipeline.**\n\nI spent the past month building [Shortify AI](https://github.com/ycbing/Shortify-AI), an open-source platform that takes a creative prompt like \"a time-traveling maid from ancient China lands in a modern office\" and outputs a full short drama episode — script, illustrated storyboard, multi-character voiceover, and a complete 1080p video.\n\nHere's how it works under the hood, the architecture decisions I made, and the full pipeline that ties it together.\n\nChina's short drama market hit $70B in 2025. These are vertical, fast-paced 1–5 minute episodes — basically TikTok meets TV series. 
The production pipeline is:\n\n**Total: 5–10 people, 1–2 weeks per episode.**\n\nI wanted to compress this to: **1 person, 5 minutes.**\n\nHere's the end-to-end pipeline:\n\n```\nUser Input (\"穿越到现代的女将军\")\n       │\n       ▼\n┌─────────────────────┐\n│   AI Scriptwriter   │  ← GLM-4-Flash / DeepSeek / Qwen\n│  (characters +      │\n│   scenes + shots)   │\n└─────────┬───────────┘\n          │\n          ▼\n┌─────────────────────┐\n│  Storyboard Images  │  ← Wan2.7-image / CogView-3-Plus\n│  (1 image per shot) │\n└─────────┬───────────┘\n          │\n          ▼\n┌─────────────────────┐\n│  Multi-voice TTS    │  ← iFlytek WebSocket / Edge-TTS\n│  (male / female /   │\n│   narrator per role)│\n└─────────┬───────────┘\n          │\n          ▼\n┌─────────────────────┐\n│  Video Compositing  │  ← FFmpeg (Ken Burns + AI video)\n│  (1080p, subtitles, │\n│   background music) │\n└─────────┬───────────┘\n          │\n          ▼\n    COS Storage + Share URL\n```\n\nEach stage runs independently and can be swapped — more on that below.\n\nThe LLM acts as a screenwriting assistant. Given a creative prompt, it generates a structured script with:\n\nThe prompt engineering was the hardest part. Early versions produced flat narration-style scripts. 
The breakthrough was switching to **dialogue-centric formatting**:\n\n``` js\n// Simplified prompt structure\nconst systemPrompt = `\nGenerate a short drama script in this format:\n{\n  \"characters\": [{ \"name\": \"...\", \"gender\": \"male|female\", \"voiceId\": \"...\" }],\n  \"shots\": [\n    {\n      \"shotNumber\": 1,\n      \"character\": \"...\",\n      \"dialogue\": \"...\",\n      \"sceneDescription\": \"...\",\n      \"cameraDirection\": \"close-up|wide|over-shoulder\"\n    }\n  ]\n}\nRules:\n- Each shot = one line of dialogue\n- Include scene descriptions for every shot\n- Mark characters explicitly so we can assign voice models\n- Total length: 7-12 shots per episode\n`;\n```\n\n**API:** GLM-4-Flash (free tier), but any OpenAI-compatible LLM works via our model resolver layer.\n\nFor each shot, we generate an illustration. The challenge wasn't the image generation itself — it was **managing cost and consistency**.\n\n```\nWan2.7-image (DashScope, ~$0.03/image)  → 2K resolution, synchronous\n  └── fallback → Wanx-v1 (older, cheaper) → 720p, async polling\n    └── fallback → CogView-3-Plus (Zhipu) → fallback with different API\n```\n\nFor character consistency across shots, we inject appearance descriptors into every prompt:\n\n``` js\nfunction buildAppearancePrompt(shot, characters): string {\n  const shotChars = characters.filter(c => shot.character === c.name);\n  const appearanceDesc = shotChars\n    .map(c => `${c.name}: ${c.appearance}`)\n    .join(\". \");\n\n  return `${shot.sceneDescription}. ${appearanceDesc}. \n          Cinematic lighting, 16:9 widescreen, photorealistic.`;\n}\n```\n\nImages are uploaded to Tencent Cloud COS (private bucket) with signed URLs for secure access.\n\nThis was the most fun to build. 
The LLM script tells us which character speaks in each shot, and we assign voice IDs accordingly:\n\n``` ts\n// Shape inferred from the entries below\ntype VoiceConfig = { edgeTTS: string; iFlytek: string };\n\nconst voiceMap: Record<string, VoiceConfig> = {\n  \"male-lead\":    { edgeTTS: \"zh-CN-YunxiNeural\",  iFlytek: \"x4_yehaoyun_oral\" },\n  \"female-lead\":  { edgeTTS: \"zh-CN-XiaoxiaoNeural\", iFlytek: \"x4_shisan_oral\" },\n  \"narrator\":     { edgeTTS: \"zh-CN-YunjianNeural\", iFlytek: \"x4_yunbai_oral\" },\n  \"child\":        { edgeTTS: \"zh-CN-XiaoxuanNeural\", iFlytek: \"x4_yunxiaoyan_oral\" },\n};\n```\n\n**Failure handling:** The primary TTS (iFlytek WebSocket) is sometimes rate-limited. In that case, the pipeline automatically falls back to Edge-TTS (free, no API key required). The voiceover stage takes ~3 seconds per shot, so a 12-shot episode takes ~36 seconds for all dubbing.\n\nThis is where most of the engineering effort went. The compositing pipeline:\n\n```\nFor each shot:\n  1. Generate AI video (Wan2.7-t2v / CogVideoX) → or fall back to Ken Burns\n  2. Mix voiceover audio → sync with video\n  3. Speed-ramp video to match audio duration\n  4. Apply fade-in/fade-out\n  5. Generate SRT subtitles\n\nThen:\n  6. Concat all shot videos → episode\n  7. Burn subtitles into final video\n  8. Upload to COS\n  9. Generate share URL\n```\n\nWhen AI video generation is disabled (or fails), we fall back to static images with camera motion. 
FFmpeg's `zoompan` filter creates 10 different effects:\n\n```\n# Example: slow zoom-in with fade\n# ${...} is a JavaScript template expression filled in by the calling script (dur is presumably the shot duration in seconds)\nffmpeg -y -loop 1 -i \"image.jpg\" -i \"audio.mp3\" \\\n  -filter_complex \"\n    [0:v]scale=1920:1080:force_original_aspect_ratio=decrease,\n          pad=1920:1080:(ow-iw)/2:(oh-ih)/2:color=black,\n          zoompan=z='min(zoom+0.002,1.8)':d=240:s=1920x1080:fps=24,\n          fade=t=in:st=0:d=0.4,\n          fade=t=out:st=${(dur-0.4).toFixed(1)}:d=0.4[v];\n    [1:a]adelay=1|1[a]\n  \" -map \"[v]\" -map \"[a]\" -c:v libx264 -crf 20 -preset fast \"output.mp4\"\n```\n\n10 camera effects: zoom-in, zoom-out, pan-left, pan-right, pan-up, pan-down, zoom-in-left, zoom-in-right, zoom-out-left, zoom-out-right. Each shot picks one in round-robin, making static images feel dynamic.\n\nAI video generation has a ~10-15% failure rate (API timeouts, rate limits, content filters). The original pipeline simply showed a black screen for failed shots. The fix was a three-tier defense:\n\n```\nTier 1: AI video generation → 80% success rate\nTier 2: Ken Burns zoompan → ~15% (catches most failures)\nTier 3: Static image + text overlay → 100% reliability\n```\n\n**Tier 2 was the most fragile**: the zoompan filter chain with complex expressions often failed on edge cases (very long/short audio, special characters in subtitles). Tier 3 uses FFmpeg's `drawtext` filter to render the subtitle over a dark gradient background; it literally never fails.\n\nWe abstracted the entire end-to-end flow into a single CLI command:\n\n```\nnpx tsx scripts/full-pipeline.ts\n```\n\nThis reads the drama metadata from PostgreSQL, iterates over every episode and shot, runs the 4-stage pipeline, and uploads everything to cloud storage. 
The same script powers the web app's backend API.\n\n**Before (4 AI video generations):** ~20 minutes per episode (5 min/shot × 4 shots + 1080p encoding)\n\n**After (Ken Burns only):** ~2 minutes per episode\n\nEarly on, I chased the best AI video models (CogVideoX, Kling, Jimeng). But they all have ~10% failure rates, and a single failed shot ruins the entire episode. Investing in a **rock-solid fallback chain** that guaranteed no black screens was worth more than a 10% quality bump.\n\nNo single AI provider covers everything. We use:\n\nThe model config system lets users bring their own API keys for any stage.\n\nAI content generation is getting fast and cheap. The bottleneck is now FFmpeg. A 12-shot episode with Ken Burns processing takes ~3 minutes just for the encoding. If you're building an AI video platform, **invest in your compositing pipeline first**.\n\nHere's a 5-episode drama generated entirely by the pipeline:\n\n| Episode | Duration | Size | Sample |\n|---|---|---|---|\n| Episode 1 | 26s | 5.8MB | |\n\nThe project is open-source under MIT:\n\n```\ngit clone https://github.com/ycbing/Shortify-AI.git\ncd Shortify-AI\nnpm install\n# Configure .env.local with at least DATABASE_URL and GLM_API_KEY\nnpm run db:push\nnpm run dev\n```\n\nOr run the full pipeline directly:\n\n```\nnpx tsx scripts/full-pipeline.ts\n```\n\n**You only need a GLM API key** to get started; everything else has free tiers or falls back gracefully.\n\n*Built with Next.js 16, FFmpeg, PostgreSQL, and a lot of API calls. 
Star on GitHub if you found this interesting.*", "url": "https://wpnews.pro/news/i-built-an-ai-studio-that-turns-1-line-of-text-into-a-5-minute-short-drama", "canonical_source": "https://dev.to/ycbing/i-built-an-ai-studio-that-turns-1-line-of-text-into-a-5-minute-short-drama-j01", "published_at": "2026-06-24 08:52:21+00:00", "updated_at": "2026-06-24 09:13:42.341284+00:00", "lang": "en", "topics": ["artificial-intelligence", "generative-ai", "large-language-models", "developer-tools", "ai-tools"], "entities": ["Shortify AI", "GLM-4-Flash", "DeepSeek", "Qwen", "Wan2.7-image", "CogView-3-Plus", "iFlytek", "Edge-TTS"], "alternates": {"html": "https://wpnews.pro/news/i-built-an-ai-studio-that-turns-1-line-of-text-into-a-5-minute-short-drama", "markdown": "https://wpnews.pro/news/i-built-an-ai-studio-that-turns-1-line-of-text-into-a-5-minute-short-drama.md", "text": "https://wpnews.pro/news/i-built-an-ai-studio-that-turns-1-line-of-text-into-a-5-minute-short-drama.txt", "jsonld": "https://wpnews.pro/news/i-built-an-ai-studio-that-turns-1-line-of-text-into-a-5-minute-short-drama.jsonld"}}