I Built an AI Studio That Turns 1 Line of Text Into a 5-Minute Short Drama

A developer built Shortify AI, an open-source platform that converts a single line of text into a complete 1080p short drama episode. The pipeline uses AI for scriptwriting, storyboarding, multi-voice dubbing, and video compositing, compressing a process that normally takes 5-10 people and 1-2 weeks into a single person and 5 minutes.

From idea to 1080p video: AI scriptwriting, storyboarding, multi-voice dubbing, and video compositing in one pipeline.

I spent the past month building Shortify AI (https://github.com/ycbing/Shortify-AI), an open-source platform that takes a creative prompt like "a time-traveling maid from ancient China lands in a modern office" and outputs a full short drama episode: script, illustrated storyboard, multi-character voiceover, and a complete 1080p video.

Here's how it works under the hood, the architecture decisions I made, and the full pipeline that ties it together.

China's short drama market hit $70B in 2025. These are vertical, fast-paced 1–5 minute episodes, basically TikTok meets TV series. The production pipeline is:

Total: 5–10 people, 1–2 weeks per episode.

I wanted to compress this to: 1 person, 5 minutes.

Here's the end-to-end pipeline:

```
User Input: "A female general who time-travels to the modern day"
          │
          ▼
┌─────────────────────┐
│ AI Scriptwriter     │ ← GLM-4-Flash / DeepSeek / Qwen
│ characters +        │
│ scenes + shots      │
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│ Storyboard Images   │ ← Wan2.7-image / CogView-3-Plus
│ 1 image per shot    │
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│ Multi-voice TTS     │ ← iFlytek WebSocket / Edge-TTS
│ male / female /     │
│ narrator per role   │
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│ Video Compositing   │ ← FFmpeg Ken Burns + AI video
│ 1080p, subtitles,   │
│ background music    │
└─────────┬───────────┘
          │
          ▼
COS Storage + Share URL
```

Each stage runs independently and can be swapped; more on that below.

The LLM acts as a screenwriting assistant.
Given a creative prompt, it generates a structured script with:

The prompt engineering was the hardest part. Early versions produced flat, narration-style scripts. The breakthrough was switching to dialogue-centric formatting:

```js
// Simplified prompt structure
const systemPrompt = `
Generate a short drama script in this format:
{
  "characters": [{ "name": "...", "gender": "male|female", "voiceId": "..." }],
  "shots": [{
    "shotNumber": 1,
    "character": "...",
    "dialogue": "...",
    "sceneDescription": "...",
    "cameraDirection": "close-up|wide|over-shoulder"
  }]
}

Rules:
- Each shot = one line of dialogue
- Include scene descriptions for every shot
- Mark characters explicitly so we can assign voice models
- Total length: 7-12 shots per episode
`;
```

API: GLM-4-Flash (free tier), but any OpenAI-compatible LLM works via our model resolver layer.

For each shot, we generate an illustration. The challenge wasn't the image generation itself; it was managing cost and consistency.

```
Wan2.7-image (DashScope, ~$0.03/image) → 2K resolution, synchronous
  └── fallback → Wanx-v1 (older, cheaper) → 720p, async polling
        └── fallback → CogView-3-Plus (Zhipu) → fallback with different API
```

For character consistency across shots, we inject appearance descriptors into every prompt:

```ts
// Builds the image prompt for one shot: the scene description plus the
// appearance of the character who speaks in that shot.
function buildAppearancePrompt(
  shot: { character: string; sceneDescription: string },
  characters: { name: string; appearance: string }[]
): string {
  const shotChars = characters.filter((c) => shot.character === c.name);
  const appearanceDesc = shotChars
    .map((c) => `${c.name}: ${c.appearance}`)
    .join(". ");
  return `${shot.sceneDescription}. ${appearanceDesc}. Cinematic lighting, 16:9 widescreen, photorealistic.`;
}
```

Images are uploaded to Tencent Cloud COS (private bucket) with signed URLs for secure access.

This was the most fun to build. The LLM script tells us which character speaks in each shot, and we assign voice IDs accordingly:

js const voiceMap: Record