How I Cut AI Video Costs 80%: Building a Flutter AI Lecture Video Creator with Ollama

A developer built a Flutter AI lecture video creator using local Ollama and FFmpeg, cutting cloud API costs by 80%. The system tackles three major synchronization challenges by generating segmented audio with Edge-TTS and using FFmpeg for precise timing, all while running on-device for control, privacy, and speed.

This article was originally published on BuildZn.

Everyone talks about AI video, but nobody explains the actual sync hell. Building a reliable Flutter AI lecture video creator meant battling precise timing. Here's how I cracked the three toughest synchronization challenges using local Ollama and FFmpeg, cutting production costs by 80% along the way. Forget per-minute pricing for video synthesis; we're doing this on-device, or at least locally.

Running everything in the cloud for AI video generation sounds great until you get the bill. Trust me, I've seen it with FarahGPT's initial transcription costs. Every minute of synthesized video, every LLM call for script generation, and every API hit for text-to-speech (TTS) adds up. Fast. If you're building a tool that churns out educational content, those costs are unsustainable.

My goal was clear: cut out as many cloud dependencies as possible. This meant running llama3:8b or phi3 locally through Ollama, so script generation costs are effectively zero after hardware.

This approach isn't just about cost. It's about control, privacy, and speed. No rate limits, no data going to third parties, and often faster iteration than waiting on cloud queues. When you build a Flutter AI lecture video tool locally, you own the whole pipeline.

Here's the high-level flow for our AI lecture video creator:

- The Flutter app hands work to a local backend. You could drive the tools from Dart with the dart:io Process API, but a separate process gives more flexibility.
- Ollama generates the lecture script with llama3 or phi3.
- Edge-TTS produces .mp3 audio files, segment by segment.
- FFmpeg assembles slides, text overlays, and audio into the final video.

This setup lets us produce lecture videos without breaking the bank.

The real challenge isn't just generating content; it's making it sync. You can't just slap audio over a static image. You need precise timing. I identified three major sync hurdles:

1. Keeping on-screen text overlays in step with the narration.
2. Matching each slide's video duration to its audio.
3. Stitching segments together with transitions without drifting out of sync.

Here's how I tackled each one, focusing on FFmpeg's capabilities.

First, Ollama generates the script. We then break this script into sentences or logical phrases. Each phrase gets its own TTS audio file generated by Edge-TTS.
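The splitting step can be sketched in a few lines of Node.js. This is a minimal illustration, not the article's actual code: `splitIntoPhrases` and `planSegments` are hypothetical helper names, and the rule used here (break after `.`, `!`, or `?` followed by whitespace) is an assumption that will mis-split abbreviations such as "e.g.".

```javascript
// Split a generated script into sentence-sized phrases, one TTS call each.
// Hypothetical helper: the splitting rule is an assumption, not the
// article's exact logic.
function splitIntoPhrases(script) {
  return script
    .replace(/\s+/g, ' ')       // collapse newlines and repeated spaces
    .split(/(?<=[.!?])\s+/)     // break after sentence-ending punctuation
    .map((phrase) => phrase.trim())
    .filter((phrase) => phrase.length > 0);
}

// Pair each phrase with the audio file its TTS output would be written to.
function planSegments(script) {
  return splitIntoPhrases(script).map((text, index) => ({
    index,
    text,
    audioFile: `temp_audio_${index}.mp3`,
  }));
}

const segments = planSegments(
  'Welcome to this lecture on AI video creation. Today we cover sync! Ready?'
);
console.log(segments);
```

Each entry then carries everything one Edge-TTS call needs: the text to speak and the file to write.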
Here is the TTS call for a single sentence. It's a bit of a hack, but it works surprisingly well for local TTS. The --rate flag adjusts the speaking speed, which becomes crucial for sync later. Save the command in a local utility script or call it directly from a Node.js child_process.

    # Example: generate TTS for a single sentence
    edge-tts --text "Welcome to this lecture on AI video creation." --write-media "temp_audio_0.mp3" --voice "en-US-JennyNeural" --rate=+10%

Some guides mention a --playback-offset flag for adding a slight delay before the first word. Confirm it exists in your installed version with edge-tts --help before relying on it; for segmented files it's usually better to handle offsets in FFmpeg anyway.

A flag that basic tutorials rarely highlight is --rate (e.g., --rate=+10% or --rate=-5%). It becomes invaluable when the synthesized audio for a segment turns out slightly too long or too short for a fixed visual duration. Instead of re-rendering the whole thing, you regenerate that one segment with the rate tweaked by a few percent, without noticeable pitch changes. This sidesteps FFmpeg's atempo filter: chaining several atempo operations with slightly different factors can introduce tiny gaps or overlaps that go unnoticed on a single segment but compound over a long video and cause audio desync further down the line. atempo also degrades quality if overused or chained carelessly; tuning edge-tts directly is safer.

Once we have our segmented audio files, we need their exact durations.
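Before measuring anything, it may help to see how a --rate value could be chosen once a duration is known. The sketch below is an assumption-laden illustration: `rateForTarget` is a hypothetical helper, and it assumes a rate of +X% shortens a clip to roughly duration / (1 + X/100), which is only approximate because pauses may not scale linearly. The result is rounded to a whole percent and clamped so the voice stays close to its natural pace.

```javascript
// Pick an edge-tts --rate value that nudges a clip toward a target length.
// Assumption (not verified against edge-tts internals): +X% rate shortens
// the clip to roughly duration / (1 + X / 100).
function rateForTarget(measuredSeconds, targetSeconds, maxPercent = 10) {
  if (!(measuredSeconds > 0) || !(targetSeconds > 0)) {
    throw new RangeError('durations must be positive numbers');
  }
  const exact = (measuredSeconds / targetSeconds - 1) * 100;
  // Clamp so a large mismatch does not produce an unnatural voice.
  const clamped = Math.max(-maxPercent, Math.min(maxPercent, exact));
  const rounded = Math.round(clamped);
  const sign = rounded >= 0 ? '+' : '-';
  return `${sign}${Math.abs(rounded)}%`;
}

// A clip measured at 8.4 s that should fit an 8.0 s slot.
console.log(`--rate=${rateForTarget(8.4, 8.0)}`);
```

If the clamp kicks in, the mismatch is too large to fix with speed alone, and letting the audio set the segment length is the better option.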
In Node.js, ffprobe gives the duration directly:

    // Node.js: read an audio file's duration (in seconds) with ffprobe.
    const { execFile } = require('child_process');

    function getAudioDuration(filePath) {
      return new Promise((resolve, reject) => {
        const args = ['-v', 'error', '-show_entries', 'format=duration',
          '-of', 'default=noprint_wrappers=1:nokey=1', filePath];
        execFile('ffprobe', args, (error, stdout, stderr) => {
          if (error) return reject(new Error(stderr || error.message));
          resolve(parseFloat(stdout));
        });
      });
    }

In Flutter you can use a package like just_audio, or call a local ffprobe binary on desktop targets:

    // Dart: the same measurement through dart:io (assumes ffprobe is on PATH).
    import 'dart:io';

    Future<double> getAudioDuration(String filePath) async {
      final result = await Process.run('ffprobe', [
        '-v', 'error',
        '-show_entries', 'format=duration',
        '-of', 'default=noprint_wrappers=1:nokey=1',
        filePath,
      ]);
      if (result.exitCode != 0) {
        throw ProcessException('ffprobe', [filePath], '${result.stderr}', result.exitCode);
      }
      return double.parse((result.stdout as String).trim());
    }

With durations in hand, we build a complex FFmpeg filter graph. Each text overlay (drawtext) needs precise start and end timestamps.

    # FFmpeg command snippet for text overlays; in practice this sits inside a much larger filter graph.
    # temp_slide_0.png is the background for this segment.
    ffmpeg -loop 1 -t 6.5 -i temp_slide_0.png -i temp_audio_0.mp3 \
      -filter_complex "[0:v]scale=1280:720,setsar=1:1[bg]; \
        [bg]drawtext=fontfile=/path/to/Roboto-Regular.ttf:text='Welcome to this lecture':x=(w-text_w)/2:y=h/2-30:fontsize=48:fontcolor=white:box=1:boxcolor=black@0.5:boxborderw=10:enable='between(t,0,3)'[t1]; \
        [t1]drawtext=fontfile=/path/to/Roboto-Regular.ttf:text='on AI video creation.':x=(w-text_w)/2:y=h/2+30:fontsize=48:fontcolor=white:box=1:boxcolor=black@0.5:boxborderw=10:enable='between(t,3,6)'[t2]; \
        [t2]fade=t=out:st=6:d=0.5[v_out]" \
      -map "[v_out]" -map 1:a -c:v libx264 -preset veryfast -crf 23 -c:a aac -b:a 128k output_segment_0.mp4

Two details matter here. -loop 1 -t 6.5 turns the still image into a 6.5-second stream, so the overlays and the fade have frames to act on. And the filters are chained ([bg] to [t1] to [t2] to [v_out]) because a labeled output in a filter graph can only be consumed once.

The enable='between(t,start_time,end_time)' part is critical. You calculate start_time and end_time for each phrase from the TTS audio segment durations. The Node.js backend manages this by collecting all the timings. This is where the three synchronization challenges really come into play.
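The start and end times for each enable window fall out of a running sum over the measured phrase durations. The sketch below shows one way to compute them; `buildTimeline` and `enableExpression` are hypothetical names, and the durations are illustrative values rather than real measurements.

```javascript
// Turn measured phrase durations into drawtext enable windows.
// Each phrase starts where the previous one ended.
function buildTimeline(phrases) {
  let cursor = 0;
  return phrases.map(({ text, duration }) => {
    const start = cursor;
    cursor += duration;
    return { text, start, end: cursor };
  });
}

// Format one timeline entry as a drawtext enable option.
function enableExpression({ start, end }) {
  return `enable='between(t,${start.toFixed(3)},${end.toFixed(3)})'`;
}

const timeline = buildTimeline([
  { text: 'Welcome to this lecture', duration: 3 },
  { text: 'on AI video creation.', duration: 3 },
]);
timeline.forEach((entry) => console.log(entry.text, enableExpression(entry)));
```

Note that between() includes both endpoints, so two adjacent overlays are both active on the exact boundary frame. If that matters, an expression such as gte(t,start)*lt(t,end) makes the windows half-open.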
If your TTS for a slide segment is 8.2 seconds but your slide is designed to be 8.0 seconds, you have a problem.

My solution: instead of trying to fit audio to fixed video, I let the audio dictate the video segment length.

    # Generate a static-image video whose length matches the audio.
    # -loop 1 loops the image; -t sets the duration.
    ffmpeg -loop 1 -i slide_background_image.png -i slide_audio.mp3 \
      -c:v libx264 -t "$(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 slide_audio.mp3)" \
      -vf "scale=1920:1080,setsar=1:1" \
      -c:a aac -b:a 128k \
      -shortest output_slide_segment.mp4

$(ffprobe ...) dynamically gets the audio duration. The -shortest flag ends the output with the shortest input, which here is the audio, since the looped image would otherwise run forever. This keeps each individual slide in sync with its narration.

Once you have synced video segments for each slide, you need to stitch them together with transitions. FFmpeg's xfade filter is your best friend here. First, generate all your individual slide segments (e.g., segment_0.mp4, segment_1.mp4, segment_2.mp4). Then, create a concat.txt file:

    file 'segment_0.mp4'
    file 'segment_1.mp4'
    file 'segment_2.mp4'

A concat list like this covers plain hard cuts through the concat demuxer. For xfade, each segment is passed as its own -i input instead.

Now, the xfade magic. This is where it gets complex with chaining, because every transition needs a carefully calculated duration and offset. Let $D_i$ be the duration of segment $i$ and $T$ the transition duration. The offset for the transition from segment $i$ to segment $i+1$ is

$\text{offset}_i = \sum_{j=0}^{i-1} D_j + D_i - T$

When xfade filters are chained, every earlier transition has already shortened the timeline by $T$, so for later transitions those overlaps must be subtracted as well: $\text{offset}_i = \sum_{j=0}^{i} D_j - (i+1)\,T$. For the first transition ($i = 0$) both expressions give $D_0 - T$.
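To make the offset arithmetic concrete, here is a small sketch for a chain of segments joined by equal-length fades. `xfadeOffsets` is a hypothetical helper; it tracks the length of the video assembled so far, which is what xfade measures the offset against, so each earlier overlap is accounted for automatically.

```javascript
// Compute xfade offsets for segments joined by fades of equal length.
// durations: segment lengths in seconds; transition: fade length in seconds.
function xfadeOffsets(durations, transition) {
  if (durations.length === 0) {
    throw new RangeError('need at least one segment');
  }
  const offsets = [];
  let chainLength = durations[0]; // length of the video assembled so far
  for (let i = 1; i < durations.length; i++) {
    if (chainLength <= transition || durations[i] <= transition) {
      throw new RangeError('segments must be longer than the transition');
    }
    // The next segment starts fading in `transition` seconds before the
    // current chain ends.
    const offset = chainLength - transition;
    offsets.push(offset);
    chainLength = offset + durations[i];
  }
  return { offsets, totalLength: chainLength };
}

// Three segments of 10 s, 8 s and 6 s with 0.5 s fades.
console.log(xfadeOffsets([10, 8, 6], 0.5));
```

The total length is the sum of the segment durations minus one transition per join, which is a quick sanity check against the final file's reported duration.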
Example for two segments with a 0.5s fade transition. The input videos are already synced to their audio: [0:v] and [0:a] are the video and audio of segment 0, [1:v] and [1:a] those of segment 1. The offsets are calculated in Node.js (or Dart). If segment 0 is 10s, segment 1 is 8s, and the transition is 0.5s, then offset_1 = 10 - 0.5 = 9.5s.

The Node.js backend builds the FFmpeg command along these lines:

    const transitionDuration = 0.5; // seconds

    // segments: [{ duration }, ...], where duration is the measured audio length.
    // Needs at least two segments.
    function buildXfadeCommand(segments) {
      const inputs = segments.map((_, i) => `-i segment_${i}.mp4`).join(' ');
      const filters = [];
      // Reset timestamps so every input starts at zero.
      segments.forEach((_, i) => {
        filters.push(`[${i}:v]setpts=PTS-STARTPTS[v${i}]`);
        filters.push(`[${i}:a]asetpts=PTS-STARTPTS[a${i}]`);
      });
      let lastVideo = '[v0]';
      let lastAudio = '[a0]';
      let chainLength = segments[0].duration; // length of the chain built so far
      for (let i = 1; i < segments.length; i++) {
        // offset: where the next slide starts to appear on the chain's timeline
        const offset = (chainLength - transitionDuration).toFixed(3);
        filters.push(`${lastVideo}[v${i}]xfade=transition=fade:duration=${transitionDuration}:offset=${offset}[v${i}f]`);
        // acrossfade overlaps the audio by the same amount, so both streams stay equal in length
        filters.push(`${lastAudio}[a${i}]acrossfade=d=${transitionDuration}[a${i}m]`);
        lastVideo = `[v${i}f]`;
        lastAudio = `[a${i}m]`;
        chainLength += segments[i].duration - transitionDuration;
      }
      return `ffmpeg ${inputs} -filter_complex "${filters.join('; ')}" ` +
        `-map "${lastVideo}" -map "${lastAudio}" output_final.mp4`;
    }

Here's the thing: the xfade filter itself doesn't handle audio. The audio streams need their own crossfade in parallel. My first draft reached for amix, but amix simply mixes both inputs from time zero, so the two narrations play on top of each other; acrossfade with the same duration is the filter that actually overlaps the end of one stream with the start of the next. Each xfade also has to consume the output of the previous one, otherwise the chain never forms.

The offset parameter for xfade is critical: it's the timestamp, on the timeline of the video assembled so far, where the second input (the new slide) starts to appear.
This is the sum of the previous segment durations minus the transition duration; in a chain of xfade filters, the overlap consumed by every earlier transition has to be subtracted too. Getting these offsets wrong, even by a small amount per transition, leads to audio/video desync that grows with every slide. This is a common pitfall. My Node.js orchestrator uses a timeline object to track each segment's start time, end time, and audio duration, then dynamically generates the FFmpeg commands. That keeps video frames and audio samples aligned across the whole lecture.

Initially, I tried to force-fit audio to fixed video durations by relying heavily on FFmpeg's atempo filter (-filter:a "atempo=speed_factor"). Big mistake. While atempo can change audio speed, chaining it multiple times with varying factors introduces subtle artifacts, especially if you're trying to speed up by 10% or slow down by 20%. It also makes the audio sound robotic or unnatural very quickly.

The fix: let the audio duration be the source of truth. Generate the audio first, measure its duration precisely with ffprobe, and then create a video segment exactly that long. If you must adjust audio speed, do it once at the edge-tts generation step with the --rate flag, as it's often less destructive than atempo for small adjustments.

Another early blunder: trying to do everything in one gigantic FFmpeg command. While technically possible, debugging a multi-stage filter_complex with dozens of inputs and overlays is a nightmare.

The fix: break it down. Generate the audio per phrase, render each slide as its own video segment, and stitch the segments together in a final pass.

When you're generating a 10-minute lecture video, FFmpeg can take a while. Here are a few things that helped:

- preset and crf: for libx264 (the H.264 video codec), -preset veryfast -crf 23 is a good balance. veryfast is quick, and crf 23 gives decent quality. If you need it faster and can tolerate slightly larger files, try ultrafast. If you need smaller files and can wait longer, use medium or slow.
- Hardware encoding: on an NVIDIA GPU, use -c:v h264_nvenc. For Intel, -c:v h264_qsv. This shaves off significant encoding time. You need FFmpeg compiled with support for these encoders, which isn't always the default.
- Parallel rendering: slide segments are independent of each other, so you can render several at once, for example by spawning FFmpeg through child_process and awaiting the jobs with Promise.all.
Just be mindful of CPU/GPU core limits. I don't get why this isn't the default consideration for most local batch processing.

My system routinely churns out a 5-minute video (complex slides, dynamic text, transitions) in about 2-3 minutes on a decent desktop with an RTX 3060. That's a far cry from waiting 15-20 minutes for cloud renders and paying per minute.

Your Flutter app doesn't talk directly to Ollama. Instead, it communicates with a local Node.js server (or a server in any other backend language). This server then makes HTTP requests to the Ollama API (usually http://localhost:11434/api/generate) to get the script. The Node.js server acts as an intermediary, handling model selection, prompt engineering, and streaming responses back to Flutter.

Edge-TTS supports a wide range of voices and languages, the ones available in Microsoft Edge's built-in TTS capabilities. You can list the available voices with edge-tts --list-voices. Just pick a voice ID (e.g., en-US-JennyNeural, en-IN-NeerjaNeural) and pass it to the --voice argument in your command-line calls.

For Ollama with llama3:8b, you'll want at least 16GB of RAM (32GB is better) and ideally a dedicated GPU with 8GB+ of VRAM for decent generation speeds. FFmpeg is CPU-intensive for software encoding, so a multi-core CPU helps, but a GPU with hardware encoding support (NVIDIA NVENC, Intel Quick Sync) will drastically reduce video synthesis time. A fast SSD also helps with handling intermediate files.

Building a full-stack AI lecture video creator this way is no small feat, but the payoff in cost savings and control is massive. You get to control every pixel and every audio sample. If you're serious about AI content generation without burning through your budget, this local-first approach to building Flutter AI lecture video tools is the only way to go. Forget the fancy cloud dashboards; real engineering happens where the bits move.
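As a footnote to the Flutter-to-backend-to-Ollama path described above, here is a minimal sketch of the request the local server might send. The endpoint and the model, prompt, and stream fields belong to Ollama's generate API; the helper names and the prompt wording are hypothetical, and stream is set to false here for simplicity even though the article's server streams responses back to Flutter.

```javascript
// Build the request a local backend might send to Ollama's /api/generate.
const OLLAMA_URL = 'http://localhost:11434/api/generate';

function buildScriptRequest(topic, model = 'llama3:8b') {
  return {
    url: OLLAMA_URL,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model,
        // Hypothetical prompt; real prompt engineering lives in the backend.
        prompt: `Write a short lecture script about ${topic}.`,
        stream: false, // one JSON reply instead of streamed chunks
      }),
    },
  };
}

// Send the request and return the generated text.
// Uses the global fetch available in Node.js 18 and later.
async function generateScript(topic, model) {
  const { url, options } = buildScriptRequest(topic, model);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  const data = await res.json();
  return data.response;
}
```

Calling generateScript('FFmpeg filter graphs') requires a running Ollama instance with the model already pulled; buildScriptRequest on its own has no side effects and can be unit-tested without one.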