This article was originally published on [BuildZn].
Everyone talks about AI video, but nobody explains the actual sync hell. Building a reliable Flutter AI lecture video pipeline meant battling precise timing. Here's how I cracked the three toughest synchronization challenges using local Ollama and FFmpeg, saving a ton on cloud APIs and cutting production costs by 80%. Forget per-minute pricing for video synthesis; we're doing this on-device, or at least locally.
Running everything in the cloud for AI video generation sounds great until you get the bill. Trust me, I've seen it with FarahGPT's initial transcription costs. Each minute of synthesized video, every LLM call for script generation, every API hit for text-to-speech (TTS) adds up. Fast. If you're building a tool that churns out educational content, those costs are unsustainable.
My goal was clear: cut out as many cloud dependencies as possible. This meant:
With llama3:8b or phi3 running locally, script generation costs effectively zero after hardware.
This approach isn't just about cost. It's about control, privacy, and speed. No rate limits, no data going to third parties, and often, faster iteration times than waiting on cloud queues. When you build Flutter AI lecture videos locally, you own the whole pipeline.
Here’s the high-level flow for our AI lecture video creator:
dart:io Process API, but a separate process gives more flexibility.
llama3 or phi3.
.mp3 audio files, segment by segment.
This setup lets us build Flutter AI lecture video content without breaking the bank.
The real challenge isn't just generating content; it's making it sync. You can't just slap audio over a static image. You need precise timing. I identified three major sync hurdles:
Here’s how I tackled each one, focusing on FFmpeg’s capabilities.
First, Ollama generates the script. We then break this script into sentences or logical phrases. Each phrase gets its own TTS audio file generated by Edge-TTS.
edge-tts --text "Welcome to this lecture on AI video creation." --write-media "temp_audio_0.mp3" --voice "en-US-JennyNeural" --rate=+10%
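The split-then-synthesize step can be sketched in Node.js roughly like this. It is a minimal sketch, not the exact orchestrator code: splitIntoPhrases, buildTtsArgs, and synthesizeAll are hypothetical helper names, the sentence splitter is deliberately naive, and it assumes the edge-tts CLI is on your PATH.

```javascript
// Sketch: split a script into sentence-level segments and build edge-tts arguments.
// splitIntoPhrases, buildTtsArgs and synthesizeAll are hypothetical helpers.
const { execFile } = require('child_process');
const { promisify } = require('util');
const execFileAsync = promisify(execFile);

// Naive sentence splitter: breaks after ., ! or ? followed by whitespace.
function splitIntoPhrases(script) {
  return script
    .split(/(?<=[.!?])\s+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}

// Argument list for one segment; the flags mirror the command shown above.
function buildTtsArgs(text, index, voice = 'en-US-JennyNeural', rate = '+0%') {
  return [
    '--text', text,
    '--write-media', `temp_audio_${index}.mp3`,
    '--voice', voice,
    `--rate=${rate}`, // "=" form so a negative value is not parsed as a flag
  ];
}

// Generate one audio file per phrase, one at a time, so file order is predictable.
async function synthesizeAll(script) {
  const phrases = splitIntoPhrases(script);
  for (let i = 0; i < phrases.length; i++) {
    await execFileAsync('edge-tts', buildTtsArgs(phrases[i], i));
  }
  return phrases;
}
```

Passing arguments as an array through execFile avoids shell quoting problems when the script text contains quotes or apostrophes.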
If you need to pre-buffer or introduce a slight delay before the first word of a segment, handle that offset in FFmpeg rather than in edge-tts. A crucial flag not often highlighted in basic tutorials is --rate (e.g., --rate=+10% or --rate=-5%). This becomes invaluable when you realize your synthesized audio for a specific segment is slightly too long or too short for a fixed visual duration. Instead of re-rendering the whole thing, you can regenerate that one segment with the rate tweaked by a few percent, without noticeable pitch changes. This avoids chaining multiple atempo filters with slightly different factors, which can introduce tiny gaps or overlaps that are unnoticeable on their own but compound over a long video, leading to audio desync later down the line. atempo degrades quality if overused or chained without extreme care; tuning edge-tts directly is safer.
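To make the --rate tweak concrete, here is a small sketch of how the percentage could be derived. rateForTarget is a hypothetical helper, and it assumes the measured audio was generated at the default rate and that duration scales roughly inversely with speaking rate, which is only an approximation for TTS.

```javascript
// Sketch: compute an edge-tts --rate value so a segment lands near a target duration.
// Assumes the measured audio was generated at the default rate (+0%).
function rateForTarget(actualSeconds, targetSeconds, maxPercent = 10) {
  if (actualSeconds <= 0 || targetSeconds <= 0) {
    throw new RangeError('durations must be positive');
  }
  // Speaking (1 + p/100) times faster divides the duration by that factor.
  const percent = Math.round((actualSeconds / targetSeconds - 1) * 100);
  // Clamp so the adjustment stays small enough to sound natural.
  const clamped = Math.max(-maxPercent, Math.min(maxPercent, percent));
  const sign = clamped >= 0 ? '+' : '-';
  return `${sign}${Math.abs(clamped)}%`;
}
```

For example, audio that came out at 8.8 seconds for an 8.0 second slot gives +10%. Since the relationship is approximate, re-measure the regenerated file rather than trusting the math.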
Once we have our segmented audio files, we need their exact durations.
// Dart: read an audio file's duration in seconds with a local ffprobe binary.
// Assumes ffprobe is on PATH; Process.run is practical on desktop targets only.
import 'dart:io';

Future<double> getAudioDuration(String filePath) async {
  final args = [
    '-v', 'error',
    '-show_entries', 'format=duration',
    '-of', 'default=noprint_wrappers=1:nokey=1',
    filePath,
  ];
  final result = await Process.run('ffprobe', args);
  if (result.exitCode != 0) {
    throw ProcessException(
        'ffprobe', args, result.stderr.toString(), result.exitCode);
  }
  // ffprobe prints a single number such as "3.512000".
  return double.parse(result.stdout.toString().trim());
}
With durations, we build a complex FFmpeg filter graph. Each text overlay (drawtext) needs precise start and end timestamps.
# -loop 1 keeps the still image on screen for the whole segment.
# The drawtext filters are chained with commas so each stream label is used once.
ffmpeg -loop 1 -i temp_slide_0.png -i temp_audio_0.mp3 \
-filter_complex "[0:v]scale=1280:720,setsar=1:1, \
drawtext=fontfile=/path/to/Roboto-Regular.ttf:text='Welcome to this lecture':x=w/2-(text_w/2):y=H/2-30:fontsize=48:fontcolor=white:box=1:boxcolor=black@0.5:boxborderw=10:enable='between(t,0,3)', \
drawtext=fontfile=/path/to/Roboto-Regular.ttf:text='on AI video creation.':x=w/2-(text_w/2):y=H/2+30:fontsize=48:fontcolor=white:box=1:boxcolor=black@0.5:boxborderw=10:enable='between(t,3,6)', \
fade=t=out:st=6:d=0.5[v_out]" \
-map "[v_out]" -map 1:a -c:v libx264 -preset veryfast -crf 23 -c:a aac -b:a 128k -shortest output_segment_0.mp4
The enable='between(t,start_time,end_time)' part is critical. You calculate start_time and end_time for each phrase based on the TTS audio segment durations. This is managed by the Node.js backend, which collects all timings.
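As a rough illustration of that calculation, the sketch below turns a list of phrase durations into back-to-back enable windows. buildPhraseWindows and drawtextFor are hypothetical names, the styling options are trimmed for brevity, and the text is assumed to contain no quotes, colons, or backslashes (real input needs drawtext escaping).

```javascript
// Sketch: turn per-phrase audio durations into drawtext enable windows.
// buildPhraseWindows and drawtextFor are hypothetical helper names.
function buildPhraseWindows(durations) {
  let cursor = 0;
  return durations.map((duration) => {
    const start = cursor;
    cursor += duration; // the next phrase starts where this one ends
    return { start, end: cursor };
  });
}

// Assumes `text` has no characters that need escaping inside drawtext.
function drawtextFor(text, window, fontfile) {
  const start = window.start.toFixed(3);
  const end = window.end.toFixed(3);
  return (
    `drawtext=fontfile=${fontfile}:text='${text}'` +
    `:fontsize=48:fontcolor=white:enable='between(t,${start},${end})'`
  );
}

// Join one drawtext per phrase with commas to form a single filter chain.
function buildOverlayChain(phrases, durations, fontfile) {
  const windows = buildPhraseWindows(durations);
  return phrases.map((p, i) => drawtextFor(p, windows[i], fontfile)).join(',');
}
```

Keeping the window math separate from the string building makes it easy to log the timings when a caption shows up early or late.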
This is where the synchronization challenges really come into play. If your TTS audio for a slide segment is 8.2 seconds, but your slide is designed to be 8.0 seconds, you have a problem.
My Solution:
Instead of trying to fit audio to fixed video, I let the audio dictate the video segment length.
ffmpeg -loop 1 -i slide_background_image.png -i slide_audio.mp3 \
-c:v libx264 -t $(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 slide_audio.mp3) \
-vf "scale=1920:1080,setsar=1:1" \
-c:a aac -b:a 128k \
-shortest output_slide_segment.mp4
$(ffprobe ...) dynamically gets the audio duration. The -shortest flag ensures the output ends with the shortest input, which in this case is the audio, because -loop 1 turns the image into an endless video stream. This ensures perfect sync for each individual slide.
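Driven from Node.js instead of a shell substitution, the same audio-first idea might look like the sketch below. probeDuration, buildSlideArgs, and renderSlide are hypothetical helpers; the -y and -pix_fmt yuv420p flags are my additions for overwriting old files and player compatibility, not part of the command above.

```javascript
// Sketch: measure the audio first, then render a slide segment of exactly that length.
// probeDuration, buildSlideArgs and renderSlide are hypothetical helpers.
const { execFile } = require('child_process');
const { promisify } = require('util');
const execFileAsync = promisify(execFile);

// Ask ffprobe for the container duration in seconds.
async function probeDuration(file) {
  const { stdout } = await execFileAsync('ffprobe', [
    '-v', 'error',
    '-show_entries', 'format=duration',
    '-of', 'default=noprint_wrappers=1:nokey=1',
    file,
  ]);
  const seconds = parseFloat(stdout);
  if (Number.isNaN(seconds)) throw new Error(`could not read duration of ${file}`);
  return seconds;
}

// Pure function, so the argument list can be inspected or logged before running.
function buildSlideArgs(image, audio, seconds, output) {
  return [
    '-y', '-loop', '1', '-i', image, '-i', audio,
    '-t', seconds.toFixed(3),
    '-vf', 'scale=1920:1080,setsar=1:1',
    '-c:v', 'libx264', '-pix_fmt', 'yuv420p',
    '-c:a', 'aac', '-b:a', '128k',
    '-shortest', output,
  ];
}

async function renderSlide(image, audio, output) {
  const seconds = await probeDuration(audio);
  await execFileAsync('ffmpeg', buildSlideArgs(image, audio, seconds, output));
  return seconds; // keep the measured duration for later timeline math
}
```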
Once you have perfectly synced video segments for each slide, you need to stitch them together with transitions. FFmpeg's xfade filter is your best friend here.
First, generate all your individual slide segments (e.g., segment_0.mp4, segment_1.mp4, segment_2.mp4). Then, create a concat.txt file:
file 'segment_0.mp4'
file 'segment_1.mp4'
file 'segment_2.mp4'
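Now, the xfade magic. Note that concat.txt only covers plain cuts through the concat demuxer; xfade takes each segment as its own -i input instead. This is where it gets complex with chaining.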
// Node.js: build an xfade + acrossfade chain for N segments.
// `segments` is an array of { duration } objects, duration in seconds (audio length).
function buildXfadeCommand(segments, transitionDuration = 0.5) {
  const inputs = segments.map((_, i) => `-i segment_${i}.mp4`).join(' ');
  const filters = [];

  // Reset timestamps on every input so each stream starts at zero.
  segments.forEach((_, i) => {
    filters.push(`[${i}:v]setpts=PTS-STARTPTS[v${i}]`);
    filters.push(`[${i}:a]asetpts=PTS-STARTPTS[a${i}]`);
  });

  let lastVideo = 'v0';
  let lastAudio = 'a0';
  let timelineLength = segments[0].duration; // length of the output built so far

  for (let i = 1; i < segments.length; i++) {
    // The new slide starts fading in one transition before the current end.
    const offset = (timelineLength - transitionDuration).toFixed(3);
    filters.push(
      `[${lastVideo}][v${i}]xfade=transition=fade:duration=${transitionDuration}:offset=${offset}[vx${i}]`
    );
    // acrossfade overlaps the tail of the first stream with the head of the second.
    filters.push(`[${lastAudio}][a${i}]acrossfade=d=${transitionDuration}[ax${i}]`);
    lastVideo = `vx${i}`;
    lastAudio = `ax${i}`;
    // Each transition overlaps the two clips, so the output grows by less than
    // the full segment duration.
    timelineLength += segments[i].duration - transitionDuration;
  }

  return (
    `ffmpeg ${inputs} -filter_complex "${filters.join(';')}" ` +
    `-map "[${lastVideo}]" -map "[${lastAudio}]" output_final.mp4`
  );
}
Here's the thing: the xfade filter itself doesn't handle audio. You need to run acrossfade in parallel to crossfade the audio streams (amix only mixes both inputs starting at time zero, so it won't line up with the video transition). The offset parameter for xfade is critical: it's the timestamp in the output timeline where the second input video (the new slide) starts to appear. For the first transition this is (duration of the first segment) - (transition duration). For later ones it is (length of the output built so far) - (transition duration), and each earlier transition has already shortened that output by one transition duration. Getting these offsets wrong by even a few milliseconds leads to jarring audio/video desync. This is a common pitfall.
My Node.js orchestrator uses a timeline object to track each segment's start time, end time, and audio duration, then dynamically generates the FFmpeg commands. This ensures pixel-perfect and sample-perfect synchronization.
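A timeline like that could be as simple as the sketch below. buildTimeline is a hypothetical name and the field layout is my guess at a reasonable shape, assuming every transition overlaps the two neighbouring segments by the full transition length and every segment is longer than the transition.

```javascript
// Sketch: a minimal timeline, one entry per segment, in output-timeline seconds.
// buildTimeline and its field names are hypothetical.
function buildTimeline(durations, transition) {
  let previousEnd = 0;
  return durations.map((duration, index) => {
    // Every segment after the first starts one transition before the previous end.
    const start = index === 0 ? 0 : previousEnd - transition;
    const end = start + duration;
    previousEnd = end;
    return { index, start, end, duration };
  });
}

// The start of each later segment is the moment its transition begins.
function transitionOffsets(timeline) {
  return timeline.slice(1).map((entry) => entry.start);
}
```

With durations of 8, 5, and 6 seconds and a 0.5 second transition, the segments start at 0, 7.5, and 12 seconds and the final video is 18 seconds long, one second shorter than the raw sum because of the two overlaps.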
Initially, I tried to force-fit audio to fixed video durations by heavily relying on FFmpeg's atempo filter (-filter:a "atempo=speed_factor"). Big mistake. While atempo can change audio speed, chaining it multiple times with varying factors introduces subtle artifacts, especially if you're trying to speed up by >10% or slow down by >20%. It also makes the audio sound robotic or unnatural very quickly.
The Fix: Let the audio duration be the source of truth. Generate the audio first, measure its duration precisely with ffprobe, and then create a video segment exactly that long. If you must adjust audio speed, do it once at the edge-tts generation step with the --rate flag, as it's often less destructive than atempo for small adjustments.
Another early blunder: trying to do everything in one gigantic FFmpeg command. While technically possible, debugging a multi-stage filter_complex with dozens of inputs and overlays is a nightmare.
The Fix: Break it down.
When you're generating a 10-minute lecture video, FFmpeg can take a while. Here are a few things that helped:
preset and crf: For libx264 (H.264 video codec), -preset veryfast -crf 23 is a good balance. veryfast is quick, crf 23 gives decent quality. If you need it faster and can tolerate slightly larger files, try ultrafast. If you need smaller files and can wait longer, use medium or slow.
Hardware encoding: For NVIDIA GPUs, use -c:v h264_nvenc. For Intel, -c:v h264_qsv. This shaves off significant encoding time. You need FFmpeg compiled with support for these encoders, which isn't always the default.
Parallel encoding: Render independent segments in parallel from Node.js using child_process with Promise.all. Just be mindful of CPU/GPU core limits. I don't get why this isn't the default consideration for most local batch processing.
My system routinely churns out a 5-minute video (complex slides, dynamic text, transitions) in about 2-3 minutes on a decent desktop with an RTX 3060. That's a far cry from waiting 15-20 minutes for cloud renders and paying per minute.
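On the Promise.all point: launching every encode at once can oversubscribe the machine, so a small concurrency cap helps. The sketch below is one way to do it; runLimited is a hypothetical helper, and each task is assumed to be a function that returns a promise (for example, one that spawns an FFmpeg process).

```javascript
// Sketch: run promise-returning tasks with at most `limit` in flight at once.
// runLimited is a hypothetical helper; results keep the order of `tasks`.
async function runLimited(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;

  // Each worker keeps pulling the next unclaimed task until none remain.
  async function worker() {
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

A reasonable starting limit for software encoding is something like os.cpus().length divided by the threads each FFmpeg process uses; hardware encoders often need a lower cap, so treat the number as something to tune.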
Your Flutter app doesn't talk directly to Ollama. Instead, it communicates with a local Node.js (or any backend language) server. This server then makes HTTP requests to the Ollama API (usually http://localhost:11434/api/generate) to get the script. The Node.js server acts as an intermediary, handling model selection, prompt engineering, and streaming responses back to Flutter.
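For reference, a bare-bones version of that server-side call might look like this. It is a sketch assuming Node.js 18+ (for the built-in fetch) and a non-streaming request; buildGenerateRequest, generateScript, and the prompt wording are hypothetical.

```javascript
// Sketch: request a lecture script from a local Ollama instance.
// buildGenerateRequest and generateScript are hypothetical helpers.
const OLLAMA_URL = 'http://localhost:11434/api/generate';

function buildGenerateRequest(topic, model = 'llama3:8b') {
  return {
    model,
    prompt: `Write a short lecture script about: ${topic}`,
    stream: false, // ask for a single JSON object instead of streamed chunks
  };
}

async function generateScript(topic, model) {
  const res = await fetch(OLLAMA_URL, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildGenerateRequest(topic, model)),
  });
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  const data = await res.json();
  return data.response; // the generated text
}
```

Setting stream to false keeps the example short; for live progress in the Flutter UI you would leave streaming on and forward the chunks as they arrive.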
Yes, Edge-TTS supports a wide range of voices and languages available in Microsoft Edge's built-in TTS capabilities. You can list available voices using edge-tts --list-voices. Just pick the voice ID (e.g., en-US-JennyNeural, en-IN-NeerjaNeural) and pass it to the --voice argument in your command line calls.
For Ollama with llama3:8b, you'll want at least 16GB RAM (32GB is better) and ideally a dedicated GPU with 8GB+ VRAM for decent generation speeds. FFmpeg is CPU-intensive for software encoding, so a multi-core CPU helps, but a GPU with hardware encoding support (NVIDIA NVENC, Intel Quick Sync) will drastically reduce video synthesis time. A fast SSD is also beneficial for handling intermediate files.
Building a full-stack AI lecture video creator this way is no small feat, but the payoff in cost savings and control is massive. You get to control every pixel, every audio sample. If you're serious about AI content generation without burning through your budget, this local-first approach to building Flutter AI lecture videos is the way to go. Forget the fancy cloud dashboards; real engineering happens where the bits move.