{"slug": "lessons-from-a-weekend-building-local-ai-workflows", "title": "Lessons from a weekend building local AI workflows", "summary": "A developer built a multi-agent video editor that uses speech-to-text and AI agents to shorten videos by removing unnecessary content, but the first iteration produced poor results. The project revealed three key challenges: the \"lost-in-the-middle\" problem where AI models struggle to process information from the middle of long transcripts, a compound bias problem, and the limitations of Whisper speech recognition. The tool is available on GitHub but is not production-ready.", "body_md": "# Lessons from a weekend building local AI workflows\n\nLike everyone and their grandmother, these days I am into Agents! I finally got to spend some time learning more about multi-agent workflows: I came up with a simple use case, built a first iteration and watched it shatter against messy reality. Then I learned a few things.\n\nThis post shares three of them: lost-in-the-middle, the compound bias problem, and that Whisper isn’t a silver bullet.\n\nThe tool I built sort of works and is available on [GitHub](https://github.com/StefanoPetrilli/MultiAgentVideoEditor). 
The whole thing is a multi-agent video editor that takes a video and outputs a shortened version by removing all the fluff so just the juicy parts remain.\n\nDon’t expect production-ready magic, but I find it pretty entertaining :).\n\n## Naive solution\n\nThe first naive solution that came to my mind is the following:\n\n```mermaid\ngraph TD\n    A[Initial Video] -->|\"raw video\"| B[Speech To Text]\n    B -->|\"full transcript\"| C[Editor Agent]\n    B -->|\"full transcript\"| D[Reviewer Agent]\n    C -->|\"proposed cuts\"| D\n    D -.->|\"❌ Rejected: retry\"| C\n    D -->|\"✅ Accepted: cut list\"| E[Video Editing Agent]\n    E -->|\"stitched video\"| F[Final Video]\n```\n\nThe plan: take a video, run it through a speech-to-text model to get the transcription, feed the full transcript into an Editor Agent that decides what the most important segments are, then feed the full transcript and the selected segments to a Reviewer Agent tasked with deciding whether the selected sections of the video actually preserve the message. In this plan, the Editor Agent and the Reviewer Agent would go back and forth until the Reviewer agrees with the Editor’s selection. Finally, FFmpeg stitches the final video together.\n\nOn paper? Flawless. In reality? 
The output looked terrible 🥹.\n\nYou can look at it yourself:\n\n### Original\n\n### First iteration version\n\nThe rest of the post is about what went wrong and what I learned.\n\n## Lessons learned:\n\n### Lost-in-the-middle\n\nA 2024 paper, [Lost in the Middle: How Language Models Use Long Contexts](https://arxiv.org/pdf/2307.03172), documents that models oversample the beginning and the end of their context window and are less effective at retrieving information from the middle.\n\nWhat this paper documents won’t surprise the OG ChatGPT 3.5 users who, in one way or another, already experienced this firsthand.\n2026 is a different geological era compared to 2024 in the LLM world, and this defect has become much less noticeable as models have improved and can juggle longer context windows. Still, *lost-in-the-middle* appears to be inherent to transformer architectures, so the problem remains.\n\nIt’s also difficult to report on more recent literature on this topic. LLMs aren’t a moving target, they’re a running target. Every finding might be obsolete the moment a new model generation comes out.\nThe most recent literature I found on the topic is the paper [LongFuncEval: Measuring the effectiveness of long context models for function calling](https://arxiv.org/pdf/2505.10570), where appendix F is entirely dedicated to measuring this on the SOTA of May 2025. Empirically, lost-in-the-middle is still alive and kicking, at least with the model families tested on this project: DeepSeek V4, Qwen 3.7, and GLM 5.\n\nThe editor agent from the workflow is the perfect storm for *lost-in-the-middle*. The videos tested on the workflow are quite long. Often, the real theme hides under a pile of fluff, exactly in the area where the models are least sensitive: around the middle.\n\nOften the creator makes a short summary of the content at the beginning of the video. 
So the LLM, which by design oversamples that part, easily decides that the introductory summary is everything the user needs to know. Often the opposite is true: the initial summary brings very little value, and the middle is the juicy part that interests the user.\n\nThis resulted in the editor agent always oversampling the introduction or the end of the video.\nThe solution was to modify the architecture to add one more node to the workflow. The new agent receives the whole transcript and extracts the core message from it. Then the agent passes that along to the editor and to the reviewer in the format of [core message] + [full transcript] + [core message]. This idea came from reading the original *lost-in-the-middle* paper.\n\nI had zero expectations for it to work, but surprisingly the agents stopped oversampling the beginning of the videos.\n\n### The compound bias problem:\n\nThe initial assumption for the workflow was that the Editor and the Reviewer would debate and iterate before coming to an agreement. What really happened is that the reviewer agent acted as a rubber stamp. It was basically always approving the editor’s proposals.\n\nI peeked at the literature and what I discovered is elegantly summarized by this quote: “LLMs’ inherent sycophancy can collapse debates into premature consensus, potentially undermining the benefits of multi-agent debate. Sycophancy is a core failure mode that amplifies disagreement collapse before reaching a correct conclusion” which comes from the paper [Peacemaker or Troublemaker: How Sycophancy Shapes Multi-Agent Debate](https://arxiv.org/pdf/2509.23055).\nThe other paper consulted on the topic is [*Limits of Self-Correction in LLMs: an Information-Theoretic Analysis of Correlated Errors*](https://www.techrxiv.org/doi/full/10.36227/techrxiv.176834656.66652387/v2), which has a much more mathematical perspective on the matter. My background knowledge isn’t sufficient to judge whether the mathematical aspect of the paper is sound, but the core argument seems plausible: when evaluator error couples with generator error, self-evaluation becomes non-identifying and agreement provides negligible evidence of correctness. This is a rabbit hole on its own. It could use its own blog post.\n\nA straightforward solution to this was to use a different LLM model family for the Editor and Reviewer agents. Basically, the biases of one model are just compounded if it’s asked to judge the output of another instance of itself. When the models are different, their errors are less correlated and the biases tend to balance out. In my experience, when using DeepSeek V4 Flash for both the Editor and the Reviewer, the reviewer never rejected the first proposal. As soon as I switched the reviewer to a different model, it started rejecting the first proposal.\n\n### Whisper isn’t a silver bullet\n\nBecause Whisper is on everyone’s lips, I was under the assumption that it would be the best model for my task.\n\nTraining for Whisper models uses massive amounts of weakly supervised data and much of this data comes from internet videos with subtitles, as explained in the [original Whisper paper](https://arxiv.org/pdf/2212.04356). It’s a common opinion in the LLM community that training on this kind of data conditioned the model to chop text based on the visual constraints of a screen and acoustic pauses, rather than grammatical boundaries.\n\nI also discovered that Whisper models are notoriously weak at timestamping the sentences they transcribe. In my workflow, timestamps that aren’t perfectly aligned often resulted in chopped words. 
Another consequence is that the speech-to-text would sometimes split a single logical sentence into two parts if the speaker took a breath mid-sentence or paused for emphasis.\n\nUsing [WhisperX](https://arxiv.org/abs/2303.00747) solves some of Whisper’s weak points. WhisperX integrates Whisper into a longer pipeline and results in better timestamping and sentence splitting. Because it didn’t integrate easily with my stack and seemed a bit tricky to set up, my choice ultimately fell on [Vosk](https://alphacephei.com/vosk/). Empirically, the Vosk models produce output that’s qualitatively similar to Whisper while using Acoustic Alignment for better timestamping and Voice Activity Detection to split the sentences in a reasonable way.\n\nAfter hearing wonderful things about Whisper for months, it was quite a surprise that it was swiftly beaten in this specific use case by an underdog I had never heard of.\n\n### Reworked architecture\n\nBeaten up but not defeated, I ended up with this architecture after the changes.\n\n```mermaid\ngraph TD\n    A[Initial Video] -->|\"raw video\"| B[Speech To Text]\n    B -->|\"full transcript\"| C[Topic Agent]\n    C -->|\"[core message] + transcript\"| D[Editor Agent]\n    C -->|\"[core message] + transcript\"| E[Reviewer Agent]\n    D -->|\"proposed cuts\"| E\n    E -.->|\"❌ Rejected: retry\"| D\n    E -->|\"✅ Accepted: cut list\"| F[Video Editing]\n    F -->|\"stitched video\"| G[Final Video]\n```\n\nAnd this is the resulting video:\n\n### Original\n\n### Version after the fixes\n\nThis one is much better and does a great job of preserving the main narrative.", "url": "https://wpnews.pro/news/lessons-from-a-weekend-building-local-ai-workflows", "canonical_source": "http://stefano.petrilli.xyz/building-ai-workflows/", "published_at": "2026-06-06 10:33:49+00:00", "updated_at": "2026-06-06 11:18:15.356429+00:00", "lang": "en", "topics": ["ai-agents", "generative-ai", "ai-tools", "natural-language-processing", "large-language-models"], 
"entities": ["Whisper", "GitHub", "MultiAgentVideoEditor", "StefanoPetrilli"], "alternates": {"html": "https://wpnews.pro/news/lessons-from-a-weekend-building-local-ai-workflows", "markdown": "https://wpnews.pro/news/lessons-from-a-weekend-building-local-ai-workflows.md", "text": "https://wpnews.pro/news/lessons-from-a-weekend-building-local-ai-workflows.txt", "jsonld": "https://wpnews.pro/news/lessons-from-a-weekend-building-local-ai-workflows.jsonld"}}