{"slug": "transcribing-my-old-podcast-locally-with-open-source-ai", "title": "Transcribing my old podcast locally with open-source AI", "summary": "A podcaster transcribed ten hours of old interviews using open-source AI tools WhisperX and pyannote.audio, running entirely on a local laptop with no cloud costs. The process demonstrates how far speech-to-text and speaker diarization have advanced since 2016.", "body_md": "Back in 2016 and 2017 I recorded a podcast called [Syscast](/syscast/): interviews with people I admired in the Linux, open source and infrastructure world. [Matt Holt about Caddy](/syscast/1-matt-holt-creator-caddy-webserver/), [Daniel Stenberg about curl](/syscast/4-curl-libcurl-future-web-daniel-stenberg/), [Seth Vargo about Vault](/syscast/3-managing-secrets-vault-seth-vargo/), and a handful more. Ten episodes, roughly ten hours of audio, and then life got in the way and I put it on pause.\n\nThe one thing those episodes never had was transcripts. I always wanted them. Audio is nice, but you can’t search it, you can’t skim it, and Google can’t read it. The problem was that in 2016, transcribing ten hours of two-person interviews yourself just wasn’t realistic. Decent speech-to-text was a paid cloud service, and telling two speakers apart was basically a research project.\n\nIt’s 2026 now, so I did it in an evening, on my own laptop, with open-source models and no API bill. Here’s how.\n\n## The stack\n\nTwo open-source pieces do the work:\n\n- [WhisperX](https://github.com/m-bain/whisperX) wraps OpenAI’s [Whisper](https://github.com/openai/whisper) `large-v3` model for the actual speech-to-text, with word-level timestamps.\n- [pyannote.audio](https://github.com/pyannote/pyannote-audio) handles the speaker diarization.\n\n“Diarization” was a new word to me when I started this. It’s the step that splits a recording up by speaker: *this stretch is one voice, this stretch is another*, without knowing who either of them is yet. 
Whisper writes down *what* gets said; pyannote works out *who* said it. Put the two together and a two-person interview reads as a real back-and-forth instead of one long undivided block.\n\nBoth run locally. The audio never leaves the machine and there’s no per-minute cost. The only thing you need from the outside world is a free Hugging Face account.\n\n## Gated models on Hugging Face\n\nThe diarization models are gated on Hugging Face. No idea why. Are they dangerous and you need to sign a waiver? Who knows. 🤷‍♂️\n\nAll I had to do was create a (free) account and a read token, then click “agree” on the model pages before I could download them. I hadn’t done that at first, and WhisperX greeted me with this:\n\n```\nCould not download 'pyannote/speaker-diarization-3.1' pipeline.\nIt might be because the pipeline is private or gated...\n```\n\nThat reads like a network or auth bug, but it just means the licence wasn’t accepted yet. If you try this, accept the conditions on `pyannote/speaker-diarization-3.1` and `pyannote/segmentation-3.0` first, drop your token in `~/.hf_token`, and it works.\n\n## The pipeline\n\nSetup is a virtualenv and one install (I used [uv](https://github.com/astral-sh/uv)):\n\n```\nuv venv .venv-whisper\nuv pip install --python .venv-whisper whisperx\n```\n\nThe core is about a dozen lines of WhisperX. Load the model on CPU with int8 quantization (Apple Silicon has no usable CUDA path for this stack), transcribe, align for accurate word-level timestamps, then diarize and assign each word to a speaker:\n\n```python\nimport whisperx\n\naudio = whisperx.load_audio(mp3_path)\n\n# 1. transcribe with Whisper large-v3\nmodel = whisperx.load_model(\"large-v3\", \"cpu\", compute_type=\"int8\", language=\"en\")\nresult = model.transcribe(audio, batch_size=1, language=\"en\")\n\n# 2. 
align, for accurate word-level timestamps\nalign_model, meta = whisperx.load_align_model(language_code=\"en\", device=\"cpu\")\nresult = whisperx.align(result[\"segments\"], align_model, meta, audio, \"cpu\")\n\n# 3. diarize (who spoke when), then tag each word with a speaker\ndiarize = whisperx.DiarizationPipeline(use_auth_token=hf_token, device=\"cpu\")\nresult = whisperx.assign_word_speakers(diarize(audio), result)\n```\n\nThat gives me a list of segments, each with a `speaker`, `start`, `end` and `text`. Batching all ten episodes is just a loop over the mp3s, logging the wall-clock time per file (that’s where the benchmark below comes from):\n\n```\nfor episode in static/podcast/episodes/*.mp3; do\n    start=$(date +%s)\n    .venv-whisper/bin/python scripts/transcribe-syscast.py \"$episode\"\n    printf '%s\\t%ss\\n' \"$(basename \"$episode\")\" \"$(( $(date +%s) - start ))\"\ndone\n```\n\nRaw WhisperX output is choppy: lots of short segments, `SPEAKER_00`/`SPEAKER_01` labels (just whoever talked first and second), no paragraphs:\n\n```\n[0:00] SPEAKER_00: Welcome to a new episode of Syscast. My name is Mattias Geniar and today I'm joined by Seth Vargo from HashiCorp.\n[0:14] SPEAKER_01: Hey Mattias, I'm good. Doing well over here in Pittsburgh.\n```\n\nA small cleanup step merges consecutive segments from the same speaker into one turn, then splits long turns into paragraphs every few sentences:\n\n```\nturns = []\nfor seg in segments:\n    if turns and turns[-1][\"spk\"] == seg[\"speaker\"]:\n        turns[-1][\"text\"] += \" \" + seg[\"text\"].strip()   # same speaker, keep merging\n    else:\n        turns.append({\"spk\": seg[\"speaker\"], \"start\": int(seg[\"start\"]), \"text\": seg[\"text\"].strip()})\n```\n\nThe last touch is mapping `SPEAKER_00` to “Mattias” and `SPEAKER_01` to the guest (whose name is right there in the episode title), and fixing the obvious mis-hearings. 
Whisper was very confident my name is “Matthias Genjar”. 😁\n\n## The result\n\nThat lands on each episode page as a readable, speaker-labelled conversation with clickable timestamps to jump straight into the audio. For example, [Seth Vargo on Vault](/syscast/3-managing-secrets-vault-seth-vargo/) or [Jan-Piet Mens on Linux vs BSD](/syscast/9-linux-vs-bsd/).\n\n## How long it actually took\n\nFeasible doesn’t mean fast, though. `large-v3` plus diarization on a CPU (Apple Silicon, no usable GPU path for this stack) is slow. Per episode, on my own machine:\n\n| Episode | Audio | Transcribe time |\n|---|---|---|\n| [Nils De Moor, Docker](/syscast/2-docker-introduction-nils-de-moor/) | | |\n| [Daniel Stenberg, curl](/syscast/4-curl-libcurl-future-web-daniel-stenberg/) | | |\n| [James Cammarata, Ansible](/syscast/5-ansible-config-management-deploying-code-james-cammarata-red-hat/) | | |\n| [Scott Arciszewski, security](/syscast/6-application-security-cryptography-scott-arciszewski/) | | |\n| [Config Management Camp recap](/syscast/7-config-management-camp-kubernetes-sysdig-mgmt/) | | |\n| [Jan Somers, CPU wars](/syscast/8-intel-amd-arm-cpu-hardware-episode/) | | |\n| [Jan-Piet Mens, Linux vs BSD](/syscast/9-linux-vs-bsd/) | | |\n| **Total** | | **13h49m** |\n\nCall it roughly twice real-time, and about 14 hours of compute for the whole catalogue. I ran it overnight and the laptop got warm. If I wanted it done in minutes I’d have used a hosted API for a few dollars, but the point here was the opposite: can I do this myself, locally, for free? Yes.\n\n## Worth it?\n\nFor ten old episodes that weren’t getting much traffic, the value isn’t the compute time. A decade of conversations with people like Matt and Daniel is now text: searchable, skimmable and indexable. The back catalogue gets a second life, and it cost me nothing but a warm laptop and a night of electricity.\n\nThe thing that sticks with me is the timeline. 
The exact task that was out of reach for one person in 2016 now runs, end to end, on the laptop in front of me. That keeps happening, and it’s worth noticing.", "url": "https://wpnews.pro/news/transcribing-my-old-podcast-locally-with-open-source-ai", "canonical_source": "https://ma.ttias.be/transcribing-syscast-with-local-ai/", "published_at": "2026-06-13 17:47:44+00:00", "updated_at": "2026-06-13 18:17:42.027399+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "ai-tools", "developer-tools"], "entities": ["WhisperX", "OpenAI", "Whisper", "pyannote.audio", "Hugging Face", "Caddy", "curl", "Vault"], "alternates": {"html": "https://wpnews.pro/news/transcribing-my-old-podcast-locally-with-open-source-ai", "markdown": "https://wpnews.pro/news/transcribing-my-old-podcast-locally-with-open-source-ai.md", "text": "https://wpnews.pro/news/transcribing-my-old-podcast-locally-with-open-source-ai.txt", "jsonld": "https://wpnews.pro/news/transcribing-my-old-podcast-locally-with-open-source-ai.jsonld"}}