{"slug": "voicebox-the-open-source-ai-voice-studio-that-just-hit-28k-stars", "title": "Voicebox: The Open-Source AI Voice Studio That Just Hit 28K Stars", "summary": "Voicebox, an open-source AI voice studio with 28,500 GitHub stars and an MIT license, runs entirely on local hardware and combines voice cloning, dictation, and text-to-speech across 23 languages. The project ships seven TTS engines, including Qwen3-TTS for multilingual cloning and Kokoro for CPU-only operation, along with a built-in MCP server that lets AI agents like Claude Code and Cursor speak through cloned voices with customizable personalities. Voicebox also includes a global hotkey for local dictation using Whisper-based speech-to-text, with support for Apple Silicon, NVIDIA, AMD, and Intel Arc hardware.", "body_md": "I've been watching the voice AI space for a while. ElevenLabs does voice cloning incredibly well. WisprFlow nails voice dictation. But both live in the cloud, both cost money every month, and both require uploading your voice data to someone else's server.\n\nThat's why [Voicebox](https://github.com/jamiepine/voicebox) caught my attention. 28.5k GitHub stars, MIT license, and it runs entirely on your machine. It combines what ElevenLabs does (voice output) with what WisprFlow does (voice input), ties them together with a local LLM, and wraps everything in a polished desktop app.\n\nThe voice cloning takes seconds of reference audio. Upload a short clip, and Voicebox builds a voice model that sounds like you. 
It covers 23 languages — English, Chinese, Japanese, Arabic, Hindi, Swahili, and more.\n\nUnder the hood, Voicebox ships with 7 TTS engines:\n\n| Engine | Best For |\n|---|---|\n| Qwen3-TTS | High-quality multilingual cloning, natural-language delivery instructions |\n| Chatterbox Turbo | Emotion tags (`[laugh]`, `[sigh]`, `[gasp]`) for expressive speech |\n| LuxTTS | Lightweight (~1GB VRAM), 48kHz, 150x realtime on CPU |\n| Kokoro | 82M model, 50 curated preset voices, runs on CPU |\n| TADA | HumeAI speech-language model, 700s+ coherent audio |\n| Qwen CustomVoice | Delivery control without reference audio |\n| Chatterbox Multilingual | 23 languages, broadest coverage |\n\nIf you don't want to clone anything, there are 50+ preset voices ready to go. And after generating audio, you get a full effects panel — reverb, delay, compression, pitch shift, chorus — all powered by Spotify's Pedalboard library, with real-time preview.\n\nThis is the feature that made me actually excited.\n\nVoicebox ships a built-in MCP (Model Context Protocol) server. Any MCP-compatible agent — Claude Code, Cursor, Cline, Windsurf — can call it to speak. Setup takes one command:\n\n```\nclaude mcp add voicebox \\\n  --transport http \\\n  --url http://127.0.0.1:17493/mcp \\\n  --header \"X-Voicebox-Client-Id: claude-code\"\n```\n\nAfter that, your agent can speak through your cloned voice. \"Tests passed, ready to merge\" — in a voice you chose.\n\nYou can assign different voices to different agents. Hear one voice for your code reviewer, another for your deployment bot. And the real kicker: **voice personalities**. Attach a persona description like \"calm engineer\" or \"sarcastic code reviewer,\" and Voicebox's local LLM rewrites the agent's output to match that personality before synthesizing speech. Your agents don't just sound different — they talk differently.\n\nVoicebox includes a global hotkey for dictation. 
Hold it, speak, release — text pastes into whatever text field you're focused on. On macOS, it uses the accessibility API for precise paste injection without touching your clipboard.\n\nAll dictation stays local. Whisper-based STT runs on your machine. An optional LLM refinement pass cleans up ums and stutters.\n\n| Hardware | Backend |\n|---|---|\n| Apple Silicon | MLX (Metal, 4-5x speed) |\n| NVIDIA GPU | CUDA |\n| AMD GPU | ROCm |\n| Intel Arc | IPEX/XPU |\n| CPU only | Kokoro 82M works fine |\n\nThe app ships as a DMG for macOS and MSI for Windows. First launch auto-downloads the model weights you need — Kokoro is 82MB, Qwen3-TTS a few GB. REST API and MCP server both listen on `localhost:17493`, with docs at `http://127.0.0.1:17493/docs`.\n\nVoice I/O going local was always going to happen. Cloud convenience is real, but voice data is biometric data — losing it is closer to losing your fingerprint than losing your email. The fact that open-source TTS and STT models are now good enough to run on consumer hardware changes the equation.\n\nVoicebox isn't just a useful tool. It's a proof point that agents don't have to be silent text boxes. 
They can speak, emote, and have personality — all without sending your voice to a data center.", "url": "https://wpnews.pro/news/voicebox-the-open-source-ai-voice-studio-that-just-hit-28k-stars", "canonical_source": "https://dev.to/hiroki-ii-ai/voicebox-the-open-source-ai-voice-studio-that-just-hit-28k-stars-77g", "published_at": "2026-05-26 09:45:16+00:00", "updated_at": "2026-05-26 10:04:57.615212+00:00", "lang": "en", "topics": ["ai-tools", "generative-ai", "ai-products"], "entities": ["Voicebox", "ElevenLabs", "WisprFlow", "Qwen3-TTS", "Chatterbox Turbo", "LuxTTS", "Kokoro", "TADA"], "alternates": {"html": "https://wpnews.pro/news/voicebox-the-open-source-ai-voice-studio-that-just-hit-28k-stars", "markdown": "https://wpnews.pro/news/voicebox-the-open-source-ai-voice-studio-that-just-hit-28k-stars.md", "text": "https://wpnews.pro/news/voicebox-the-open-source-ai-voice-studio-that-just-hit-28k-stars.txt", "jsonld": "https://wpnews.pro/news/voicebox-the-open-source-ai-voice-studio-that-just-hit-28k-stars.jsonld"}}