[ARTICLE · art-14278] src=dev.to pub= topic=ai-tools verified=true sentiment=↑ positive

Voicebox: The Open-Source AI Voice Studio That Just Hit 28K Stars

Voicebox, an open-source AI voice studio with 28,500 GitHub stars and an MIT license, runs entirely on local hardware and combines voice cloning, dictation, and text-to-speech across 23 languages. The project ships seven TTS engines, including Qwen3-TTS for multilingual cloning and Kokoro for CPU-only operation, along with a built-in MCP server that lets AI agents like Claude Code and Cursor speak through cloned voices with customizable personalities. Voicebox also includes a global hotkey for local dictation using Whisper-based speech-to-text, with support for Apple Silicon, NVIDIA, AMD, and Intel Arc hardware.

3 min read, published May 26, 2026

I've been watching the voice AI space for a while. ElevenLabs does voice cloning incredibly well. WisprFlow nails voice dictation. But both live in the cloud, both cost money every month, and both require uploading your voice data to someone else's server.

That's why Voicebox caught my attention. 28.5k GitHub stars, MIT license, and it runs entirely on your machine. It combines what ElevenLabs does (voice output) with what WisprFlow does (voice input), ties them together with a local LLM, and wraps everything in a polished desktop app.

Voice cloning needs only a few seconds of reference audio. Upload a short clip, and Voicebox builds a voice model that sounds like you. It covers 23 languages, including English, Chinese, Japanese, Arabic, Hindi, and Swahili.

Under the hood, Voicebox ships with 7 TTS engines:

Engine                   Best For
Qwen3-TTS                High-quality multilingual cloning, natural-language delivery instructions
Chatterbox Turbo         Emotion tags ([laugh], [sigh], [gasp]) for expressive speech
LuxTTS                   Lightweight (~1GB VRAM), 48kHz, 150x realtime on CPU
Kokoro                   82M model, 50 curated preset voices, runs on CPU
TADA                     HumeAI speech-language model, 700s+ coherent audio
Qwen CustomVoice         Delivery control without reference audio
Chatterbox Multilingual  23 languages, broadest coverage

If you don't want to clone anything, there are 50+ preset voices ready to go. And after generating audio, you get a full effects panel (reverb, delay, compression, pitch shift, chorus), all powered by Spotify's Pedalboard library, with real-time preview.

This is the feature that made me actually excited.

Voicebox ships a built-in MCP (Model Context Protocol) server. Any MCP-compatible agent (Claude Code, Cursor, Cline, Windsurf) can call it to speak. Setup takes one command:

claude mcp add --transport http voicebox http://127.0.0.1:17493/mcp \
  --header "X-Voicebox-Client-Id: claude-code"

After that, your agent can speak through your cloned voice: "Tests passed, ready to merge," in a voice you chose.
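To make that handshake a bit more concrete, here's a rough sketch of the first HTTP message an MCP client would send to that endpoint. The URL and the X-Voicebox-Client-Id header come from the setup command above; the protocol version string, the client version, and the function name are my assumptions for illustration, and the tools Voicebox actually exposes aren't covered. The sketch only builds the request, it doesn't send it.

```python
import json
import urllib.request

MCP_URL = "http://127.0.0.1:17493/mcp"  # endpoint from the setup command


def build_initialize_request(client_id: str, request_id: int = 1) -> urllib.request.Request:
    """Build (but do not send) a JSON-RPC 2.0 `initialize` call for an MCP server."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "initialize",
        "params": {
            # Assumed protocol revision; the server may negotiate a different one.
            "protocolVersion": "2025-03-26",
            "capabilities": {},
            # Hypothetical client identity, for illustration only.
            "clientInfo": {"name": client_id, "version": "0.0.1"},
        },
    }
    return urllib.request.Request(
        MCP_URL,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # MCP's HTTP transport lets the server reply with JSON or an SSE stream.
            "Accept": "application/json, text/event-stream",
            "X-Voicebox-Client-Id": client_id,
        },
    )


req = build_initialize_request("claude-code")
print(req.get_method(), req.full_url)
print(json.loads(req.data)["method"])
```

In practice the agent's MCP client does all of this for you; the point is just that "speaking" is an ordinary local HTTP call, nothing leaves the machine.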

You can assign different voices to different agents. Hear one voice for your code reviewer, another for your deployment bot. And the real kicker: voice personalities. Attach a persona description like "calm engineer" or "sarcastic code reviewer," and Voicebox's local LLM rewrites the agent's output to match that personality before synthesizing speech. Your agents don't just sound different; they talk differently.
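Here's how I picture the persona step, as a minimal sketch rather than Voicebox's actual implementation. The agent names, voice names, prompt wording, and the `build_rewrite_prompt` helper are all hypothetical; only the two persona descriptions come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceProfile:
    voice: str    # name of a cloned or preset voice (hypothetical names below)
    persona: str  # free-text persona description


# Hypothetical agent-to-profile mapping.
PROFILES = {
    "code-reviewer": VoiceProfile(voice="reviewer-voice", persona="sarcastic code reviewer"),
    "deploy-bot": VoiceProfile(voice="ops-voice", persona="calm engineer"),
}


def build_rewrite_prompt(agent: str, message: str) -> str:
    """Compose the instruction a local LLM could receive before synthesis."""
    profile = PROFILES[agent]
    return (
        f"Rewrite the message below in the voice of a {profile.persona}. "
        "Keep every fact unchanged and keep it short enough to speak aloud.\n\n"
        f"Message: {message}"
    )


print(build_rewrite_prompt("deploy-bot", "Tests passed, ready to merge"))
```

The rewritten text would then go to the TTS engine with the profile's voice, which is why two agents reporting the same result can end up phrasing it differently.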

Voicebox includes a global hotkey for dictation. Hold it, speak, release, and the text pastes into whatever text field you're focused on. On macOS, it uses the accessibility API for precise paste injection without touching your clipboard.

All dictation stays local. Whisper-based STT runs on your machine. An optional LLM refinement pass cleans up ums and stutters.
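To give a feel for what that refinement pass is cleaning up, here's a crude regex stand-in. To be clear, this is not how Voicebox does it (the app uses an LLM); the filler list, the patterns, and the `clean_transcript` name are my own assumptions, and the sketch is English-only.

```python
import re

# Filler words to drop; a deliberately small, English-only list.
FILLERS = re.compile(r"\b(?:um+|uh+|erm+|hmm+)\b[,.]?\s*", re.IGNORECASE)
# Immediate word repetitions such as "the the" or "I I I".
STUTTER = re.compile(r"\b(\w+)(?:\s+\1\b)+", re.IGNORECASE)


def clean_transcript(text: str) -> str:
    """Strip filler words and collapse repeated words in a raw transcript."""
    text = FILLERS.sub("", text)
    text = STUTTER.sub(r"\1", text)
    # Collapse any double spaces left behind by the removals.
    return re.sub(r"\s{2,}", " ", text).strip()


print(clean_transcript("Um, so I I think the the build is uh ready"))
# prints: so I think the build is ready
```

A regex like this can't restore capitalization, fix punctuation, or tell a deliberate repetition from a stutter, which is presumably why an LLM pass is the better fit.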

Hardware       Backend
Apple Silicon  MLX (Metal, 4-5x speed)
NVIDIA GPU     CUDA
AMD GPU        ROCm
Intel Arc      IPEX/XPU
CPU only       Kokoro 82M works fine
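The table above boils down to a simple lookup with a CPU fallback. A minimal sketch, assuming hypothetical hardware labels as keys (the app presumably detects hardware itself; `pick_backend` is just my illustration):

```python
from typing import Optional, Tuple

# Hardware-to-backend pairs from the table; the key strings are hypothetical labels.
BACKENDS = {
    "apple-silicon": "MLX",
    "nvidia": "CUDA",
    "amd": "ROCm",
    "intel-arc": "IPEX/XPU",
}


def pick_backend(hardware: str) -> Tuple[str, Optional[str]]:
    """Return (backend, engine hint); anything unrecognized falls back to CPU with Kokoro."""
    key = hardware.strip().lower()
    if key in BACKENDS:
        return BACKENDS[key], None
    return "CPU", "Kokoro"


print(pick_backend("nvidia"))
print(pick_backend("old-laptop"))
```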

The app ships as a DMG for macOS and MSI for Windows. First launch auto-downloads the model weights you need: Kokoro is 82MB, Qwen3-TTS a few GB. The REST API and MCP server both listen on localhost:17493, with docs at http://127.0.0.1:17493/docs.
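If you want to check from a script whether the app is up before pointing an agent at it, a plain TCP probe is enough. This is a small sketch that only tests whether the port accepts connections; it makes no assumptions about the REST endpoints themselves, and `is_listening` is my own helper name.

```python
import socket

HOST, PORT = "127.0.0.1", 17493  # address quoted in the article


def is_listening(host: str = HOST, port: int = PORT, timeout: float = 0.5) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if is_listening():
    print(f"Something is listening; try http://{HOST}:{PORT}/docs")
else:
    print("Nothing on that port yet; is the app running?")
```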

Voice I/O going local was always going to happen. Cloud convenience is real, but voice data is biometric data: losing it is closer to losing your fingerprint than losing your email. The fact that open-source TTS and STT models are now good enough to run on consumer hardware changes the equation.

Voicebox isn't just a useful tool. It's a proof point that agents don't have to be silent text boxes. They can speak, emote, and have personality, all without sending your voice to a data center.
