# Voicebox: The Open-Source AI Voice Studio That Just Hit 28K Stars

> Source: <https://dev.to/hiroki-ii-ai/voicebox-the-open-source-ai-voice-studio-that-just-hit-28k-stars-77g>
> Published: 2026-05-26 09:45:16+00:00

I've been watching the voice AI space for a while. ElevenLabs does voice cloning incredibly well. WisprFlow nails voice dictation. But both live in the cloud, both cost money every month, and both require uploading your voice data to someone else's server.

That's why [Voicebox](https://github.com/jamiepine/voicebox) caught my attention. 28.5k GitHub stars, MIT license, and it runs entirely on your machine. It combines what ElevenLabs does (voice output) with what WisprFlow does (voice input), ties them together with a local LLM, and wraps everything in a polished desktop app.

Voice cloning needs only a few seconds of reference audio. Upload a short clip, and Voicebox builds a voice model that sounds like you. It covers 23 languages, including English, Chinese, Japanese, Arabic, Hindi, and Swahili.

Under the hood, Voicebox ships with 7 TTS engines:

| Engine | Best For |
|---|---|
| Qwen3-TTS | High-quality multilingual cloning, natural-language delivery instructions |
| Chatterbox Turbo | Emotion tags (`[laugh]`, `[sigh]`, `[gasp]`) for expressive speech |
| LuxTTS | Lightweight (~1GB VRAM), 48kHz, 150x realtime on CPU |
| Kokoro | 82M model, 50 curated preset voices, runs on CPU |
| TADA | HumeAI speech-language model, 700s+ coherent audio |
| Qwen CustomVoice | Delivery control without reference audio |
| Chatterbox Multilingual | 23 languages, broadest coverage |
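
If you plan to write scripts with emotion tags, it can help to check them before synthesis. Here is a minimal sketch that scans a script for bracketed tags and separates the three named in the table from anything else. The helper name and the idea that other tags should be flagged are my assumptions; the article does not list the full tag set Chatterbox Turbo accepts.

```python
import re

# Only the tags named in the table above. Treating anything else as
# "unrecognized" is an assumption, not documented Voicebox behavior.
KNOWN_TAGS = {"laugh", "sigh", "gasp"}
TAG_PATTERN = re.compile(r"\[([a-z]+)\]")


def check_emotion_tags(script: str) -> tuple[list[str], list[str]]:
    """Return (recognized, unrecognized) tags in order of appearance."""
    found = TAG_PATTERN.findall(script)
    recognized = [tag for tag in found if tag in KNOWN_TAGS]
    unrecognized = [tag for tag in found if tag not in KNOWN_TAGS]
    return recognized, unrecognized


if __name__ == "__main__":
    ok, unknown = check_emotion_tags("Well [sigh] the build failed again. [laugh] [shrug]")
    print("recognized:", ok)
    print("unrecognized:", unknown)
```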

If you don't want to clone anything, there are 50+ preset voices ready to go. And after generating audio, you get a full effects panel — reverb, delay, compression, pitch shift, chorus — all powered by Spotify's Pedalboard library, with real-time preview.

The agent integration is the feature that actually got me excited.

Voicebox ships a built-in MCP (Model Context Protocol) server. Any MCP-compatible agent — Claude Code, Cursor, Cline, Windsurf — can call it to speak. Setup takes one command:

```
claude mcp add voicebox \
  --transport http \
  --url http://127.0.0.1:17493/mcp \
  --header "X-Voicebox-Client-Id: claude-code"
```

After that, your agent can speak through your cloned voice. "Tests passed, ready to merge" — in a voice you chose.
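
For a client without a built-in MCP command, the same endpoint can in principle be addressed directly, since MCP messages are JSON-RPC 2.0. The sketch below only builds a `tools/list` request with the URL and header from the command above; it does not send anything. Treat it as an illustration under assumptions: a real session starts with an `initialize` handshake, and the article does not name the tools Voicebox exposes, so none are shown here. `build_mcp_request` is a hypothetical helper of mine.

```python
import json
import urllib.request
from typing import Optional

MCP_URL = "http://127.0.0.1:17493/mcp"  # endpoint from the command above
CLIENT_ID = "claude-code"               # header value from the command above


def build_mcp_request(
    method: str, request_id: int, params: Optional[dict] = None
) -> urllib.request.Request:
    """Build (but do not send) a JSON-RPC 2.0 POST for the MCP endpoint."""
    payload = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        payload["params"] = params
    return urllib.request.Request(
        MCP_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # MCP's HTTP transport may answer with JSON or an event stream.
            "Accept": "application/json, text/event-stream",
            "X-Voicebox-Client-Id": CLIENT_ID,
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_mcp_request("tools/list", request_id=1)
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Listing the available tools this way is how you would find out what the server can actually do, rather than guessing at tool names.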

You can assign different voices to different agents. Hear one voice for your code reviewer, another for your deployment bot. And the real kicker: **voice personalities**. Attach a persona description like "calm engineer" or "sarcastic code reviewer," and Voicebox's local LLM rewrites the agent's output to match that personality before synthesizing speech. Your agents don't just sound different — they talk differently.
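
To make the idea concrete, here is a rough sketch of how a per-agent voice and persona mapping could feed a rewrite step. Everything in it is hypothetical: the agent IDs, the voice names, and the prompt wording are mine, and Voicebox's actual prompt and storage format are not described in this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentVoice:
    voice: str    # hypothetical voice profile name
    persona: str  # free-text persona description, as in the examples above


# Hypothetical mapping from agent ID to voice and persona.
AGENT_VOICES = {
    "code-reviewer": AgentVoice(voice="reviewer-voice", persona="sarcastic code reviewer"),
    "deploy-bot": AgentVoice(voice="deploy-voice", persona="calm engineer"),
}


def build_rewrite_prompt(agent_id: str, message: str) -> str:
    """Build the text a local LLM could be asked to rewrite before synthesis."""
    profile = AGENT_VOICES[agent_id]
    return (
        f"Rewrite the following message in the voice of a {profile.persona}. "
        f"Keep every fact unchanged.\n\nMessage: {message}"
    )


if __name__ == "__main__":
    print(build_rewrite_prompt("deploy-bot", "Tests passed, ready to merge"))
```

The point of the two-step design is that the persona changes the wording while the voice profile changes the sound, so the two can be mixed freely.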

Voicebox includes a global hotkey for dictation. Hold it, speak, release — text pastes into whatever text field you're focused on. On macOS, it uses the accessibility API for precise paste injection without touching your clipboard.

All dictation stays local. Whisper-based STT runs on your machine. An optional LLM refinement pass cleans up ums and stutters.
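
To show what a cleanup pass does to a transcript, here is a deliberately simple stand-in that uses regular expressions instead of an LLM. It is not how Voicebox does it; it only illustrates the kind of transformation, and it has obvious limits (it would also collapse a legitimate "that that").

```python
import re

# Common spoken fillers; the list is illustrative, not exhaustive.
FILLERS = re.compile(r"\b(?:um+|uh+|er+m?)\b[,.]?\s*", re.IGNORECASE)
# A word immediately repeated one or more times, e.g. "the the" or "I I I".
STUTTER = re.compile(r"\b(\w+)(?:\s+\1\b)+", re.IGNORECASE)


def clean_transcript(text: str) -> str:
    """Drop filler words and collapse immediate word repetitions."""
    text = FILLERS.sub("", text)
    text = STUTTER.sub(r"\1", text)
    return re.sub(r"\s{2,}", " ", text).strip()


if __name__ == "__main__":
    print(clean_transcript("Um so the the build is uh passing now"))
    # prints: so the build is passing now
```

An LLM pass can go further than this, for example by fixing punctuation or restarts mid-sentence, which is presumably why the refinement step is model-based and optional.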

| Hardware | Backend |
|---|---|
| Apple Silicon | MLX (Metal, 4-5x speed) |
| NVIDIA GPU | CUDA |
| AMD GPU | ROCm |
| Intel Arc | IPEX/XPU |
| CPU only | Kokoro 82M works fine |
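
The table reads naturally as a lookup with a CPU fallback. A small sketch of that logic follows; the hardware keys and the priority order when several accelerators are present are my assumptions, and how Voicebox actually detects hardware is not covered here.

```python
# Backend names from the table above. The keys and the ordering are
# hypothetical; they only illustrate a "first match wins" selection.
BACKEND_PRIORITY = [
    ("apple_silicon", "MLX"),
    ("nvidia", "CUDA"),
    ("amd", "ROCm"),
    ("intel_arc", "IPEX/XPU"),
]


def pick_backend(detected: set[str]) -> str:
    """Return the first matching backend, or "CPU" if none is available."""
    for hardware, backend in BACKEND_PRIORITY:
        if hardware in detected:
            return backend
    return "CPU"


if __name__ == "__main__":
    print(pick_backend({"nvidia"}))  # prints: CUDA
    print(pick_backend(set()))       # prints: CPU
```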

The app ships as a DMG for macOS and MSI for Windows. First launch auto-downloads the model weights you need: Kokoro is 82MB, Qwen3-TTS a few GB. The REST API and MCP server both listen on `localhost:17493`, with docs at `http://127.0.0.1:17493/docs`.
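
Before pointing an agent or script at the server, a quick check that something is listening on that port can save some confusion. This sketch uses only the standard library and the host and port mentioned above; the helper names are mine, and a successful connection only tells you the port is open, not that it is Voicebox answering.

```python
import socket

HOST = "127.0.0.1"
PORT = 17493  # port mentioned above


def is_listening(host: str = HOST, port: int = PORT, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def docs_url(host: str = HOST, port: int = PORT) -> str:
    """URL of the API docs page mentioned above."""
    return f"http://{host}:{port}/docs"


if __name__ == "__main__":
    if is_listening():
        print("Server is up, docs at", docs_url())
    else:
        print("Nothing is listening on", f"{HOST}:{PORT}")
```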

Voice I/O going local was always going to happen. Cloud convenience is real, but voice data is biometric data — losing it is closer to losing your fingerprint than losing your email. The fact that open-source TTS and STT models are now good enough to run on consumer hardware changes the equation.

Voicebox isn't just a useful tool. It's a proof point that agents don't have to be silent text boxes. They can speak, emote, and have personality — all without sending your voice to a data center.
