Half the messages I send my coding agents these days start life as a voice note. I'm walking the dog, an idea lands, I mumble it into my phone, and later something turns it into text an agent can actually act on. It's a great workflow — right up until you notice where the audio goes to become text.
Because the default answer to "transcribe this" is still: ship it to someone's cloud. Whisper API, AWS Transcribe, Deepgram, Google Speech-to-Text. Your voice — which is about as personal as data gets — leaves the building, runs through their model, on their meter, under a privacy policy they can rewrite on a Tuesday. And when you want the round trip — text back to speech — it's the same story: AWS Polly, ElevenLabs, another key, another bill.
Meanwhile my laptop has a Neural Engine sitting mostly idle. Whisper-class models run locally just fine now. So why is the audio leaving at all?
That itch turned into Kesha Voice Kit 🦜 — and Polly can keep her transcript.
Kesha (yes, named after the cartoon parrot; "Свободу попугаям!", or "Freedom for parrots!", is literally the demo clip) is a local-first voice toolkit: speech-to-text and back, no cloud, no account, no API key. The whole thing is one ~20 MB Rust binary: no Python, no ffmpeg, no native Node addons to babysit.
The CLI is a thin Bun wrapper; the engine is the Rust binary it shells out to. Pipe-friendly by design — transcript on stdout, errors on stderr.
bun add -g @drakulavich/kesha-voice-kit
kesha install # downloads engine + models (explicit — never automatic)
kesha audio.ogg # → transcript to stdout
Want it to talk?
kesha install --tts # opt-in voices (~990 MB)
kesha say "Свободу попугаям!" > freedom.wav
That's it. No OPENAI_API_KEY, no region to pick, no spend alert to set up.
The release I just cut adds two things I'd wanted for a while.
Multilingual voices. Text-to-speech used to be English + Russian and not much else. 1.22.0 wires up the multilingual Kokoro voices on Apple Silicon, so kesha say now covers Spanish, French, Italian, Portuguese and more, all on the Neural Engine, all offline.
Stable error codes everywhere. Every failure path now prints a machine-readable line:
error [E_MODEL_MISSING]: TTS models not installed. Run: kesha install --tts
kesha --error-codes-json dumps the whole taxonomy. If you're driving Kesha from a script or an agent, you no longer have to grep prose to find out why it bailed. There's even a test that fails CI if a code exists in the binary but not in the docs: drift is a build break, not a surprise.
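On the calling side, "no more grepping prose" can be as small as one regular expression. The following is a minimal sketch that assumes only the line format shown above (`error [CODE]: message`); the `KeshaError` type and the `parseKeshaError` name are hypothetical and not part of Kesha's API.

```typescript
// Shape of one parsed failure line. Field names are illustrative only.
interface KeshaError {
  code: string;    // e.g. "E_MODEL_MISSING"
  message: string; // text after the colon, first line only
}

// Matches the format shown in the post: error [E_SOMETHING]: message
const ERROR_LINE = /^error \[(E_[A-Z0-9_]+)\]:\s*(.*)$/;

// Scan captured stderr and return the first coded error, or null if none is found.
function parseKeshaError(stderr: string): KeshaError | null {
  for (const line of stderr.split(/\r?\n/)) {
    const match = ERROR_LINE.exec(line.trim());
    if (match !== null) {
      return { code: match[1], message: match[2] };
    }
  }
  return null;
}

// Branch on the stable code instead of on the wording of the message.
const parsed = parseKeshaError(
  "error [E_MODEL_MISSING]: TTS models not installed. Run: kesha install --tts\n"
);
if (parsed !== null && parsed.code === "E_MODEL_MISSING") {
  console.log("TTS models missing; run: kesha install --tts");
}
```

A script or agent would feed this whatever it captured from stderr and switch on `code`, so a reworded message never breaks the caller.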
Here's the kind of thing that only shows up once real users point it at real text.
I added the multilingual voices, ran kesha say on a line of Hindi… and got noise. Not wrong words: actual garbage audio. No error, no warning. The most confident kind of broken.
The root cause is buried a layer down. The on-device Kokoro path phonemizes text with an English-only grapheme-to-phoneme model. Feed it Latin script and it's happy. Feed it Devanagari, kana, or Han characters and it doesn't fail — it just produces phonemes that mean nothing, and the model dutifully sings them.
I had three options:
| Option | Verdict |
|---|---|
| Emit the garbage audio | No. Confidently wrong is the worst failure mode there is. |
| Quietly transliterate to Latin and guess | Fragile, surprising, hides the real gap |
| Refuse with a clear, coded error | ✅ |
So now non-Latin text aimed at a Latin-only voice stops immediately:
error [E_SCRIPT_UNSUPPORTED]: voice 'hi' cannot phonemize Devanagari text;
it only supports Latin-script input. Romanize the text, or use a voice
whose engine supports Devanagari.
Exit code 4, a stable code, an actionable hint. Fail fast beats fail quietly, every single time — especially for an agent that can't hear that the WAV it got back is nonsense. Real multilingual G2P for those scripts is tracked as an open issue; until it lands, the tool tells you the truth instead of humming gibberish.
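To make the "refuse early" idea concrete, here is a rough sketch of such a guard. It is not Kesha's actual implementation (the engine is Rust, and its real check is not shown in this post). It assumes that looking at basic Unicode ranges for the three scripts named above is enough, and the names `NON_LATIN_RANGES`, `findUnsupportedScript` and `checkVoiceInput` are hypothetical.

```typescript
// Code-unit ranges for the scripts named in the post. A simplification:
// a real check would use full Unicode script data, including Han
// characters outside the Basic Multilingual Plane.
const NON_LATIN_RANGES: Array<{ script: string; from: number; to: number }> = [
  { script: "Devanagari", from: 0x0900, to: 0x097f },
  { script: "Hiragana", from: 0x3040, to: 0x309f },
  { script: "Katakana", from: 0x30a0, to: 0x30ff },
  { script: "Han", from: 0x4e00, to: 0x9fff },
];

// Return the name of the first listed script found in the text, or null.
function findUnsupportedScript(text: string): string | null {
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    for (const range of NON_LATIN_RANGES) {
      if (unit >= range.from && unit <= range.to) {
        return range.script;
      }
    }
  }
  return null;
}

// Hypothetical guard: produce a coded error before phonemizing,
// instead of synthesizing noise. Returns null when the text is acceptable.
function checkVoiceInput(voice: string, text: string): string | null {
  const script = findUnsupportedScript(text);
  if (script === null) {
    return null;
  }
  return `error [E_SCRIPT_UNSUPPORTED]: voice '${voice}' cannot phonemize ${script} text`;
}

console.log(checkVoiceInput("hi", "नमस्ते"));
console.log(checkVoiceInput("es", "¿Qué tal?"));
```

The point of the design is that the check runs before any audio is produced, so the caller gets a code to act on rather than a WAV it has to listen to.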
The reason any of this exists is that I wanted my agents to hear and speak without phoning home. So Kesha speaks the protocols they do:
kesha mcp exposes transcribe / synthesize / list-voices as tools to any MCP client (Claude, Cursor, Codex, Gemini).
The @drakulavich/kesha-voice-kit/core API, if you'd rather call it from a Bun program.
A voice note in, a transcript out, an answer spoken back, and the audio never left the laptop.
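For the MCP route, a client invokes a tool by sending a JSON-RPC 2.0 `tools/call` request. The sketch below only builds that message. The `tools/call` method and the `name` / `arguments` fields come from the MCP specification, and "transcribe" is the tool named above, but the `path` argument is a hypothetical placeholder: Kesha's real tool schema is not shown in this post, and the exact tool identifiers may differ. A real client would also complete the MCP initialize handshake first.

```typescript
// JSON-RPC 2.0 request for MCP's tools/call method.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Build the request object; sending it over stdio is left to the MCP client.
function buildToolCall(
  id: number,
  name: string,
  args: Record<string, unknown>
): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id: id,
    method: "tools/call",
    params: { name: name, arguments: args },
  };
}

// "path" is an assumed argument name, used here for illustration only.
const request = buildToolCall(1, "transcribe", { path: "audio.ogg" });
console.log(JSON.stringify(request));
```

In practice an MCP client such as the ones listed above builds and sends this message for you; seeing the raw shape mostly helps when debugging what the agent actually asked for.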
It's not magic. Diarization and the multilingual voices are Apple-Silicon-only today (Linux/Windows get a clear error, not a crash). The first TTS download is ~990 MB — local models aren't free, they're just yours. And as above, true G2P for non-Latin scripts isn't here yet. I'd rather ship the limitation with a loud error than paper over it.
It's MIT, it's on GitHub and npm, and bun add -g @drakulavich/kesha-voice-kit is the whole install.
What does your local-first setup look like these days — and what's still quietly phoning home in your stack that you wish wasn't? 🦜