Polly wants a transcript: giving agents ears and a voice, on your own machine

A developer created Kesha Voice Kit, a local-first speech-to-text and text-to-speech toolkit that runs entirely on-device as a single ~20 MB Rust binary with no cloud dependencies, API keys, or accounts required. The tool, named after a cartoon parrot, includes multilingual voice support via the Kokoro model on Apple Silicon and implements stable error codes with machine-readable failure messages. The project's design philosophy emphasizes explicit user consent for model downloads and fast failure with clear error messages rather than producing confident but incorrect output.

Half the messages I send my coding agents these days start life as a voice note. I'm walking the dog, an idea lands, I mumble it into my phone, and later something turns it into text an agent can actually act on. It's a great workflow, right up until you notice where the audio goes to become text.

Because the default answer to "transcribe this" is still: ship it to someone's cloud. Whisper API, AWS Transcribe, Deepgram, Google Speech-to-Text. Your voice, which is about as personal as data gets, leaves the building, runs through their model, on their meter, under a privacy policy they can rewrite on a Tuesday. And when you want the round trip (text back to speech) it's the same story: AWS Polly, ElevenLabs, another key, another bill.

Meanwhile my laptop has a Neural Engine sitting mostly idle. Whisper-class models run locally just fine now. So why is the audio leaving at all?

That itch turned into [Kesha Voice Kit](https://github.com/drakulavich/kesha-voice-kit) 🦜, and Polly can keep her transcript.

Kesha (yes, named after [the cartoon parrot](https://en.wikipedia.org/wiki/The_Return_of_the_Prodigal_Parrot); "Свободу попугаям!", meaning "Freedom to the parrots!", is literally the demo clip) is a local-first voice toolkit: speech-to-text (and back), no cloud, no account, no API key.

The whole thing is one ~20 MB Rust binary: no Python, no `ffmpeg`, no native Node addons to babysit. The CLI is a thin [Bun](https://bun.sh) wrapper; the engine is the Rust binary it shells out to. Pipe-friendly by design: transcript on stdout, errors on stderr.

    bun add -g @drakulavich/kesha-voice-kit
    kesha install        # downloads engine + models (explicit, never automatic)
    kesha audio.ogg      # → transcript to stdout

Want it to talk?

    kesha install --tts                          # opt-in voices (~990 MB)
    kesha say "Свободу попугаям!" > freedom.wav

That's it. No `OPENAI_API_KEY`, no region to pick, no spend alert to set up.

The release I just cut adds two things I'd wanted for a while.

**Multilingual voices.** Text-to-speech used to be English + Russian and not much else.
Version 1.22.0 wires up the multilingual [Kokoro](https://huggingface.co/hexgrad/Kokoro-82M) voices on Apple Silicon, so `kesha say` now covers Spanish, French, Italian, Portuguese and more, all on the Neural Engine, all offline.

**Stable error codes everywhere.** Every failure path now prints a machine-readable line:

    error E_MODEL_MISSING: TTS models not installed. Run: kesha install --tts

`kesha --error-codes-json` dumps the whole taxonomy. If you're driving Kesha from a script or an agent, you no longer have to grep prose to find out why it bailed. There's even a test that fails CI if a code exists in the binary but not in the docs: drift is a build break, not a surprise.

Here's the kind of thing that only shows up once real users point it at real text. I added the multilingual voices, ran `kesha say` on a line of Hindi… and got *noise*. Not wrong words, but actual garbage audio. No error, no warning. The most confident kind of broken.

The root cause is buried a layer down. The on-device Kokoro path phonemizes text with an English-only grapheme-to-phoneme model. Feed it Latin script and it's happy. Feed it Devanagari, kana, or Han characters and it doesn't fail; it just produces phonemes that mean nothing, and the model dutifully sings them.

I had three options:

| Option | Verdict |
|---|---|
| Emit the garbage audio | No. Confidently wrong is the worst failure mode there is. |
| Quietly transliterate to Latin and guess | Fragile, surprising, hides the real gap |
| Refuse with a clear, coded error | ✅ |

So now non-Latin text aimed at a Latin-only voice stops immediately:

    error E_SCRIPT_UNSUPPORTED: voice 'hi' cannot phonemize Devanagari text; it only supports Latin-script input. Romanize the text, or use a voice whose engine supports Devanagari.

Exit code 4, a stable code, an actionable hint. Fail fast beats fail quietly, every single time, especially for an agent that can't hear that the WAV it got back is nonsense.
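
To make the "refuse early" idea concrete, here is a minimal Python sketch of that kind of pre-flight guard. It is not Kesha's actual Rust code: the names `first_non_latin_script`, `check_latin_only`, and `ScriptUnsupported` are hypothetical, and reading the script off the Unicode character name is a coarse heuristic chosen for brevity. Only the `E_SCRIPT_UNSUPPORTED` code and exit status 4 are taken from the behavior described above.

```python
import unicodedata
from typing import Optional

# Taken from the post: the stable code and exit status for this failure.
ERROR_CODE = "E_SCRIPT_UNSUPPORTED"
EXIT_CODE = 4


class ScriptUnsupported(Exception):
    """Hypothetical exception carrying the coded message and exit status."""

    def __init__(self, voice: str, script: str) -> None:
        self.code = ERROR_CODE
        self.exit_code = EXIT_CODE
        super().__init__(
            f"error {ERROR_CODE}: voice '{voice}' cannot phonemize "
            f"{script.title()} text; it only supports Latin-script input."
        )


def first_non_latin_script(text: str) -> Optional[str]:
    """Return the script of the first non-Latin letter, or None if every letter is Latin.

    The script is approximated by the first word of the Unicode character
    name (e.g. "DEVANAGARI LETTER NA" -> "DEVANAGARI"). Digits, punctuation,
    whitespace, and combining marks are skipped.
    """
    for ch in text:
        if not ch.isalpha():
            continue
        script = unicodedata.name(ch, "UNKNOWN").split(" ", 1)[0]
        if script != "LATIN":
            return script
    return None


def check_latin_only(voice: str, text: str) -> None:
    """Refuse before synthesis instead of producing meaningless phonemes."""
    script = first_non_latin_script(text)
    if script is not None:
        raise ScriptUnsupported(voice, script)


if __name__ == "__main__":
    check_latin_only("es", "Buenos días")  # Latin script: passes silently
    try:
        check_latin_only("hi", "नमस्ते दुनिया")
    except ScriptUnsupported as err:
        print(err)
```

The point of the design is that the check runs before any audio is produced, so a caller gets a coded refusal rather than a WAV file it has no way to validate.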
Real multilingual G2P for those scripts is tracked as [an open issue](https://github.com/drakulavich/kesha-voice-kit/issues/492); until it lands, the tool tells you the truth instead of humming gibberish.

The reason any of this exists is that I wanted my agents to hear and speak without phoning home. So Kesha speaks the protocols they do:

- `kesha mcp` exposes transcribe / synthesize / list-voices as tools to any MCP client (Claude, Cursor, Codex, Gemini)
- `@drakulavich/kesha-voice-kit/core` API if you'd rather call it from a Bun program

A voice note in, a transcript out, an answer spoken back, and the audio never left the laptop.

It's not magic. Diarization and the multilingual voices are Apple-Silicon-only today (Linux/Windows get a clear error, not a crash). The first TTS download is ~990 MB: local models aren't free, they're just *yours*. And as above, true G2P for non-Latin scripts isn't here yet. I'd rather ship the limitation with a loud error than paper over it.

It's MIT, it's on [GitHub](https://github.com/drakulavich/kesha-voice-kit) and [npm](https://www.npmjs.com/package/@drakulavich/kesha-voice-kit), and `bun add -g @drakulavich/kesha-voice-kit` is the whole install.

What does your local-first setup look like these days, and what's still quietly phoning home in your stack that you wish wasn't? 🦜
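
P.S. For scripts and agents driving the CLI, a minimal Python sketch of consuming the coded error line might look like the following. Several things here are assumptions rather than documented behavior: the helper names `parse_error_line`, `transcribe`, and `KeshaFailure` are hypothetical, the exact punctuation around the code in real output may differ (so the regex is deliberately lenient), and the sketch assumes every failure exits non-zero. What comes from the post is that `kesha <file>` prints the transcript to stdout, errors go to stderr, and codes look like `E_MODEL_MISSING`.

```python
import re
import subprocess
from typing import NamedTuple, Optional

# Lenient on purpose: accepts "error E_X: msg" as well as "error[E_X]: msg",
# because the exact punctuation around the code is an assumption.
ERROR_LINE = re.compile(r"^error\W*(E_[A-Z0-9_]+)\W*:\s*(.*)$")


class CodedError(NamedTuple):
    code: str
    message: str


class KeshaFailure(RuntimeError):
    """Hypothetical wrapper so callers can branch on a stable code."""

    def __init__(self, code: str, exit_code: int, detail: str) -> None:
        super().__init__(f"{code} (exit {exit_code}): {detail}")
        self.code = code
        self.exit_code = exit_code


def parse_error_line(stderr_text: str) -> Optional[CodedError]:
    """Return the first coded error found in stderr output, or None."""
    for line in stderr_text.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match:
            return CodedError(match.group(1), match.group(2))
    return None


def transcribe(audio_path: str) -> str:
    """Run `kesha <file>` and return the transcript from stdout.

    Raises KeshaFailure on a non-zero exit, and FileNotFoundError if the
    `kesha` binary is not on PATH.
    """
    proc = subprocess.run(["kesha", audio_path], capture_output=True, text=True)
    if proc.returncode != 0:
        err = parse_error_line(proc.stderr)
        code = err.code if err else "UNKNOWN"
        detail = err.message if err else proc.stderr.strip()
        raise KeshaFailure(code, proc.returncode, detail)
    return proc.stdout.strip()
```

A caller can then branch on `failure.code` (for example, run `kesha install --tts` after an `E_MODEL_MISSING`) instead of matching on the human-readable message.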