{"slug": "polly-wants-a-transcript-giving-agents-ears-and-a-voice-on-your-own-machine", "title": "Polly wants a transcript: giving agents ears and a voice, on your own machine", "summary": "A developer created Kesha Voice Kit, a local-first speech-to-text and text-to-speech toolkit that runs entirely on-device as a single ~20 MB Rust binary with no cloud dependencies, API keys, or accounts required. The tool, named after a cartoon parrot, includes multilingual voice support via the Kokoro model on Apple Silicon and implements stable error codes with machine-readable failure messages. The project's design philosophy emphasizes explicit user consent for model downloads and fast failure with clear error messages rather than producing confident but incorrect output.", "body_md": "Half the messages I send my coding agents these days start life as a voice note. I'm walking the dog, an idea lands, I mumble it into my phone, and later something turns it into text an agent can actually act on. It's a great workflow — right up until you notice *where* the audio goes to become text.\n\nBecause the default answer to \"transcribe this\" is still: ship it to someone's cloud. Whisper API, AWS Transcribe, Deepgram, Google Speech-to-Text. Your voice — which is about as personal as data gets — leaves the building, runs through *their* model, on *their* meter, under a privacy policy they can rewrite on a Tuesday. And when you want the round trip — text *back* to speech — it's the same story: AWS Polly, ElevenLabs, another key, another bill.\n\nMeanwhile my laptop has a Neural Engine sitting mostly idle. Whisper-class models run locally just fine now. 
So why is the audio leaving at all?\n\nThat itch turned into [Kesha Voice Kit](https://github.com/drakulavich/kesha-voice-kit) 🦜 — and Polly can keep her transcript.\n\nKesha (yes, named after [the cartoon parrot](https://en.wikipedia.org/wiki/The_Return_of_the_Prodigal_Parrot) — *\"Свободу попугаям!\"* is literally the demo clip) is a local-first voice toolkit: **speech-to-text and back**, no cloud, no account, no API key. The whole thing is one ~20 MB Rust binary — no Python, no `ffmpeg`, no native Node addons to babysit.\n\nThe CLI is a thin [Bun](https://bun.sh) wrapper; the engine is the Rust binary it shells out to. Pipe-friendly by design — transcript on stdout, errors on stderr.\n\n```\nbun add -g @drakulavich/kesha-voice-kit\nkesha install                 # downloads engine + models (explicit — never automatic)\n\nkesha audio.ogg               # → transcript to stdout\n```\n\nWant it to talk?\n\n```\nkesha install --tts           # opt-in voices (~990 MB)\nkesha say \"Свободу попугаям!\" > freedom.wav\n```\n\nThat's it. No `OPENAI_API_KEY`, no region to pick, no spend alert to set up.\n\nThe release I just cut adds two things I'd wanted for a while.\n\n**Multilingual voices.** Text-to-speech used to be English + Russian and not much else. 1.22.0 wires up the multilingual [Kokoro](https://huggingface.co/hexgrad/Kokoro-82M) voices on Apple Silicon, so `kesha say` now covers Spanish, French, Italian, Portuguese and more — all on the Neural Engine, all offline.\n\n**Stable error codes everywhere.** Every failure path now prints a machine-readable line:\n\n```\nerror [E_MODEL_MISSING]: TTS models not installed. Run: kesha install --tts\n```\n\n`kesha --error-codes-json` dumps the whole taxonomy. If you're driving Kesha from a script or an agent, you no longer have to grep prose to find out *why* it bailed. 
There's even a test that fails CI if a code exists in the binary but not in the docs — drift is a build break, not a surprise.\n\nHere's the kind of thing that only shows up once real users point it at real text.\n\nI added the multilingual voices, ran `kesha say` on a line of Hindi… and got **noise**. Not wrong words — actual garbage audio. No error, no warning. The most confident kind of broken.\n\nThe root cause is buried a layer down. The on-device Kokoro path phonemizes text with an **English-only** grapheme-to-phoneme model. Feed it Latin script and it's happy. Feed it Devanagari, kana, or Han characters and it doesn't *fail* — it just produces phonemes that mean nothing, and the model dutifully sings them.\n\nI had three options:\n\n| Option | Verdict |\n|---|---|\n| Emit the garbage audio | No. Confidently wrong is the worst failure mode there is. |\n| Quietly transliterate to Latin and guess | Fragile, surprising, hides the real gap |\n| Refuse with a clear, coded error | ✅ |\n\nSo now non-Latin text aimed at a Latin-only voice stops immediately:\n\n```\nerror [E_SCRIPT_UNSUPPORTED]: voice 'hi' cannot phonemize Devanagari text;\nit only supports Latin-script input. Romanize the text, or use a voice\nwhose engine supports Devanagari.\n```\n\nExit code 4, a stable code, an actionable hint. **Fail fast beats fail quietly, every single time** — especially for an agent that can't *hear* that the WAV it got back is nonsense. Real multilingual G2P for those scripts is [tracked as an open issue](https://github.com/drakulavich/kesha-voice-kit/issues/492); until it lands, the tool tells you the truth instead of humming gibberish.\n\nThe reason any of this exists is that I wanted my agents to hear and speak without phoning home. 
So Kesha speaks the protocols they do:\n\n- `kesha mcp` exposes transcribe / synthesize / list-voices as tools to any MCP client (Claude, Cursor, Codex, Gemini)\n- `@drakulavich/kesha-voice-kit/core` API if you'd rather call it from a Bun program\n\nA voice note in, a transcript out, an answer spoken back — and the audio never left the laptop.\n\nIt's not magic. Diarization and the multilingual voices are Apple-Silicon-only today (Linux/Windows get a clear error, not a crash). The first TTS download is ~990 MB — local models aren't free, they're just *yours*. And as above, true G2P for non-Latin scripts isn't here yet. I'd rather ship the limitation with a loud error than paper over it.\n\nIt's MIT, it's on [GitHub](https://github.com/drakulavich/kesha-voice-kit) and [npm](https://www.npmjs.com/package/@drakulavich/kesha-voice-kit), and `bun add -g @drakulavich/kesha-voice-kit` is the whole install.\n\nWhat does your local-first setup look like these days — and what's still quietly phoning home in your stack that you wish wasn't? 
🦜", "url": "https://wpnews.pro/news/polly-wants-a-transcript-giving-agents-ears-and-a-voice-on-your-own-machine", "canonical_source": "https://dev.to/drakulavich/polly-wants-a-transcript-giving-agents-ears-and-a-voice-on-your-own-machine-165", "published_at": "2026-05-31 13:11:25+00:00", "updated_at": "2026-05-31 13:42:18.915446+00:00", "lang": "en", "topics": ["ai-tools", "natural-language-processing", "ai-products", "ai-agents", "ai-infrastructure"], "entities": ["Kesha Voice Kit", "Whisper", "AWS Transcribe", "Deepgram", "Google Speech-to-Text", "AWS Polly", "ElevenLabs", "Bun"], "alternates": {"html": "https://wpnews.pro/news/polly-wants-a-transcript-giving-agents-ears-and-a-voice-on-your-own-machine", "markdown": "https://wpnews.pro/news/polly-wants-a-transcript-giving-agents-ears-and-a-voice-on-your-own-machine.md", "text": "https://wpnews.pro/news/polly-wants-a-transcript-giving-agents-ears-and-a-voice-on-your-own-machine.txt", "jsonld": "https://wpnews.pro/news/polly-wants-a-transcript-giving-agents-ears-and-a-voice-on-your-own-machine.jsonld"}}