Polly wants a transcript: giving agents ears and a voice, on your own machine

wpnews.pro


[ARTICLE · art-19189] src=dev.to ↗ pub=2026-05-31T13:11Z topic=ai-tools verified=true sentiment=↑ positive

A developer created Kesha Voice Kit, a local-first speech-to-text and text-to-speech toolkit that runs entirely on-device as a single ~20 MB Rust binary with no cloud dependencies, API keys, or accounts required. The tool, named after a cartoon parrot, includes multilingual voice support via the Kokoro model on Apple Silicon and implements stable error codes with machine-readable failure messages. The project's design philosophy emphasizes explicit user consent for model downloads and fast failure with clear error messages rather than producing confident but incorrect output.

4 min read · 22 views · published May 31, 2026

Half the messages I send my coding agents these days start life as a voice note. I'm walking the dog, an idea lands, I mumble it into my phone, and later something turns it into text an agent can actually act on. It's a great workflow — right up until you notice where the audio goes to become text.

Because the default answer to "transcribe this" is still: ship it to someone's cloud. Whisper API, AWS Transcribe, Deepgram, Google Speech-to-Text. Your voice — which is about as personal as data gets — leaves the building, runs through their model, on their meter, under a privacy policy they can rewrite on a Tuesday. And when you want the round trip — text back to speech — it's the same story: AWS Polly, ElevenLabs, another key, another bill.

Meanwhile my laptop has a Neural Engine sitting mostly idle. Whisper-class models run locally just fine now. So why is the audio leaving at all?

That itch turned into Kesha Voice Kit 🦜 — and Polly can keep her transcript.

Kesha (yes, named after the cartoon parrot; "Свободу попугаям!", or "Freedom to the parrots!", is literally the demo clip) is a local-first voice toolkit: speech-to-text and back, no cloud, no account, no API key. The whole thing is one ~20 MB Rust binary: no Python, no ffmpeg, no native Node addons to babysit.

The CLI is a thin Bun wrapper; the engine is the Rust binary it shells out to. Pipe-friendly by design — transcript on stdout, errors on stderr.

bun add -g @drakulavich/kesha-voice-kit
kesha install                 # downloads engine + models (explicit — never automatic)

kesha audio.ogg               # → transcript to stdout

Want it to talk?

kesha install --tts           # opt-in voices (~990 MB)
kesha say "Свободу попугаям!" > freedom.wav

That's it. No OPENAI_API_KEY, no region to pick, no spend alert to set up.

The release I just cut adds two things I'd wanted for a while.

Multilingual voices. Text-to-speech used to be English + Russian and not much else. 1.22.0 wires up the multilingual Kokoro voices on Apple Silicon, so kesha say now covers Spanish, French, Italian, Portuguese and more, all on the Neural Engine, all offline.

Stable error codes everywhere. Every failure path now prints a machine-readable line:

error [E_MODEL_MISSING]: TTS models not installed. Run: kesha install --tts

kesha --error-codes-json dumps the whole taxonomy. If you're driving Kesha from a script or an agent, you no longer have to grep prose to find out why it bailed. There's even a test that fails CI if a code exists in the binary but not in the docs: drift is a build break, not a surprise.
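For a script or agent on the receiving end, pulling the code out of stderr could look roughly like the sketch below. It assumes every coded failure starts with a line shaped like the `error [E_...]: message` examples in this article; the `KeshaError` type and `parseKeshaError` function are hypothetical names made up for illustration, not part of Kesha's API.

```typescript
// Shape of a parsed failure. Field names are hypothetical, chosen for this sketch.
interface KeshaError {
  code: string; // e.g. "E_MODEL_MISSING"
  message: string; // human-readable text after the code
}

// Assumption: a coded failure starts with a line like "error [E_SOMETHING]: text",
// matching the examples shown in this article.
const ERROR_LINE = /^error \[(E_[A-Z0-9_]+)\]:\s*(.*)$/;

function parseKeshaError(stderr: string): KeshaError | null {
  const lines = stderr.split(/\r?\n/);
  for (let i = 0; i < lines.length; i++) {
    const match = ERROR_LINE.exec(lines[i] ?? "");
    if (match === null) continue;
    // Fold wrapped continuation lines into one message string.
    // Simplification: everything after the error line is treated as part of it.
    const first = (match[2] ?? "").trim();
    const parts: string[] = first.length > 0 ? [first] : [];
    for (let j = i + 1; j < lines.length; j++) {
      const extra = (lines[j] ?? "").trim();
      if (extra.length > 0) parts.push(extra);
    }
    return { code: match[1] ?? "", message: parts.join(" ") };
  }
  return null; // no coded error line found
}

const parsed = parseKeshaError(
  "error [E_MODEL_MISSING]: TTS models not installed. Run: kesha install --tts",
);
console.log(parsed?.code); // prints E_MODEL_MISSING
```

A caller can then branch on `code` instead of matching on prose, which is the point of keeping the codes stable.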

Here's the kind of thing that only shows up once real users point it at real text.

I added the multilingual voices, ran kesha say on a line of Hindi… and got noise. Not wrong words: actual garbage audio. No error, no warning. The most confident kind of broken.

The root cause is buried a layer down. The on-device Kokoro path phonemizes text with an English-only grapheme-to-phoneme model. Feed it Latin script and it's happy. Feed it Devanagari, kana, or Han characters and it doesn't fail — it just produces phonemes that mean nothing, and the model dutifully sings them.

I had three options:

Option	Verdict
Emit the garbage audio	No. Confidently wrong is the worst failure mode there is.
Quietly transliterate to Latin and guess	Fragile, surprising, hides the real gap
Refuse with a clear, coded error	✅

So now non-Latin text aimed at a Latin-only voice stops immediately:

error [E_SCRIPT_UNSUPPORTED]: voice 'hi' cannot phonemize Devanagari text;
it only supports Latin-script input. Romanize the text, or use a voice
whose engine supports Devanagari.

Exit code 4, a stable code, an actionable hint. Fail fast beats fail quietly, every single time — especially for an agent that can't hear that the WAV it got back is nonsense. Real multilingual G2P for those scripts is tracked as an open issue; until it lands, the tool tells you the truth instead of humming gibberish.
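A caller that wants to catch this before invoking the CLI could run a similar pre-check. The sketch below is not how the Rust engine does it; it is a guess at the idea using Unicode script properties, and both `NON_LATIN_SCRIPTS` and `findUnsupportedScript` are hypothetical names. It only looks for the three script families named above, so it is illustrative rather than exhaustive.

```typescript
// Script families the article names as unsupported by the Latin-only phonemizer,
// paired with a Unicode script-property pattern that detects each one.
// Assumption: this list is illustrative and not Kesha's actual rule set.
const NON_LATIN_SCRIPTS: Array<[string, RegExp]> = [
  ["Devanagari", new RegExp("\\p{Script=Devanagari}", "u")],
  ["kana", new RegExp("[\\p{Script=Hiragana}\\p{Script=Katakana}]", "u")],
  ["Han", new RegExp("\\p{Script=Han}", "u")],
];

// Returns the name of the first listed script found in the text,
// or null when none of them appears.
function findUnsupportedScript(text: string): string | null {
  for (let i = 0; i < NON_LATIN_SCRIPTS.length; i++) {
    const entry = NON_LATIN_SCRIPTS[i];
    if (entry !== undefined && entry[1].test(text)) {
      return entry[0];
    }
  }
  return null;
}

console.log(findUnsupportedScript("नमस्ते")); // prints Devanagari
console.log(findUnsupportedScript("Hola, mundo")); // prints null
```

Checking up front is only a convenience. The coded error from the binary remains the source of truth, since the set of supported scripts depends on the voice.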

The reason any of this exists is that I wanted my agents to hear and speak without phoning home. So Kesha speaks the protocols they do:

kesha mcp exposes transcribe / synthesize / list-voices as tools to any MCP client (Claude, Cursor, Codex, Gemini).

The @drakulavich/kesha-voice-kit/core API is there if you'd rather call it from a Bun program.

A voice note in, a transcript out, an answer spoken back, and the audio never left the laptop.

It's not magic. Diarization and the multilingual voices are Apple-Silicon-only today (Linux/Windows get a clear error, not a crash). The first TTS download is ~990 MB — local models aren't free, they're just yours. And as above, true G2P for non-Latin scripts isn't here yet. I'd rather ship the limitation with a loud error than paper over it.

It's MIT, it's on GitHub and npm, and bun add -g @drakulavich/kesha-voice-kit is the whole install.

What does your local-first setup look like these days — and what's still quietly phoning home in your stack that you wish wasn't? 🦜


Read original on dev.to → dev.to/drakulavich/polly-wants-a-transcript-givi…
