{"slug": "i-built-a-voice-ai-tutor-in-200-lines-of-code-and-zero-backend", "title": "I Built a Voice AI Tutor in 200 Lines of Code (and Zero Backend)", "summary": "This article shows how to build a voice AI assistant entirely in a web browser using about 200 lines of code and no backend server. It breaks the process down into three core components: Speech-to-Text (STT) using the free Web Speech API, a Large Language Model (LLM) using Google's free Gemini 2.5 Flash, and Text-to-Speech (TTS) using the browser's built-in `window.speechSynthesis`. The author emphasizes that this simple, swappable architecture, in which each component has both free and premium options, is the fundamental pattern behind all modern voice AI systems.", "body_md": "Open Siri. Ask it a question. Listen to the reply.\n\nThat whole experience (the magic that powers Alexa, ChatGPT voice mode, every car assistant, every drive-through screen) is **three steps glued together**.\n\n- Turn microphone audio into text.\n- Send the text to a brain.\n- Turn the brain's reply back into audio.\n\nThat's it. The whole industry of voice AI is variations on those three boxes. Different brains, different microphones, different voices, but the shape is identical.\n\nToday I'm going to build the whole thing in your browser. No server. No install. No API key except a single free one. Open the tab, click the mic, talk to an AI. Total code: about 200 lines.\n\nThe pattern is the actual lesson. Once you see it, you can replace any box with a fancier one (Whisper for transcription, ElevenLabs for voices, your own fine-tuned model in the middle) and the architecture doesn't change.\n\n## The three Lego bricks\n\nLet me name them with the boring acronyms so you can search for them later:\n\n**STT: Speech-to-Text.** Microphone audio → string of words. The expensive option is OpenAI Whisper (best accuracy, costs well under a cent per minute). 
The free option, which I'm using here, is the **Web Speech API**, which has shipped in Chrome since 2013. You grant it microphone permission and it gives you back text. Zero key and no upload code to write: Chrome sends the audio to Google's recognizer behind the scenes for you. It's slightly less accurate than Whisper, especially on accents, but for a learning demo the difference doesn't matter.\n\n**LLM: the brain.** This is the part everyone gets excited about. You hand a string to a Large Language Model and it hands a string back. ChatGPT, Claude, and Gemini all expose the same shape: send a list of messages, get a message back. I'm using **Gemini 2.5 Flash** because Google gives it away free at 15 requests per minute. Beginners shouldn't have to wave a credit card to learn how this works.\n\n**TTS: Text-to-Speech.** String → audio you can play. The fancy option is ElevenLabs, whose voices are so good they sound uncanny. The free, zero-key option is `window.speechSynthesis`, which has shipped in every major browser since 2014. It sounds robotic, but it's instant and it costs nothing.\n\nNotice the pattern: every brick has an expensive flavor and a free flavor. The interfaces have the same shape. You can swap one for the other without changing the architecture. **That's why this is worth learning.**\n\n## Wiring the loop\n\nHere's the entire pipeline in pseudocode:\n\n```\nstate = \"idle\"\n\nwhile user wants to talk:\n    state = \"listening\"\n    text = await STT.listen()        # mic open until silence\n    state = \"thinking\"\n    reply = await LLM.ask(text)      # 1-2 seconds typically\n    state = \"speaking\"\n    await TTS.say(reply)             # plays through speakers\n    state = \"idle\"\n```\n\nThe state machine matters more than you'd think. If the user clicks the mic while the assistant is still talking, you need to cancel the playback. If they click while the LLM is still thinking, you need to ignore the click. 
UIs get confusing fast when you have four states and one button. I'll show you the React version in a minute.\n\n## The STT brick\n\nThe browser ships a class called `SpeechRecognition` (exposed under a `webkit` prefix, as `webkitSpeechRecognition`, in Safari and in Chrome builds that lack the unprefixed name). The API is event-based, not promise-based, which is a little annoying, but the pattern is straightforward:\n\n``` js\n// Use the unprefixed constructor when present, else the webkit-prefixed one.\nconst SR = window.SpeechRecognition || window.webkitSpeechRecognition;\nconst rec = new SR();\nrec.lang = \"en-US\";\nrec.continuous = true;       // keep mic open across pauses\nrec.interimResults = true;   // stream partials while user talks\n\n// onFinal and onPartial are callbacks supplied by the caller.\nrec.onresult = (e) => {\n  // Start at resultIndex so only results new to this event are visited.\n  for (let i = e.resultIndex; i < e.results.length; i++) {\n    const r = e.results[i];\n    if (r.isFinal) onFinal(r[0].transcript);\n    else onPartial(r[0].transcript);\n  }\n};\n\nrec.start();\n```\n\nTwo things to notice. First, **`interimResults`** is a gift. It streams text while the user is still talking, so you can show \"you're saying...\" in real time. It feels alive instead of laggy. Second, **`resultIndex`** lets you walk only the results that are new since the last event: the browser keeps the whole session's results in the `results` array, but you usually only care about what's new.\n\n## The LLM brick\n\nGoogle's SDK makes this almost embarrassingly short:\n\n``` js\nimport { GoogleGenerativeAI } from \"@google/generative-ai\";\n\nconst ai = new GoogleGenerativeAI(API_KEY);\nconst model = ai.getGenerativeModel({\n  model: \"gemini-2.5-flash\",\n  systemInstruction: \"Reply in 1-3 short sentences. No markdown.\",\n  generationConfig: { maxOutputTokens: 200 },\n});\n\n// history (earlier turns) and userText (latest transcript) come from the caller.\nconst chat = model.startChat({ history });\nconst result = await chat.sendMessage(userText);\nconst reply = result.response.text();\n```\n\nTwo design choices worth calling out.\n\n**System prompt.** I tell the model to keep answers to one to three short sentences. Why? Because the TTS will read every word. If Gemini writes a Wikipedia paragraph, your user is going to sit through 90 seconds of robot voice waiting for the next chance to talk. 
Voice AIs need to be terser than text AIs. This is a real lesson: half of building voice products is wrestling the model down to a sentence or two.\n\n**maxOutputTokens.** A hard ceiling. Even if the model decides to ignore the system prompt and ramble, this cuts it off. Belt and suspenders.\n\n## The TTS brick\n\n``` js\nconst u = new SpeechSynthesisUtterance(text);\nu.lang = \"en-US\";\nu.rate = 1.0;\nu.voice = bestVoiceFor(\"en-US\");   // helper that picks a loaded voice\nspeechSynthesis.cancel();   // kill anything currently playing\nspeechSynthesis.speak(u);\n```\n\nThe one gotcha: `speechSynthesis.getVoices()` can return an empty array the first time you call it. Voices load asynchronously and Chrome fires a `voiceschanged` event when they're ready. So I wrap voice-loading in a one-shot promise that callers can await. Otherwise your first reply plays in the browser's default voice instead of the nice Google one.\n\n## Wiring it in React\n\nThe whole React component is a state machine over `phase: \"idle\" | \"listening\" | \"thinking\" | \"speaking\"` and a list of messages.\n\n``` tsx\nconst [phase, setPhase] = useState<Phase>(\"idle\");\nconst [messages, setMessages] = useState<Message[]>([]);\n\nconst startListening = () => {\n  setPhase(\"listening\");\n  stt.start({\n    onFinal: async (text) => {\n      stt.stop();                        // one utterance per turn\n      const userMsg = { role: \"user\", text };\n      setMessages(curr => [...curr, userMsg]);\n      setPhase(\"thinking\");\n      const reply = await askGemini([...messages, userMsg], text);\n      setMessages(curr => [...curr, { role: \"model\", text: reply }]);\n      setPhase(\"speaking\");\n      // Return to idle only once playback has finished.\n      speak(reply, { onEnd: () => setPhase(\"idle\") });\n    },\n  });\n};\n```\n\nThe mic button changes label based on phase. Click it during `idle` to start listening; click it during `listening` or `speaking` to stop. The transcript renders as a list of bubbles. 
That's the whole UI.\n\n## What I learned actually building this\n\nA few real takeaways from spending an afternoon on this:\n\n**1. Browser TTS quality is better than you remember.** The Google voices on Chrome are genuinely fine. They were embarrassing in 2015. They're not embarrassing now. For a learning demo, ElevenLabs is overkill.\n\n**2. The pipeline is the lesson, not the tools.** When a recruiter says \"build a voice agent,\" they don't mean \"use these three specific libraries.\" They mean \"wire mic, brain, and speaker together with a state machine that doesn't get confused.\" Once you can do that, you can swap parts.\n\n**3. Voice changes how you prompt.** A system prompt that's great for ChatGPT (gives bulleted lists, uses headings) is terrible for voice. The TTS reads \"asterisk asterisk\" out loud. Tell the model \"no markdown, no lists, one paragraph\" or live with the consequences.\n\n**4. State machines beat booleans.** I started with `isListening` + `isThinking` + `isSpeaking` booleans. Within five minutes I had bugs where two were true at once. A single `phase` enum makes the impossible states actually impossible. Reach for this earlier than you think.\n\n**5. Free tiers are enough to learn on.** Gemini's free tier covers ~14,000 requests per day. You will not run out while learning. Don't let \"what API should I pay for\" stop you from starting.\n\n## Why this matters\n\nEvery \"AI agent\" startup right now is some variation of these three boxes plus a loop. Voice tutors, customer service bots, drive-throughs, in-car assistants, accessibility tools. Once you can wire the three bricks, you can build any of them. The hard part is taste: which brain, which voice, which prompt, which moment to interrupt. That's the next ten years of product work, and it's all built on top of the architecture you can spin up in a single afternoon.\n\nSo go spin it up. Open the repo. Read the commits one at a time. 
The first commit is an empty React shell. The seventh commit is the entire app. Each commit is one concept. You'll get more out of reading the seven small steps than reading one huge final file.\n\n## Try it / fork it\n\n🌐 Live: [https://voice-from-zero.vercel.app](https://voice-from-zero.vercel.app)\n\n🐙 Code: [https://github.com/dev48v/voice-from-zero](https://github.com/dev48v/voice-from-zero)\n\nThis is Day 35 of TechFromZero — a 50-day series where I build one tech from scratch every day with step-by-step commits you can read like a textbook. Yesterday was Stable Diffusion. Tomorrow is 3D in the browser with Three.js.\n\nIf you're learning AI and want a low-stakes way to actually ship something — clone the repo, change the model, change the voice, change the system prompt, and you'll have an entirely different demo by lunch. Make it a French tutor. Make it a Dungeon Master. Make it a meditation guide. The Legos snap together however you want.\n\n🌐 See all days: [https://dev48v.infy.uk/techfromzero.php](https://dev48v.infy.uk/techfromzero.php)\n\nTalk to you tomorrow.", "url": "https://wpnews.pro/news/i-built-a-voice-ai-tutor-in-200-lines-of-code-and-zero-backend", "canonical_source": "https://dev.to/dev48v/i-built-a-voice-ai-tutor-in-200-lines-of-code-and-zero-backend-7fe", "published_at": "2026-05-18 20:11:46+00:00", "updated_at": "2026-05-18 20:31:28.430976+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "developer-tools", "products"], "entities": ["OpenAI Whisper", "ElevenLabs", "Web Speech API", "Gemini 2.5 Flash", "Chrome", "Google", "Siri", "Alexa"], "alternates": {"html": "https://wpnews.pro/news/i-built-a-voice-ai-tutor-in-200-lines-of-code-and-zero-backend", "markdown": "https://wpnews.pro/news/i-built-a-voice-ai-tutor-in-200-lines-of-code-and-zero-backend.md", "text": "https://wpnews.pro/news/i-built-a-voice-ai-tutor-in-200-lines-of-code-and-zero-backend.txt", "jsonld": 
"https://wpnews.pro/news/i-built-a-voice-ai-tutor-in-200-lines-of-code-and-zero-backend.jsonld"}}