# I Built a Voice AI Tutor in 200 Lines of Code (and Zero Backend)

This post builds a voice AI assistant entirely in a web browser, using about 200 lines of code and no backend server. It breaks the process into three core components: Speech-to-Text (STT) using the free Web Speech API, a Large Language Model (LLM) using Google's free Gemini 2.5 Flash, and Text-to-Speech (TTS) using the browser's built-in `window.speechSynthesis`. This simple, swappable architecture, where each component has both free and premium options, is the fundamental pattern behind modern voice AI systems.

Open Siri. Ask it a question. Listen to the reply. That whole experience, the magic that powers Alexa, ChatGPT voice mode, every car assistant, and every drive-through screen, is three steps glued together:

- Turn microphone audio into text.
- Send the text to a brain.
- Turn the brain's reply back into audio.

That's it. The whole industry of voice AI is variations on those three boxes. Different brains, different microphones, different voices, but the shape is identical.

Today I'm going to build the whole thing in your browser. No server. No install. No API key except a single free one. Open the tab, click the mic, talk to an AI. Total code: about 200 lines.

The pattern is the actual lesson. Once you see it, you can replace any box with a fancier one (Whisper for transcription, ElevenLabs for voices, your own fine-tuned model in the middle) and the architecture doesn't change.

## The three Lego bricks

Let me name them with the boring acronyms so you can search for them later:

STT (Speech-to-Text). Microphone audio → string of words. The expensive option is OpenAI Whisper (best accuracy, costs roughly half a cent per minute through OpenAI's hosted API). The free option, which I'm using here, is the Web Speech API, which has shipped in Chrome since 2013. You give it a microphone permission and it gives you back text.
Zero key, and no upload code to write: Chrome sends the audio to Google's recognizer behind the scenes for you. It's slightly less accurate than Whisper, especially on accents, but for a learning demo the difference doesn't matter.

LLM (the brain). This is the part everyone gets excited about. You hand a string to a Large Language Model and it hands a string back. ChatGPT, Claude, Gemini: they all expose the same shape. Send a list of messages, get a message back. I'm using Gemini 2.5 Flash because Google's free tier allows about 15 requests per minute at the time of writing. Beginners shouldn't have to wave a credit card to learn how this works.

TTS (Text-to-Speech). String → audio you can play. The fancy option is ElevenLabs, whose voices are so good they sound uncanny. The free, zero-key option is `window.speechSynthesis`, which has shipped in Chrome since 2014 and is now available in every major browser. It sounds robotic, but it's instant and it costs nothing.

Notice the pattern: every brick has an expensive flavor and a free flavor. The inputs and outputs are the same either way (audio in, text out; text in, text out; text in, audio out), so you can swap one for the other without changing the architecture. That's why this is worth learning.

## Wiring the loop

Here's the entire pipeline in pseudocode:

```
state = "idle"
while user wants to talk:
    state = "listening"
    text  = await STT.listen()     # mic open until silence
    state = "thinking"
    reply = await LLM.ask(text)    # 1-2 seconds typically
    state = "speaking"
    await TTS.say(reply)           # plays through speakers
    state = "idle"
```

The state machine matters more than you'd think. If the user clicks the mic while the assistant is still talking, you need to cancel the playback. If they click while the LLM is still thinking, you need to ignore the click. UIs get confusing fast when you have four states and one button. I'll show you the React version in a minute.

## The STT brick

The browser ships a class called `SpeechRecognition` (with a `webkit` prefix, `webkitSpeechRecognition`, in Safari and in most Chrome versions).
The API is event-based, not promise-based, which is a little annoying, but the pattern is straightforward:

```js
// Fall back to the webkit-prefixed constructor where the plain name is missing.
const Recognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;

const rec = new Recognition();
rec.lang = "en-US";
rec.continuous = true;      // keep mic open across pauses
rec.interimResults = true;  // stream partials while user talks

// onFinal and onPartial are callbacks supplied by the caller.
rec.onresult = (e) => {
  for (let i = e.resultIndex; i < e.results.length; i++) {
    const r = e.results[i];
    if (r.isFinal) onFinal(r[0].transcript);
    else onPartial(r[0].transcript);
  }
};

rec.start();
```

Two things to notice. First, `interimResults` is a gift. It streams text while the user is still talking, so you can show "you're saying..." in real time. It feels alive instead of laggy. Second, `resultIndex` lets you walk only the results that are new since the last event. The browser keeps the whole session's results in the `results` array, but you usually only care about what's new.

## The LLM brick

Google's SDK makes this almost embarrassingly short:

```js
import { GoogleGenerativeAI } from "@google/generative-ai";

const ai = new GoogleGenerativeAI(API_KEY);
const model = ai.getGenerativeModel({
  model: "gemini-2.5-flash",
  systemInstruction: "Reply in 1-3 short sentences. No markdown.",
  generationConfig: { maxOutputTokens: 200 },
});

// history is the list of earlier messages in this conversation.
const chat = model.startChat({ history });
const result = await chat.sendMessage(userText);
const reply = result.response.text();
```

Two design choices worth calling out.

System prompt. I tell the model to keep answers to one to three short sentences, which works out to well under 60 words. Why? Because the TTS will read every word. If Gemini writes a Wikipedia paragraph, your user is going to sit through 90 seconds of robot voice waiting for the next chance to talk. Voice AIs need to be terser than text AIs. This is a real lesson: half of building voice products is wrestling the model down to a sentence or two.

`maxOutputTokens`. A hard ceiling. Even if the model decides to ignore the system prompt and ramble, this cuts it off. Belt and suspenders.
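A token ceiling can cut a reply off mid-sentence, and a TTS engine will happily read the dangling half aloud. One way to handle that is a small client-side guard that runs just before the TTS step. The sketch below is not from the original project: `clampForSpeech` and its 60-word default are hypothetical, and the sentence splitting is deliberately naive (it will mis-split abbreviations and decimals).

```javascript
// Hypothetical helper, not part of the original project.
// Keeps only whole sentences, up to maxWords words, so the TTS never
// reads a half-finished sentence left behind by a token cutoff.
function clampForSpeech(reply, maxWords = 60) {
  // Drop common markdown markers and collapse whitespace.
  const plain = reply.replace(/[*_`#]/g, "").replace(/\s+/g, " ").trim();

  // Naive split: a sentence is any run of text ending in . ! or ?
  // A trailing fragment with no terminator is not matched, so it is dropped.
  const sentences = plain.match(/[^.!?]+[.!?]+/g) || [];

  const kept = [];
  let words = 0;
  for (const raw of sentences) {
    const sentence = raw.trim();
    const count = sentence.split(" ").length;
    if (words + count > maxWords) break; // budget exhausted
    kept.push(sentence);
    words += count;
  }

  // No complete sentence fit the budget: fall back to a hard word cut.
  if (kept.length === 0) {
    return plain.split(" ").slice(0, maxWords).join(" ");
  }
  return kept.join(" ");
}
```

Calling `clampForSpeech(reply)` on the model's text just before it reaches the TTS brick keeps the spoken reply to whole sentences, whatever the model or the token limit did upstream.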
## The TTS brick

```js
const u = new SpeechSynthesisUtterance(text);
u.lang = "en-US";
u.rate = 1.0;
u.voice = bestVoiceFor("en-US"); // helper that picks the preferred voice for the language
speechSynthesis.cancel();        // kill anything currently playing
speechSynthesis.speak(u);
```

The one gotcha: in Chrome, `speechSynthesis.getVoices()` returns an empty array the first time you call it. Voices load asynchronously and Chrome fires a `voiceschanged` event when they're ready. So I wrap voice-loading in a one-shot promise that callers can await. Otherwise your first reply plays in the browser's default voice instead of the nice Google one.

## Wiring it in React

The whole React component is a state machine over `phase: "idle" | "listening" | "thinking" | "speaking"` and a list of messages.

const [phase, setPhase] = useState