{"slug": "building-a-real-time-ai-voice-agent-with-openai-realtime-api-and-next-js", "title": "Building a Real-Time AI Voice Agent with OpenAI Realtime API and Next.js", "summary": "A developer built a real-time AI voice agent using Next.js, WebRTC, and OpenAI's Realtime API, achieving sub-800ms latency for natural conversational interactions. The system captures audio via the Web Audio API, streams it over WebSockets, and uses OpenAI's streaming model for speech-to-text and text-to-speech. It includes function calling for business tasks like booking and support, and session-based memory to maintain context.", "body_md": "Voice interfaces are rapidly becoming the next major interaction layer after mobile and web UI. Instead of clicking, users will increasingly talk to systems that understand intent, context, and can execute actions in real time.\n\nIn this article, we’ll build a production-grade architecture for a real-time AI voice system using modern web technologies such as Next.js, WebRTC, and OpenAI’s streaming capabilities.\n\nWe’ll also explore how this architecture powers modern conversational systems like an [AI Voice Agent](https://loxiaai.com/en) platform, where AI can handle real-time interactions for business use cases like bookings, support, and sales automation.\n\nText-based chatbots solved the first wave of automation. 
But voice introduces:\n\nFaster interaction (no typing)\n\nHigher emotional expressiveness\n\nBetter accessibility\n\nNatural multitasking\n\nBusinesses are now adopting systems like [Voice AI for Business](https://loxiaai.com/en) to replace traditional call centers and static IVR menus.\n\nThe key challenge is not just speech-to-text, but building a low-latency conversational loop that feels human.\n\nA production-ready AI voice system typically consists of:\n\nFrontend (Next.js)\n\nAudio capture via Web Audio API\n\nStreaming audio chunks\n\nUI for conversation state\n\nBackend (Node.js / Edge Functions)\n\nSession management\n\nAuthentication\n\nTool execution layer\n\nAI Layer\n\nOpenAI Realtime API (streaming)\n\nFunction calling\n\nContext memory\n\nAudio Pipeline\n\nSpeech-to-text streaming\n\nText-to-speech streaming\n\nOptional noise cancellation\n\nThe core of a voice agent is a continuous loop:\n\nUser speaks\n\nAudio is streamed to the server\n\nModel transcribes in real time\n\nModel generates response token-by-token\n\nResponse is converted to audio instantly\n\nAudio is played back with minimal delay\n\nThe goal is to keep latency under ~800ms for a natural experience.\n\nWe start by capturing microphone input:\n\n``` js\n// Ask the browser for microphone access.\nconst stream = await navigator.mediaDevices.getUserMedia({ audio: true });\n\nconst audioContext = new AudioContext();\nconst source = audioContext.createMediaStreamSource(stream);\n// Note: ScriptProcessorNode is deprecated; AudioWorklet is the recommended replacement for new code.\nconst processor = audioContext.createScriptProcessor(4096, 1, 1);\n\nsource.connect(processor);\nprocessor.connect(audioContext.destination);\n\n// Fires once per 4096-sample buffer of mono audio.\nprocessor.onaudioprocess = (event) => {\n  const input = event.inputBuffer.getChannelData(0);\n  sendAudioChunk(input);\n};\n```\n\nThis allows us to continuously stream audio chunks to the backend.\n\nWe use WebSockets for low-latency communication:\n\n``` js\nconst socket = new WebSocket(\"wss://your-server.com/audio\");\n\n// chunk is a Float32Array of PCM samples from the audio callback.\nfunction sendAudioChunk(chunk) {\n  // Drop chunks until the socket is open; send() throws while it is still connecting.\n  if (socket.readyState !== WebSocket.OPEN) return;\n  socket.send(JSON.stringify({\n    type: 
\"audio_chunk\",\n    data: Array.from(chunk)\n  }));\n}\n```\n\nOn the server, we reconstruct the stream and forward it to the AI layer.\n\nThe core intelligence layer is powered by streaming model responses.\n\n``` js\nconst response = await openai.realtime.createSession({\n  model: \"gpt-5-realtime\",\n  modalities: [\"text\", \"audio\"],\n  instructions: `\n    You are a voice assistant for a business.\n    Be concise, natural, and conversational.\n  `\n});\n```\n\nThen we pipe:\n\nincoming audio → model\n\nmodel output → audio stream\n\nA voice agent becomes truly useful only when it can do things, not just talk.\n\nExample tools:\n\n``` js\nconst tools = [\n  {\n    name: \"check_availability\",\n    description: \"Check availability of a service\",\n    parameters: {\n      type: \"object\",\n      properties: {\n        date: { type: \"string\" },\n        service: { type: \"string\" }\n      }\n    }\n  }\n];\n```\n\nWhen the model detects intent, it calls tools automatically.\n\nThis is exactly how modern systems like AI-driven hospitality assistants operate behind the scenes.\n\nA serious limitation of naive voice bots is memory loss.\n\nWe solve this using:\n\n```\nSession-based memory\nSummarized conversation state\nStructured context injection\nconst sessionContext = {\n  userId,\n  historySummary,\n  preferences,\n  lastActions\n};\n```\n\nInstead of sending full transcripts, we compress context intelligently.\n\nLatency is everything in voice AI.\n\nTechniques:\n\nEven 200ms improvement significantly increases perceived “human-likeness”.\n\nWhen moving beyond prototypes:\n\nQueue system\n\nUse Redis or Kafka for audio buffering.\n\nHorizontal scaling\n\nStateless WebSocket servers.\n\nSession routing\n\nSticky sessions or session ID routing.\n\nMonitoring\n\nTrack:\n\nlatency per segment\n\ndrop rate\n\ntoken generation speed\n\nVoice systems handle sensitive data:\n\nEncrypt audio streams\n\nAvoid storing raw audio by default\n\nUse token-based 
authentication\n\nRate limit sessions\n\nThis architecture powers:\n\nCustomer support\n\nautomated FAQs\n\nticket creation\n\nSales assistants\n\nproduct recommendations\n\nlead qualification\n\nHospitality systems\n\nPlatforms like AI Voice Agent are used to replace front-desk interactions in hotels.\n\nE-commerce assistants\n\nVoice-based product discovery and checkout flows.\n\nTraditional chatbots:\n\nrequest → response\n\nhigh latency\n\nno voice continuity\n\nReal-time voice agents:\n\ncontinuous stream\n\ninterruptible responses\n\nemotional tone handling\n\naction execution\n\nThis is a fundamentally different system design.\n\n```\nMicrophone\n   ↓\nNext.js Client\n   ↓ (WebSocket stream)\nEdge Gateway\n   ↓\nRealtime AI Engine\n   ↓\nFunction Calling Layer\n   ↓\nExternal APIs (CRM, Booking, Payments)\n   ↓\nAudio Response Stream\n   ↓\nUser\n```\n\nBuilding a real-time voice AI system is no longer experimental—it’s becoming infrastructure.\n\nThe combination of streaming models, function calling, and modern web technologies makes it possible to build systems that behave less like software and more like digital operators.\n\nThe next step is not just building smarter bots, but building systems that can act in real time on behalf of users.", "url": "https://wpnews.pro/news/building-a-real-time-ai-voice-agent-with-openai-realtime-api-and-next-js", "canonical_source": "https://dev.to/loxia_ai/building-a-real-time-ai-voice-agent-with-openai-realtime-api-and-nextjs-bhn", "published_at": "2026-06-29 06:38:01+00:00", "updated_at": "2026-06-29 06:56:59.322507+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-agents", "developer-tools", "natural-language-processing"], "entities": ["OpenAI", "Next.js", "WebRTC", "Web Audio API", "WebSockets", "Node.js", "Loxia AI"], "alternates": {"html": "https://wpnews.pro/news/building-a-real-time-ai-voice-agent-with-openai-realtime-api-and-next-js", "markdown": 
"https://wpnews.pro/news/building-a-real-time-ai-voice-agent-with-openai-realtime-api-and-next-js.md", "text": "https://wpnews.pro/news/building-a-real-time-ai-voice-agent-with-openai-realtime-api-and-next-js.txt", "jsonld": "https://wpnews.pro/news/building-a-real-time-ai-voice-agent-with-openai-realtime-api-and-next-js.jsonld"}}