Building a Real-Time AI Voice Agent with OpenAI Realtime API and Next.js

A developer built a real-time AI voice agent using Next.js, WebRTC, and OpenAI's Realtime API, achieving sub-800ms latency for natural conversational interactions. The system captures audio via the Web Audio API, streams it over WebSockets, and uses OpenAI's streaming model for speech-to-text and text-to-speech. It includes function calling for business tasks like booking and support, and session-based memory to maintain context.

Voice interfaces are rapidly becoming the next major interaction layer after mobile and web UI. Instead of clicking, users will increasingly talk to systems that understand intent, context, and can execute actions in real time.

In this article, we’ll build a production-grade architecture for a real-time AI voice system using modern web technologies such as Next.js, WebRTC, and OpenAI’s streaming capabilities. We’ll also explore how this architecture powers modern conversational systems like an AI Voice Agent (https://loxiaai.com/en) platform, where AI can handle real-time interactions for business use cases like bookings, support, and sales automation.

Text-based chatbots solved the first wave of automation. But voice introduces:

- Faster interaction (no typing)
- Higher emotional expressiveness
- Better accessibility
- Natural multitasking

Businesses are now adopting systems like Voice AI for Business (https://loxiaai.com/en) to replace traditional call centers and static IVR menus. The key challenge is not just speech-to-text, but building a low-latency conversational loop that feels human.

A production-ready AI voice system typically consists of:

- Frontend (Next.js)
  - Audio capture via Web Audio API
  - Streaming audio chunks
  - UI for conversation state
- Backend (Node.js / Edge Functions)
  - Session management
  - Authentication
  - Tool execution layer
- AI Layer
  - OpenAI Realtime API (streaming)
  - Function calling
  - Context memory
- Audio Pipeline
  - Speech-to-text streaming
  - Text-to-speech streaming
  - Optional noise cancellation

The core of a voice agent is a continuous loop:

1. User speaks
2. Audio is streamed to server
3. Model transcribes in real time
4. Model generates response token-by-token
5. Response is converted to audio instantly
6. Audio is played back with minimal delay

The goal is to keep latency under ~800ms for a natural experience.

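To check a conversational turn against the ~800ms goal, it helps to time each stage of the loop separately. The following is a minimal sketch and not part of the system described here: `createTurnTimer`, the stage names, and the injectable `now` clock are hypothetical names chosen for illustration.

```js
// Hypothetical helper: records when each stage of one conversational turn
// finishes and compares the total against a latency budget.
const LATENCY_BUDGET_MS = 800; // target mentioned in the article

function createTurnTimer(now = () => performance.now()) {
  // The first mark is the moment the turn starts.
  const marks = [{ stage: "start", at: now() }];

  return {
    // Call when a stage (for example "transcription") has finished.
    mark(stage) {
      marks.push({ stage, at: now() });
    },

    // Returns per-stage durations, the total, and a budget check.
    report(budgetMs = LATENCY_BUDGET_MS) {
      const stages = marks.slice(1).map((m, i) => ({
        stage: m.stage,
        ms: m.at - marks[i].at, // marks[i] is the previous mark
      }));
      const totalMs = marks[marks.length - 1].at - marks[0].at;
      return { stages, totalMs, withinBudget: totalMs <= budgetMs };
    },
  };
}
```

Calling `mark()` as transcription, generation, and speech synthesis complete, and then `report()`, shows which stage consumes most of the budget for a given turn.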
We start by capturing microphone input:

```js
// Request microphone access and feed it into a Web Audio graph.
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext();
const source = audioContext.createMediaStreamSource(stream);

// ScriptProcessorNode is deprecated in favor of AudioWorklet, but it keeps
// the example short: 4096-sample buffer, 1 input channel, 1 output channel.
const processor = audioContext.createScriptProcessor(4096, 1, 1);

source.connect(processor);
processor.connect(audioContext.destination);

processor.onaudioprocess = (event) => {
  // Float32Array of samples for the first (mono) channel.
  const input = event.inputBuffer.getChannelData(0);
  sendAudioChunk(input);
};
```

This allows us to continuously stream audio chunks to the backend. We use WebSockets for low-latency communication:

```js
const socket = new WebSocket("wss://your-server.com/audio");

// chunk is a Float32Array of audio samples.
function sendAudioChunk(chunk) {
  // Skip chunks until the connection is open.
  if (socket.readyState !== WebSocket.OPEN) return;

  socket.send(
    JSON.stringify({
      type: "audio_chunk",
      data: Array.from(chunk),
    })
  );
}
```

On the server, we reconstruct the stream and forward it to the AI layer. The core intelligence layer is powered by streaming model responses.

```js
// Illustrative call: confirm the exact method and model names against the
// current OpenAI SDK documentation before use.
const response = await openai.realtime.createSession({
  model: "gpt-5-realtime",
  modalities: ["text", "audio"],
  instructions: `
    You are a voice assistant for a business.
    Be concise, natural, and conversational.
  `,
});
```

Then we pipe:

- incoming audio → model
- model output → audio stream

A voice agent becomes truly useful only when it can do things, not just talk. Example tools:

```js
const tools = [
  {
    type: "function",
    name: "check_availability",
    description: "Check availability of a service",
    parameters: {
      type: "object",
      properties: {
        date: { type: "string" },
        service: { type: "string" },
      },
    },
  },
];
```

When the model detects a matching intent, it emits a call to the relevant tool, which the backend then executes. This is the same pattern that AI-driven hospitality assistants use behind the scenes.

A serious limitation of naive voice bots is memory loss. We solve this using:

- Session-based memory
- Summarized conversation state
- Structured context injection

```js
const sessionContext = {
  userId,
  historySummary,
  preferences,
  lastActions,
};
```

Instead of sending full transcripts, we send a compressed summary of the conversation state.

Latency is everything in voice AI.

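One common way to reduce per-chunk overhead is to send 16-bit PCM as a binary WebSocket frame instead of a JSON array of floats. The sketch below assumes the server accepts raw 16-bit PCM frames in the platform's byte order, which the article does not specify; `floatTo16BitPCM` and `sendPcmChunk` are hypothetical helper names.

```js
// Convert Web Audio samples (floats in [-1, 1]) to 16-bit signed PCM.
function floatTo16BitPCM(samples) {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range samples do not wrap around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Negative values scale to -32768, positive values to 32767.
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return pcm;
}

const WS_OPEN = 1; // numeric value of WebSocket.OPEN

// Send one chunk as a binary frame. Returns false if the socket is not open.
// Assumption: the receiving server expects raw PCM16 binary messages.
function sendPcmChunk(socket, samples) {
  if (socket.readyState !== WS_OPEN) return false;
  socket.send(floatTo16BitPCM(samples).buffer);
  return true;
}
```

A 4096-sample chunk is 8,192 bytes in this form. Inside `onaudioprocess`, a call such as `sendPcmChunk(socket, input)` would take the place of the JSON-based send.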
Techniques:

Even a 200ms improvement significantly increases perceived “human-likeness”.

When moving beyond prototypes:

- Queue system: use Redis or Kafka for audio buffering.
- Horizontal scaling: stateless WebSocket servers.
- Session routing: sticky sessions or session ID routing.
- Monitoring: track latency per segment, drop rate, and token generation speed.

Voice systems handle sensitive data:

- Encrypt audio streams
- Avoid storing raw audio by default
- Use token-based authentication
- Rate limit sessions

This architecture powers:

- Customer support: automated FAQs, ticket creation
- Sales assistants: product recommendations, lead qualification
- Hospitality systems: platforms like AI Voice Agent are used to replace front-desk interactions in hotels.
- E-commerce assistants: voice-based product discovery and checkout flows.

Traditional chatbots:

- request → response
- high latency
- no voice continuity

Real-time voice agents:

- continuous stream
- interruptible responses
- emotional tone handling
- action execution

This is a fundamentally different system design.

    Microphone
       ↓
    Next.js Client
       ↓ (WebSocket stream)
    Edge Gateway
       ↓
    Realtime AI Engine
       ↓
    Function Calling Layer
       ↓
    External APIs (CRM, Booking, Payments)
       ↓
    Audio Response Stream
       ↓
    User

Building a real-time voice AI system is no longer experimental; it’s becoming infrastructure. The combination of streaming models, function calling, and modern web technologies makes it possible to build systems that behave less like software and more like digital operators. The next step is not just building smarter bots, but building systems that can act in real time on behalf of users.