Source: dev.to · topic: artificial-intelligence

Building a Real-Time AI Voice Agent with OpenAI Realtime API and Next.js

A developer built a real-time AI voice agent using Next.js, WebRTC, and OpenAI's Realtime API, achieving sub-800ms latency for natural conversational interactions. The system captures audio via the Web Audio API, streams it over WebSockets, and uses OpenAI's streaming model for speech-to-text and text-to-speech. It includes function calling for business tasks like booking and support, and session-based memory to maintain context.

4 min read · published Jun 29, 2026

Voice interfaces are rapidly becoming the next major interaction layer after mobile and web UI. Instead of clicking, users will increasingly talk to systems that understand intent and context and can execute actions in real time.

In this article, we’ll build a production-grade architecture for a real-time AI voice system using modern web technologies such as Next.js, WebRTC, and OpenAI’s streaming capabilities.

We’ll also explore how this architecture powers modern conversational systems like an AI Voice Agent platform, where AI can handle real-time interactions for business use cases like bookings, support, and sales automation.

Text-based chatbots solved the first wave of automation. But voice introduces:

Faster interaction (no typing)

Higher emotional expressiveness

Better accessibility

Natural multitasking

Businesses are now adopting systems like Voice AI for Business to replace traditional call centers and static IVR menus.

The key challenge is not just speech-to-text, but building a low-latency conversational loop that feels human.

A production-ready AI voice system typically consists of:

Frontend (Next.js): audio capture via the Web Audio API, streaming of audio chunks, and UI for conversation state.

Backend (Node.js / Edge Functions): session management, authentication, and the tool execution layer.

AI layer: OpenAI Realtime API (streaming), function calling, and context memory.

Audio pipeline: speech-to-text streaming, text-to-speech streaming, and optional noise cancellation.

The core of a voice agent is a continuous loop:

User speaks

Audio is streamed to server

Model transcribes in real time

Model generates response token-by-token

Response is converted to audio instantly

Audio is played back with minimal delay

The goal is to keep latency under ~800ms for a natural experience.
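One way to reason about the ~800ms target is to split it into a per-stage budget and check that the stages add up. The sketch below does only that; the stage names and millisecond values are illustrative assumptions, not measurements from the article.

```typescript
// Hypothetical per-stage latency budget in milliseconds (illustrative numbers, not measurements).
type StageBudget = { stage: string; ms: number };

const budget: StageBudget[] = [
  { stage: "capture + chunking", ms: 100 },
  { stage: "network uplink", ms: 80 },
  { stage: "transcription + first token", ms: 350 },
  { stage: "speech synthesis (first audio)", ms: 150 },
  { stage: "network downlink + playback", ms: 100 },
];

// Sum all stages to get the end-to-end estimate.
function totalLatency(stages: StageBudget[]): number {
  return stages.reduce((sum, s) => sum + s.ms, 0);
}

// Compare the estimate against the target (defaults to the ~800 ms goal).
function withinTarget(stages: StageBudget[], targetMs = 800): boolean {
  return totalLatency(stages) <= targetMs;
}
```

With these assumed numbers the total is 780 ms, which leaves little headroom; tightening any single stage changes whether the loop fits the target.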

We start by capturing microphone input:

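// Ask the browser for microphone access (prompts the user on first use).
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

const audioContext = new AudioContext();
const source = audioContext.createMediaStreamSource(stream);
// ScriptProcessorNode is deprecated in favor of AudioWorklet; it is used here for brevity.
// Arguments: buffer size in samples, input channels, output channels.
const processor = audioContext.createScriptProcessor(4096, 1, 1);

source.connect(processor);
// Some browsers only fire onaudioprocess when the node is connected to a destination.
processor.connect(audioContext.destination);

processor.onaudioprocess = (event) => {
  // Mono input: channel 0 holds Float32 samples in the range -1..1.
  const input = event.inputBuffer.getChannelData(0);
  sendAudioChunk(input);
};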

This allows us to continuously stream audio chunks to the backend.
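The capture code produces Float32 samples in the range -1..1, while speech backends commonly expect 16-bit PCM. The following is a minimal sketch of that conversion, assuming the backend wants signed 16-bit samples; `floatTo16BitPCM` is a hypothetical helper name, not part of the original post.

```typescript
// Convert Web Audio float samples (-1..1) to 16-bit signed PCM.
function floatTo16BitPCM(input: Float32Array): Int16Array {
  return Int16Array.from(input, (sample) => {
    // Clamp first so out-of-range samples do not wrap around.
    const s = Math.max(-1, Math.min(1, sample));
    // Negative values scale by 32768 and positive values by 32767 to fill the int16 range.
    return s < 0 ? Math.round(s * 0x8000) : Math.round(s * 0x7fff);
  });
}
```

Halving the bytes per sample this way also reduces the payload sent with each chunk.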

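We use WebSockets for low-latency communication:

const socket = new WebSocket("wss://your-server.com/audio");

function sendAudioChunk(chunk: Float32Array) {
  // Skip chunks until the connection is open; send() throws while still connecting.
  if (socket.readyState !== WebSocket.OPEN) return;
  // JSON keeps the example simple; a binary frame (socket.send(chunk)) is far more compact.
  socket.send(JSON.stringify({
    type: "audio_chunk",
    data: Array.from(chunk)
  }));
}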

On the server, we reconstruct the stream and forward it to the AI layer.
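A minimal sketch of the server-side reconstruction, assuming the JSON message shape the client sends (`type: "audio_chunk"` with a numeric `data` array). The function names `parseAudioChunk` and `concatChunks` are hypothetical, and the WebSocket server wiring is left out.

```typescript
// Message shape sent by the client.
type AudioChunkMessage = { type: "audio_chunk"; data: number[] };

// Parse one WebSocket text frame; returns null for anything that is not a valid audio chunk.
function parseAudioChunk(raw: string): Float32Array | null {
  let msg: unknown;
  try {
    msg = JSON.parse(raw);
  } catch {
    return null; // malformed JSON
  }
  if (typeof msg !== "object" || msg === null) return null;
  const { type, data } = msg as Partial<AudioChunkMessage>;
  if (type !== "audio_chunk" || !Array.isArray(data)) return null;
  // Reject payloads containing non-numeric or non-finite samples.
  if (!data.every((v) => typeof v === "number" && Number.isFinite(v))) return null;
  return Float32Array.from(data);
}

// Concatenate chunks in arrival order to rebuild a contiguous sample buffer.
function concatChunks(chunks: Float32Array[]): Float32Array {
  const total = chunks.reduce((n, c) => n + c.length, 0);
  const out = new Float32Array(total);
  let offset = 0;
  for (const c of chunks) {
    out.set(c, offset);
    offset += c.length;
  }
  return out;
}
```

Validating each frame before forwarding keeps malformed client input from reaching the AI layer.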

The core intelligence layer is powered by streaming model responses.

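// Illustrative call shape as given in the original post; the exact method name and
// model ID depend on your OpenAI SDK version, so check the Realtime API reference.
const response = await openai.realtime.createSession({
  model: "gpt-5-realtime",
  modalities: ["text", "audio"],
  instructions: `
    You are a voice assistant for a business.
    Be concise, natural, and conversational.
  `
});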

Then we pipe:

incoming audio → model

model output → audio stream

A voice agent becomes truly useful only when it can do things, not just talk.

Example tools:

const tools = [
  {
    name: "check_availability",
    description: "Check availability of a service",
    parameters: {
      type: "object",
      properties: {
        date: { type: "string" },
        service: { type: "string" }
      }
    }
  }
];

When the model detects intent, it calls tools automatically.
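Once the model emits a tool call, the backend still has to execute it and return a result. The sketch below shows one possible dispatch step, assuming the call arrives as a tool name plus a JSON-encoded argument string; the `ToolCall` shape, `dispatchToolCall`, and the in-memory availability data are hypothetical stand-ins.

```typescript
// Assumed shape of a tool call: a tool name plus JSON-encoded arguments.
type ToolCall = { name: string; arguments: string };
type ToolHandler = (args: Record<string, unknown>) => unknown;

// Hypothetical in-memory handler; a real one would query a booking system.
const handlers: Record<string, ToolHandler> = {
  check_availability: (args) => {
    const { date, service } = args;
    if (typeof date !== "string" || typeof service !== "string") {
      return { error: "date and service must be strings" };
    }
    const bookedDates = ["2026-07-01"]; // stand-in data for the sketch
    return { service, date, available: bookedDates.indexOf(date) === -1 };
  },
};

// Look up the handler, parse the arguments, and return a JSON string for the model.
function dispatchToolCall(call: ToolCall): string {
  const handler = handlers[call.name];
  if (!handler) return JSON.stringify({ error: `unknown tool: ${call.name}` });
  try {
    const args = JSON.parse(call.arguments) as Record<string, unknown>;
    return JSON.stringify(handler(args));
  } catch {
    return JSON.stringify({ error: "invalid arguments" });
  }
}
```

Returning errors as structured results, rather than throwing, lets the model explain the problem to the caller in conversation.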

This is exactly how modern systems like AI-driven hospitality assistants operate behind the scenes.

A serious limitation of naive voice bots is memory loss.

We solve this using:

Session-based memory

Summarized conversation state

Structured context injection

const sessionContext = {
  userId,
  historySummary,
  preferences,
  lastActions
};

Instead of sending full transcripts, we compress context intelligently.
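As a rough illustration of compressed context, the sketch below renders the session state as a short text block and keeps only the most recent actions. The field types and the `buildContextBlock` helper are assumptions; the original post does not specify how the summary is produced or injected.

```typescript
// Assumed field types for the session context object.
type SessionContext = {
  userId: string;
  historySummary: string;
  preferences: Record<string, string>;
  lastActions: string[];
};

// Render the session state as a short text block to prepend to the model instructions.
// Only the most recent actions are kept, so the prompt stays bounded as the session grows.
function buildContextBlock(ctx: SessionContext, maxActions = 3): string {
  const prefs = Object.keys(ctx.preferences)
    .sort() // stable key order makes the output deterministic
    .map((k) => `${k}=${ctx.preferences[k]}`)
    .join(", ");
  const recent = maxActions > 0 ? ctx.lastActions.slice(-maxActions).join("; ") : "";
  return [
    `User: ${ctx.userId}`,
    `Summary: ${ctx.historySummary}`,
    `Preferences: ${prefs || "none"}`,
    `Recent actions: ${recent || "none"}`,
  ].join("\n");
}
```

The block size depends on the summary length and `maxActions`, not on how long the conversation has run.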

Latency is everything in voice AI.

Techniques:

Even a 200ms improvement can noticeably increase perceived “human-likeness”.

When moving beyond prototypes:

Queue system: use Redis or Kafka for audio buffering.

Horizontal scaling: keep WebSocket servers stateless.

Session routing: use sticky sessions or route by session ID.
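Routing by session ID can be as simple as hashing the ID onto a fixed list of servers, so that every chunk from one session lands on the same instance. This is a minimal sketch using the FNV-1a hash; the `pickServer` helper and the server names are hypothetical, and a load balancer may already offer this feature.

```typescript
// FNV-1a 32-bit hash of a string; small, dependency-free, and stable across processes.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep 32 bits unsigned
  }
  return hash >>> 0;
}

// Map a session ID to one server from a fixed list.
function pickServer(sessionId: string, servers: string[]): string {
  if (servers.length === 0) throw new Error("no servers configured");
  return servers[fnv1a(sessionId) % servers.length];
}
```

Plain modulo routing reshuffles most sessions when the server count changes; consistent hashing is the usual refinement if instances are added or removed often.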

Monitoring: track latency per segment, drop rate, and token generation speed.
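The three metrics above can be computed from simple counters and timing samples. The helpers below are a sketch under that assumption; the names are hypothetical, and the percentile uses the nearest-rank method, which is one of several common definitions.

```typescript
// Nearest-rank percentile over latency samples in milliseconds; p is in (0, 100].
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Fraction of audio chunks that were sent but never received.
function dropRate(sent: number, received: number): number {
  return sent === 0 ? 0 : (sent - received) / sent;
}

// Tokens per second over a measured generation window.
function tokensPerSecond(tokens: number, elapsedMs: number): number {
  return elapsedMs <= 0 ? 0 : (tokens * 1000) / elapsedMs;
}
```

Tracking a high percentile such as p95 alongside the median tends to be more informative for voice, since occasional slow turns are what users notice.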

Voice systems handle sensitive data:

Encrypt audio streams

Avoid storing raw audio by default

Use token-based authentication

Rate limit sessions
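For the rate-limiting item, a fixed-window counter per user or API key is often enough as a first line of defense. The class below is a minimal in-memory sketch; `SessionRateLimiter` is a hypothetical name, and a multi-instance deployment would need shared storage instead of a local `Map`.

```typescript
// Fixed-window limiter: at most `limit` new sessions per key within each `windowMs` window.
class SessionRateLimiter {
  private windows = new Map<string, { start: number; count: number }>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // `now` is injectable so the limiter can be tested without real timers.
  allow(key: string, now: number = Date.now()): boolean {
    const w = this.windows.get(key);
    if (!w || now - w.start >= this.windowMs) {
      // First request for this key, or the previous window has expired: start a new window.
      this.windows.set(key, { start: now, count: 1 });
      return this.limit >= 1;
    }
    if (w.count >= this.limit) return false;
    w.count += 1;
    return true;
  }
}
```

A caller would check `limiter.allow(userId)` before creating a new voice session and reject the request when it returns `false`.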

This architecture powers:

Customer support: automated FAQs and ticket creation.

Sales assistants: product recommendations and lead qualification.

Hospitality systems: platforms like AI Voice Agent are used to replace front-desk interactions in hotels.

E-commerce assistants: voice-based product discovery and checkout flows.

Traditional chatbots: request → response, high latency, no voice continuity.

Real-time voice agents: continuous stream, interruptible responses, emotional tone handling, action execution.

This is a fundamentally different system design.
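Interruptible responses are one place where the design difference shows up in code. The sketch below is one possible way to handle a user talking over the agent: pending audio is dropped, and chunks from the interrupted response are rejected by a generation counter. `PlaybackQueue` and its methods are hypothetical, not taken from the original post.

```typescript
// Queue of synthesized audio chunks waiting to be played.
class PlaybackQueue {
  private chunks: Int16Array[] = [];
  private generation = 0;

  // Chunks carry the generation they were produced for; stale ones are rejected.
  enqueue(chunk: Int16Array, generation: number): boolean {
    if (generation !== this.generation) return false;
    this.chunks.push(chunk);
    return true;
  }

  // Next chunk to play, or undefined when the queue is empty.
  next(): Int16Array | undefined {
    return this.chunks.shift();
  }

  // Called when the user starts speaking: drop pending audio and invalidate in-flight chunks.
  interrupt(): number {
    this.chunks = [];
    this.generation += 1;
    return this.generation;
  }

  get pending(): number {
    return this.chunks.length;
  }

  get currentGeneration(): number {
    return this.generation;
  }
}
```

The generation counter matters because audio for the old response may still be arriving from the network after the interruption has been detected.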

Microphone
   ↓
Next.js Client
   ↓ (WebSocket stream)
Edge Gateway
   ↓
Realtime AI Engine
   ↓
Function Calling Layer
   ↓
External APIs (CRM, Booking, Payments)
   ↓
Audio Response Stream
   ↓
User

Building a real-time voice AI system is no longer experimental; it’s becoming infrastructure.

The combination of streaming models, function calling, and modern web technologies makes it possible to build systems that behave less like software and more like digital operators.

The next step is not just building smarter bots, but building systems that can act in real time on behalf of users.
