OpenAI GPT-Realtime-2: Complete Voice API Developer Guide (2026)

OpenAI released three new voice models on May 8, 2026, including GPT-Realtime-2, which brings GPT-5-class reasoning to voice interactions. The model features a 128K context window, reliable tool chaining, parallel tool calls, and configurable reasoning effort, enabling complex multi-step voice workflows and longer sessions. The other models, GPT-Realtime-Translate and GPT-Realtime-Whisper, address live translation and streaming speech-to-text respectively.

On May 8, 2026, OpenAI shipped three new voice models into its API, and the most significant of them changes what voice agents can actually do. GPT-Realtime-2 is the first voice model in the Realtime API family to carry GPT-5-class reasoning. That change unlocks a category of use cases that were previously impractical: complex multi-step voice workflows, reliable agentic tool calling during spoken interactions, and sessions long enough to handle real work. The other two models, GPT-Realtime-Translate and GPT-Realtime-Whisper, address two other gaps that have frustrated voice app developers since the original Realtime API launched. This guide covers all three, with the patterns and code you need to build production voice agents today.

OpenAI released these models simultaneously on May 8:

- GPT-Realtime-2: GPT-5-class reasoning for live voice conversations, with configurable reasoning effort, a 128K context window, parallel multi-tool calling, and natural interruption handling.
- GPT-Realtime-Translate: Live speech translation from 70+ input languages into 13 output languages, matching the speaker's pace with synthesized target-language audio.
- GPT-Realtime-Whisper: Streaming speech-to-text that generates transcript text live as the speaker talks, not in batches after silence detection.

Together they cover the three most common voice pipeline architectures: conversational AI, multilingual communication, and hybrid voice-plus-text workflows where you need a live transcript alongside a spoken response.

The original Realtime API used a voice-optimized model that was fast and fluent but shallow in reasoning. It struggled with multi-step logic, complex tool chains, and tasks requiring state across more than a few exchanges. It was better at sounding natural than at being correct on hard problems. GPT-Realtime-2 inverts that priority. The underlying reasoning engine belongs to the same model family as GPT-5.5, which topped the May 2026 MMLU Pro and GPQA Diamond benchmarks.
For voice agent developers, this means four concrete improvements:

- Reliable tool chaining. The model can call five tools in sequence, evaluate each result before calling the next, and maintain task context across the full chain without confabulating intermediate state.
- Parallel tool calls. GPT-Realtime-2 can issue multiple tool calls simultaneously and merge the results. A request like "book a meeting with all three of them tomorrow afternoon" fires three calendar API calls in parallel, not in sequence.
- Audible progress signals. During tool execution the model generates spoken filler matching what it's doing: "checking your calendar now" or "looking that up." This removes the dead air that made earlier voice agents feel broken during operations longer than 500ms.
- Stronger instruction following. System prompts with multi-clause constraints and conditional rules are reliably respected. Earlier Realtime models drifted from complex system prompts after four or five turns.

The previous Realtime API supported 32K tokens. That sounds large until you factor in the real cost of a voice session: every exchange (question, tool call, result, response) adds tokens to the running context. A 30-minute customer support session with moderate tool use can exceed 32K and force external state management, which adds latency and architectural complexity.

The 128K window makes 45–60 minute sessions practical without context stitching. For healthcare intake conversations, extended enterprise support workflows, and tutoring or coaching sessions, this is the change that makes the Realtime API production-viable without custom memory scaffolding. If you have been maintaining a separate summary-and-reinject loop to keep costs down, you can simplify it or, once tested, remove it entirely.

GPT-Realtime-2 supports a reasoning effort parameter with values `low`, `medium`, and `high`. This directly controls both latency and cost:

- `low`: fastest response, minimal internal chain-of-thought. Best for FAQ-style queries, simple lookups, and conversational back-and-forth with no tool calls.
- `medium`: balanced, and the default. Handles tool use and moderate task complexity reliably.
- `high`: full reasoning chain before responding. Use when correctness on complex multi-step logic matters more than response speed: medical triage, financial calculations, legal reasoning.

For most deployments, routing simple turns to `low` and tool-heavy turns to `medium` cuts costs substantially without degrading quality where it matters. This is the same effort-routing principle covered in depth in the [guide on managing agentic AI infrastructure costs](https://dev.to/blogs/agentic-ai-cost-crisis-uber-budget-multi-model-routing-optimization-2026).

The Realtime API uses WebSockets. Here is a minimal TypeScript implementation that handles session setup, tool calls, and audio streaming:

```typescript
import WebSocket from 'ws'

const OPENAI_API_KEY = process.env.OPENAI_API_KEY

async function createVoiceAgent() {
  const ws = new WebSocket('wss://api.openai.com/v1/realtime', {
    headers: {
      Authorization: `Bearer ${OPENAI_API_KEY}`,
      'OpenAI-Beta': 'realtime=v2',
    },
  })

  // Configure the session as soon as the socket opens.
  ws.on('open', () => {
    ws.send(JSON.stringify({
      type: 'session.update',
      session: {
        model: 'gpt-realtime-2',
        modalities: ['audio', 'text'],
        reasoning_effort: 'medium',
        instructions: 'You are a helpful assistant. Keep responses concise.',
        tools: [
          {
            type: 'function',
            name: 'get_weather',
            description: 'Get current weather for a location',
            parameters: {
              type: 'object',
              properties: {
                location: { type: 'string', description: 'City and country' },
              },
              required: ['location'],
            },
          },
        ],
        tool_choice: 'auto',
        turn_detection: {
          type: 'server_vad',
          threshold: 0.5,
          silence_duration_ms: 800,
        },
      },
    }))
  })

  ws.on('message', (raw: Buffer) => {
    const event = JSON.parse(raw.toString()) as Record<string, unknown>
    handleEvent(event, ws)
  })
}

function handleEvent(event: Record<string, unknown>, ws: WebSocket) {
  switch (event.type) {
    case 'response.audio.delta':
      streamAudioChunk(event.delta as string)
      break
    case 'response.function_call_arguments.done':
      // Log tool failures instead of leaving the promise rejection unhandled.
      executeTool(
        event.name as string,
        JSON.parse(event.arguments as string),
        event.call_id as string,
        ws,
      ).catch((err) => console.error('Tool execution failed:', err))
      break
    case 'error':
      console.error('Realtime error:', event.error)
      break
  }
}

async function executeTool(
  name: string,
  args: Record<string, unknown>,
  callId: string,
  ws: WebSocket,
) {
  let result: unknown = null
  if (name === 'get_weather') {
    result = await fetchWeather(args.location as string)
  }
  // Return the tool output, then ask the model to continue its response.
  ws.send(JSON.stringify({
    type: 'conversation.item.create',
    item: { type: 'function_call_output', call_id: callId, output: JSON.stringify(result) },
  }))
  ws.send(JSON.stringify({ type: 'response.create' }))
}

// Supplied by your application: audio playback and the weather lookup.
declare function streamAudioChunk(delta: string): void
declare function fetchWeather(location: string): Promise<unknown>

createVoiceAgent()
```

Two details worth noting: the `OpenAI-Beta: realtime=v2` header is required for GPT-Realtime-2 features. Without it the endpoint falls back to v1 behavior and parallel tool calls will not work. The `server_vad` turn detection lets the API detect when the user has finished speaking, so you do not have to implement silence detection client-side; `threshold` and `silence_duration_ms` tune it on the server.

GPT-Realtime-Translate handles a specific and previously under-served workflow: real-time spoken translation where the translated speech keeps pace with the original speaker rather than lagging behind in batch segments.
The model supports 70+ input languages and translates into 13 output languages. Output is synthesized audio in the target language with natural prosody, not a TTS read-out of a text translation. That distinction matters for user experience: a model that produces fluent spoken output in the target language feels qualitatively different from text passed through generic TTS.

```typescript
ws.send(JSON.stringify({
  type: 'session.update',
  session: {
    model: 'gpt-realtime-translate',
    modalities: ['audio'],
    translation: {
      input_language: 'auto',
      output_language: 'es',
    },
    voice: 'alloy',
  },
}))
```

Setting `input_language` to `auto` enables automatic language detection per session. For deployments where the input language is known (a French-to-English support line, for example), specifying it explicitly reduces latency by skipping the detection step. The 13 output languages are English, Spanish, French, German, Japanese, Portuguese, Italian, Dutch, Korean, Polish, Russian, Chinese (Simplified), and Arabic.

Practical use cases that are now cost-viable with this model: multilingual customer support without per-language agent variants, international voice agent deployments from a single codebase, and live event interpretation at the edge without a dedicated interpreter pool for lower-volume language pairs.

GPT-Realtime-Whisper addresses a gap that forced many teams into fragile hybrid architectures. Previous voice pipelines had to choose between batch transcription (accurate but delayed) and real-time streaming alternatives with lower accuracy and complex integration paths. GPT-Realtime-Whisper streams partial transcripts as the speaker talks, at accuracy comparable to Whisper Large V3 batch output.

This enables lower-latency hybrid workflows: display a live transcript in a UI while the voice agent is responding, run keyword-based routing decisions in real time, and log structured conversation data without waiting for each turn to complete.
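The keyword-based routing idea above could be sketched roughly as follows. This is an illustration under stated assumptions, not part of the OpenAI API: `TranscriptRouter`, the route table, and the queue names are hypothetical, and the sketch assumes your event handler passes each partial-transcript text chunk to `feed` (the event names for streaming transcripts are not given here, so they are left out).

```typescript
// Hypothetical route entry: when `keyword` appears in the transcript, notify `queue`.
type Route = { keyword: string; queue: string }

class TranscriptRouter {
  private routes: Route[]
  private text = ''                    // full transcript so far, original casing
  private fired = new Set<string>()    // queues already triggered in this session

  constructor(routes: Route[]) {
    this.routes = routes
  }

  // Feed one partial-transcript chunk; returns the queues newly triggered by it.
  // Matching runs on the accumulated text, so a keyword split across two
  // chunks ("can" + "cel") is still detected.
  feed(delta: string): string[] {
    this.text += delta
    const haystack = this.text.toLowerCase()
    const hits: string[] = []
    for (const { keyword, queue } of this.routes) {
      if (!this.fired.has(queue) && haystack.includes(keyword.toLowerCase())) {
        this.fired.add(queue)
        hits.push(queue)
      }
    }
    return hits
  }

  // Transcript accumulated so far, e.g. for logging.
  transcript(): string {
    return this.text
  }
}

// Usage with made-up chunks and queue names.
const router = new TranscriptRouter([
  { keyword: 'cancel', queue: 'retention' },
  { keyword: 'refund', queue: 'billing' },
])
console.log(router.feed('I would like to can'))        // []
console.log(router.feed('cel my plan and get a ref'))  // [ 'retention' ]
console.log(router.feed('und'))                        // [ 'billing' ]
```

Each queue fires at most once per session, which avoids repeated alerts when a keyword recurs. The sketch rescans the whole transcript on every chunk; for very long sessions you would likely scan only a trailing window.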
For customer support deployments, a live transcript makes it practical to surface agent-assist recommendations to a human supervisor watching the session, a pattern that can significantly reduce escalation rates in enterprise deployments.

Published API pricing for these models as of May 2026:

- GPT-Realtime-2: $40/hour for input audio, $80/hour for output audio at `medium` reasoning effort. `low` effort reduces input cost by approximately 40%.
- GPT-Realtime-Translate: $20/hour input, $40/hour for translated audio output.
- GPT-Realtime-Whisper: $6/hour for streaming transcription.

For a customer support deployment with an 8-minute average call duration using GPT-Realtime-2 at `medium` effort, per-call cost runs approximately $0.05–$0.11 depending on the input-to-output audio ratio. Routing simpler calls to `low` effort consistently brings this below $0.04. At 10,000 calls per month, the difference between static `medium` and effort-routed sessions is roughly $700/month on inference alone, before accounting for latency improvements that reduce average handle time. These per-call figures do not follow from the hourly rates as listed: 8 minutes (about 0.133 hours) at $40 to $80 per hour works out to roughly $5.33 to $10.67, whereas $0.05–$0.11 corresponds to $0.40 to $0.80 per hour. Confirm the rates against OpenAI's pricing page before budgeting.

If you are running `gpt-4o-realtime-preview`, migrating to GPT-Realtime-2 requires three targeted changes:

1. Update the model ID to `gpt-realtime-2` in your session configuration.
2. Add the v2 header. Include `OpenAI-Beta: realtime=v2` in your WebSocket connection headers. The v2 endpoint handles parallel tool calls and the new reasoning parameters; omitting it returns a 400 error for v2-only features.
3. Review your context budget. If you built external context management to work around the 32K limit, you can simplify that logic, but do not remove it untested. Long-session workflows that never reset the conversation can still approach 128K on complex, tool-heavy calls.

The event schema is backwards compatible with the original Realtime API for core message types (`response.audio.delta`, `conversation.item.create`, `session.update`).
New event types for parallel tool calls and reasoning progress are additive, so existing event handlers will not break when you upgrade the model ID.

These three models together close the gaps that have kept voice AI in proof-of-concept status at most organizations. GPT-Realtime-2's reliability on multi-step tool chains makes it viable for healthcare intake, financial services, and legal workflows where previous voice models had error rates too high to trust without mandatory human review. GPT-Realtime-Translate enables global deployments without per-language model engineering or prompt localization. GPT-Realtime-Whisper makes real-time supervision and logging practical without architectural workarounds.

The highest-ROI first deployment for most organizations is customer support deflection. A voice agent built on GPT-Realtime-2 that handles the predictable 40–60% of support volume (account lookups, status checks, scheduling, FAQ) will show measurable deflection rates within 30 days of production deployment. Start there, instrument it with a live transcript from GPT-Realtime-Whisper for quality monitoring, then expand to higher-complexity query categories as the baseline matures.

Originally published at wowhow.cloud