Eot-bench: Open benchmark suite for end-of-turn detection in voice AI LiveKit released eot-bench, an open benchmark suite for end-of-turn detection in voice AI, along with the first open dataset of real human-to-agent conversations in 14 languages. The benchmark evaluates models at real pauses under latency and interruption budgets, and LiveKit's Turn Detector v1 achieved the strongest overall results, with only 9.9% false cutoffs at 300 ms latency and 543 ms latency at a 5% false cutoff rate. Every voice agent has to answer the same question, over and over, on every pause: is the user done talking? Answer too early and the agent talks over people; answer too late and the conversation fills with dead air. End-of-turn EoT detection is the difference between an agent that feels like a conversation and one that feels like a walkie-talkie, and it has been one of the hardest open problems in voice AI since the first agents shipped. It has also been hard to measure as a field. There's a lot of strong work on end-of-turn detection, but no shared, public way to compare it: results come from different private datasets and different methodologies, which makes them difficult to reproduce or line up side by side. What's been missing is common ground. eot-bench is that common ground. It's an open, reproducible benchmark, paired with the first open dataset https://huggingface.co/datasets/livekit/eot-bench-data of real human-to-agent conversations in 14 languages. Instead of scoring models on isolated clips, it evaluates them the way a live voice agent does: at real pauses, under real latency and interruption budgets. We built it to evaluate LiveKit Turn Detector v1 https://livekit.com/blog/solving-end-of-turn-detection , and we're releasing it so anyone building an EoT model can measure on the same footing. is the first open dataset of its kind for end-of-turn detection: livekit/eot-bench-data real human-to-agent user turns , with aligned audio and textual context , across 14 languages : Arabic, Chinese, Dutch, English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese, Spanish, and Turkish. Each row is a complete user turn from a task-oriented conversation, annotated with every silence pause of at least 100 ms. The final pause is the true end of the turn; every earlier pause is a mid-turn hesitation the agent should listen through. That structure is what lets the benchmark score a model on the actual decisions a voice agent faces, rather than on isolated clips. It's freely available for download and evaluation, Apache-2.0 alongside this repo. LiveKit Turn Detector v1 posts the strongest overall results of any model we evaluated, in English and across all 14 languages. Explore the full interactive leaderboard » . Set a latency or false-cutoff budget and watch every model re-rank on the Pareto frontier and the per-language heatmap. The clearest single view is how much dead air each model leaves at a fixed interruption budget. Tuned to interrupt the user no more than 5% of the time, how long after the user has actually finished does the agent wait before responding? Lower means a snappier conversation. This is endpointing delay, not model inference time. English models at four operating points, ordered by false cutoffs at a 300 ms latency budget best first . Lower is better on every metric: | Model | False cutoffs @ 300 ms | False cutoffs @ 600 ms | Latency @ 5% cutoff | Latency @ 10% cutoff | |---|---|---|---|---| LiveKit Turn Detector v1 | 9.9% | 4.5% | 543 ms | 295 ms | | Deepgram Flux | 12.9% | 9.9% | 1151 ms | 548 ms | | ultraVAD | 27.7% | 11.9% | 899 ms | 663 ms | | LiveKit Turn Detector v1-mini | 27.8% | 12.1% | 1070 ms | 698 ms | | SmartTurn v3.2 | 35.2% | 14.8% | 1051 ms | 739 ms | | AssemblyAI | 49.4% | 14.6% | 1049 ms | 713 ms | | Soniox | – | 5.5% | 647 ms | 512 ms | | OpenAI GPT Realtime 2 | – | – | 1143 ms | 824 ms | | VAD baseline | 55.6% | 21.7% | 1600 ms | 1000 ms | A – means no policy setting reached that latency budget. The silence-only VAD baseline runs through the identical evaluation, so every learned and commercial detector is always measured against timing alone. All numbers are generated from the reproducible artifacts committed under output/ /livekit/eot-bench/blob/main/output . The interactive leaderboard https://livekit.com/benchmarks/eot-bench adds the full Pareto frontier and the breakdown across all 14 languages. A good turn detector has to satisfy two goals that pull against each other. The first is to never cut the user off . Interrupting before someone has finished a thought is the most jarring failure a voice agent can make, so the primary objective is to minimize the false-cutoff rate : firing on a mid-turn pause that wasn't actually the end of the turn. The second is to respond quickly once the user is done, so the conversation keeps flowing instead of filling with dead air. That's latency : the time the agent waits after a true turn ending before taking the floor. Latency here is conversational dead air, not compute. It's how long the policy holds before it's confident the turn is over, not how fast the model runs inference. An instant model that waits 600 ms to be sure still shows 600 ms of latency. Minimizing it means deciding correctly sooner , not computing faster. You can trivially win either goal alone: wait forever and you'll never interrupt; fire instantly and you'll never lag. What matters is the tradeoff between them. eot-bench measures that tradeoff directly. For every model it sweeps the endpointing policy, then reports the best latency achievable at a fixed false-cutoff budget and vice versa , plus the full Pareto frontier. Lower-left is better: fewer interruptions, faster responses. A single accuracy score can't capture this, which is why the benchmark is built around the tradeoff instead. See Evaluation Model evaluation-model for the full methodology. - The public turn-level dataset : real human-to-agent turns with audio and text context in 14 languages. livekit/eot-bench-data - Batch and streaming adapter interfaces for local models and provider APIs, with reference adapters for LiveKit Turn Detector v1 / v1-mini, Deepgram Flux, AssemblyAI, Soniox, OpenAI GPT Realtime, SmartTurn, and ultraVAD. - Reproducible prediction artifacts, policy-sweep metrics, Pareto frontiers, operating-point tables, and multilingual heatmaps committed under output/ . - CLI commands for running a new adapter against one language or every supported dataset language, plus a Modal runner for scaled batch jobs. From the repo root: python -m pip install -e ". dev " Regenerate the committed English comparison artifacts: eot-harness compare-models \ output/livekit eot-bench-data validation min silence 100ms/en Regenerate the multilingual operating-point artifacts: eot-harness compare-languages \ output/livekit eot-bench-data validation min silence 100ms The evaluation is built around the decision an EoT model has to support in a production voice agent: at each silence, should the assistant respond now or keep listening? Instead of treating EoT detection as offline classification over isolated clips, the harness evaluates complete turns as causal silence decisions, then compares the policies those scores can support. Each dataset row is a complete human user turn from a task-oriented conversation. The row includes every silence span of at least 100 ms. The final silence span is the true end of the user's turn and is labeled eot ; every earlier silence span is a mid-turn pause and is labeled hold . The harness asks each model to score those spans causally. For a prediction at time t , the adapter receives only the audio, transcript context, and messages that would have been available by t . This matters because EoT errors usually happen at ambiguous pauses inside a real turn, not at isolated clips where the model can implicitly rely on future context or offline segmentation. Span-level evaluation turns each user turn into the actual decision points a voice system sees. A good model assigns high EoT probability to the final silence while keeping probability low through ordinary hesitations and mid-turn pauses. Raw model scores are not enough to compare models. Production systems apply a policy on top of the score, and that policy determines the user-visible tradeoff between responding quickly and avoiding false cutoffs. The harness sweeps the policy space instead of judging one hand-picked threshold or timeout. The swept policy has three knobs: threshold : the EoT confidence needed to end the user turn. action delay : the minimum silence duration before the system is allowed to act on the model score. timeout : the maximum silence duration the system will hold before ending the turn even if the model has not fired. The harness evaluates these knobs together. Raising action delay can reduce false interruptions by ignoring short mid-turn pauses, but it adds the same latency to every correctly detected true turn ending. Raising timeout lets the system tolerate longer mid-turn pauses, but it makes false negatives more expensive because a missed true end-of-turn leaves the assistant waiting until the timeout expires. Comparison reports focus on operating points under explicit false-cutoff and latency budgets, plus the latency/cutoff Pareto frontier. They also include a VAD-only baseline evaluated on the same policy grid, so learned EoT models are compared against silence timing alone. Scalar classification metrics such as auc and ap remain available in per-run diagnostics, but they do not rank models or drive comparison reports. They answer a different question from deployment behavior and can be misleading for streaming APIs that expose events after an internal server-side hold or action delay. Short mid-turn pauses may look correctly rejected without putting the corresponding latency cost on an explicit policy knob. The policy sweep keeps that cost visible in the latency/cutoff frontier and named operating points. The harness expects an existing dataset in the public EoT benchmark schema. Dataset construction, annotation, VAD extraction, and source-specific ingestion live outside this package. Required dataset fields: id language audio silence spans Optional dataset fields: messages words audio is a Hugging Face Audio https://huggingface.co/docs/datasets/about dataset features audio-feature feature. The harness loads it with the column cast to Audio decode=False — so no torchcodec runtime decoder is required — and decodes the raw bytes itself with soundfile . Rows may therefore arrive either encoded as {"bytes": ...} or already decoded as {"array": ..., "sampling rate": ...} ; the harness handles both. silence spans is a list of spans with start and end seconds. Prediction generation skips spans shorter than 0.1s . Every generated span is labeled hold unless it is the final span in silence spans , which is labeled eot . If messages is present, it is copied into each causal adapter input. If words is present, the harness appends a current-turn user message containing words whose end time is at or before timestamp - transcript lag . When both fields are absent, adapters receive an empty messages list. The language column is always required. The harness does not interpret the dataset config name after loading; it writes artifacts under one child directory per observed row language. From the repo root: python -m pip install -e . For local development and tests: python -m pip install -e ". dev " Runtime dependencies cover the core harness, Hugging Face dataset I/O, the Modal runner, plotting, and the Deepgram streaming client. requirements.txt mirrors those runtime dependencies for environments that prefer requirements files. The local LiveKit, Smart Turn, and UltraVAD model adapters import heavier model runtimes lazily, such as livekit-local-inference , transformers , onnxruntime , torch , and torchaudio . Install those separately for local model runs, or use the Modal runner presets, which build images with the needed model dependencies. The CLI and Modal runner load auth from eot harness/.env with python-dotenv . Copy eot harness/.env.example to eot harness/.env and use the canonical key names: HF TOKEN=... LIVEKIT API KEY=... LIVEKIT API SECRET=... Optional: override the inference gateway e.g. a local server . LIVEKIT INFERENCE URL=http://localhost:8080/v1 DEEPGRAM API KEY=... ASSEMBLYAI API KEY=... SONIOX API KEY=... XAI API KEY=... SPEECHMATICS API KEY=... OPENAI API KEY=... Batch Prediction Run a pointwise batch adapter: eot-harness predict \ --path livekit/eot-bench-data \ --name all \ --split validation \ --adapter eot harness.livekit turn detector mini adapter:LiveKitTurnDetectorMiniAdapter \ --output-dir output --repo-id and --subset remain accepted aliases for --path and --name . Useful options: --min-silence-span : defaults to 0.1 seconds and defines the dataset span set for prediction. --batch-size : defaults to 128 . --inference-interval : defaults to 0.1 seconds. --transcript-lag : defaults to 0.5 seconds. --overwrite : replace an existing model run with the same model options. Batch prediction writes a span-set parent directory under --output-dir : output/livekit eot-bench-data validation min silence 100ms/ en/ span set.parquet span set manifest.json live kit turn detector mini adapter