{"slug": "eot-bench-open-benchmark-suite-for-end-of-turn-detection-in-voice-ai", "title": "Eot-bench: Open benchmark suite for end-of-turn detection in voice AI", "summary": "LiveKit released eot-bench, an open benchmark suite for end-of-turn detection in voice AI, along with the first open dataset of real human-to-agent conversations in 14 languages. The benchmark evaluates models at real pauses under latency and interruption budgets, and LiveKit's Turn Detector v1 achieved the strongest overall results, with only 9.9% false cutoffs at 300 ms latency and 543 ms latency at a 5% false cutoff rate.", "body_md": "Every voice agent has to answer the same question, over and over, on every\npause: **is the user done talking?** Answer too early and the agent talks over\npeople; answer too late and the conversation fills with dead air. End-of-turn\n(EoT) detection is the difference between an agent that feels like a\nconversation and one that feels like a walkie-talkie, and it has been one of the\nhardest open problems in voice AI since the first agents shipped.\n\nIt has also been hard to measure as a field. There's a lot of strong work on end-of-turn detection, but no shared, public way to compare it: results come from different private datasets and different methodologies, which makes them difficult to reproduce or line up side by side. What's been missing is common ground.\n\n**eot-bench is that common ground.** It's an open, reproducible benchmark, paired\nwith the first open [dataset](https://huggingface.co/datasets/livekit/eot-bench-data)\nof real human-to-agent conversations in 14 languages. Instead of scoring models\non isolated clips, it evaluates them the way a live voice agent does: at real\npauses, under real latency and interruption budgets. 
We built it to evaluate\n[LiveKit Turn Detector v1](https://livekit.com/blog/solving-end-of-turn-detection),\nand we're releasing it so anyone building an EoT model can measure on the same\nfooting.\n\n[`livekit/eot-bench-data`](https://huggingface.co/datasets/livekit/eot-bench-data) is\nthe first open dataset of its kind for end-of-turn detection:\n**real human-to-agent user turns**, with\n**aligned audio and textual context**, across\n**14 languages**: Arabic, Chinese, Dutch, English, French, German, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese, Spanish, and Turkish.\n\nEach row is a complete user turn from a task-oriented conversation, annotated\nwith every silence pause of at least 100 ms. The final pause is the true end of\nthe turn; every earlier pause is a mid-turn hesitation the agent should listen\nthrough. That structure is what lets the benchmark score a model on the *actual*\ndecisions a voice agent faces, rather than on isolated clips. It's freely\navailable for download and evaluation under Apache-2.0, alongside this repo.\n\nLiveKit Turn Detector v1 posts the strongest overall results of any model we\nevaluated, in English and across all 14 languages. Explore the full\n[interactive leaderboard](https://livekit.com/benchmarks/eot-bench): set a\nlatency or false-cutoff budget and watch every model re-rank on the Pareto\nfrontier and the per-language heatmap.\n\nThe clearest single view is how much dead air each model leaves at a fixed interruption budget. With each model tuned to interrupt the user no more than 5% of the time, how long after the user has actually finished does the agent wait before responding? Lower means a snappier conversation. (This is endpointing delay, not model inference time.)\n\nEnglish models at four operating points, ordered by false cutoffs at a 300 ms latency budget (best first). 
Lower is better on every metric:\n\n| Model | False cutoffs @ 300 ms | False cutoffs @ 600 ms | Latency @ 5% cutoff | Latency @ 10% cutoff |\n|---|---|---|---|---|\n| LiveKit Turn Detector v1 | 9.9% | 4.5% | 543 ms | 295 ms |\n| Deepgram Flux | 12.9% | 9.9% | 1151 ms | 548 ms |\n| ultraVAD | 27.7% | 11.9% | 899 ms | 663 ms |\n| LiveKit Turn Detector v1-mini | 27.8% | 12.1% | 1070 ms | 698 ms |\n| SmartTurn v3.2 | 35.2% | 14.8% | 1051 ms | 739 ms |\n| AssemblyAI | 49.4% | 14.6% | 1049 ms | 713 ms |\n| Soniox | – | 5.5% | 647 ms | 512 ms |\n| OpenAI GPT Realtime 2 | – | – | 1143 ms | 824 ms |\n| VAD baseline | 55.6% | 21.7% | 1600 ms | 1000 ms |\n\nA `–` means no policy setting reached that latency budget. The silence-only\n**VAD baseline** runs through the identical evaluation, so every learned and\ncommercial detector is always measured against timing alone. All numbers are\ngenerated from the reproducible artifacts committed under [output/](https://github.com/livekit/eot-bench/blob/main/output).\nThe [interactive leaderboard](https://livekit.com/benchmarks/eot-bench) adds the full Pareto frontier and the breakdown across all 14 languages.\n\nA good turn detector has to satisfy two goals that pull against each other.\n\nThe first is to **never cut the user off**. Interrupting before someone has\nfinished a thought is the most jarring failure a voice agent can make, so the\nprimary objective is to minimize the **false-cutoff rate**: how often the agent fires on a mid-turn\npause that wasn't actually the end of the turn. The second is to **respond\nquickly** once the user *is* done, so the conversation keeps flowing instead of\nfilling with dead air. That's **latency**: the time the agent waits after a true\nturn ending before taking the floor.\n\nLatency here is conversational dead air, not compute. It's how long the policy\nholds before it's confident the turn is over, not how fast the model runs\ninference. An instant model that waits 600 ms to be sure still shows 600 ms of\nlatency. 
Minimizing it means deciding correctly *sooner*, not computing faster.\n\nYou can trivially win either goal alone: wait forever and you'll never interrupt;\nfire instantly and you'll never lag. What matters is the **tradeoff** between\nthem. eot-bench measures that tradeoff directly. For every model it sweeps the\nendpointing policy, then reports the best latency achievable at a fixed\nfalse-cutoff budget (and vice versa), plus the full Pareto frontier. Lower-left\nis better: fewer interruptions, faster responses. A single accuracy score can't\ncapture this, which is why the benchmark is built around the tradeoff instead.\nSee [Evaluation Model](#evaluation-model) for the full methodology.\n\n- The public turn-level dataset `livekit/eot-bench-data`: real human-to-agent turns with audio and text context in 14 languages.\n- Batch and streaming adapter interfaces for local models and provider APIs, with reference adapters for LiveKit Turn Detector v1 / v1-mini, Deepgram Flux, AssemblyAI, Soniox, OpenAI GPT Realtime, SmartTurn, and ultraVAD.\n- Reproducible prediction artifacts, policy-sweep metrics, Pareto frontiers,\n  operating-point tables, and multilingual heatmaps committed under `output/`.\n- CLI commands for running a new adapter against one language or every supported dataset language, plus a Modal runner for scaled batch jobs.\n\nFrom the repo root:\n\n```\npython -m pip install -e \".[dev]\"\n```\n\nRegenerate the committed English comparison artifacts:\n\n```\neot-harness compare-models \\\n  output/livekit__eot-bench-data__validation__min_silence_100ms/en\n```\n\nRegenerate the multilingual operating-point artifacts:\n\n```\neot-harness compare-languages \\\n  output/livekit__eot-bench-data__validation__min_silence_100ms\n```\n\nThe evaluation is built around the decision an EoT model has to support in a production voice agent: at each silence, should the assistant respond now or keep listening? 
Instead of treating EoT detection as offline classification over isolated clips, the harness evaluates complete turns as causal silence decisions, then compares the policies those scores can support.\n\nEach dataset row is a complete human user turn from a task-oriented\nconversation. The row includes every silence span of at least 100 ms. The final\nsilence span is the true end of the user's turn and is labeled `eot`; every\nearlier silence span is a mid-turn pause and is labeled `hold`.\n\nThe harness asks each model to score those spans causally. For a prediction at\ntime `t`, the adapter receives only the audio, transcript context, and messages\nthat would have been available by `t`. This matters because EoT errors\nusually happen at ambiguous pauses inside a real turn, not at isolated clips\nwhere the model can implicitly rely on future context or offline segmentation.\n\nSpan-level evaluation turns each user turn into the actual decision points a voice system sees. A good model assigns high EoT probability to the final silence while keeping probability low through ordinary hesitations and mid-turn pauses.\n\nRaw model scores are not enough to compare models. Production systems apply a policy on top of the score, and that policy determines the user-visible tradeoff between responding quickly and avoiding false cutoffs. The harness sweeps the policy space instead of judging one hand-picked threshold or timeout.\n\nThe swept policy has three knobs:\n\n- `threshold`: the EoT confidence needed to end the user turn.\n- `action_delay`: the minimum silence duration before the system is allowed to act on the model score.\n- `timeout`: the maximum silence duration the system will hold before ending the turn even if the model has not fired.\n\nThe harness evaluates these knobs together. Raising `action_delay` can reduce\nfalse interruptions by ignoring short mid-turn pauses, but it adds the same\nlatency to every correctly detected true turn ending. 
Raising `timeout` lets the\nsystem tolerate longer mid-turn pauses, but it makes false negatives more\nexpensive because a missed true end-of-turn leaves the assistant waiting until\nthe timeout expires.\n\nComparison reports focus on operating points under explicit false-cutoff and latency budgets, plus the latency/cutoff Pareto frontier. They also include a VAD-only baseline evaluated on the same policy grid, so learned EoT models are compared against silence timing alone.\n\nScalar classification metrics such as `auc` and `ap` remain available in\nper-run diagnostics, but they do not rank models or drive comparison reports.\nThey answer a different question from deployment behavior and can be misleading\nfor streaming APIs that expose events after an internal server-side hold or\naction delay. With such APIs, short mid-turn pauses may look correctly rejected without putting\nthe corresponding latency cost on an explicit policy knob. The policy sweep\nkeeps that cost visible in the latency/cutoff frontier and named operating\npoints.\n\nThe harness expects an existing dataset in the public EoT benchmark schema. Dataset construction, annotation, VAD extraction, and source-specific ingestion live outside this package.\n\nRequired dataset fields:\n\n- `id`\n- `language`\n- `audio`\n- `silence_spans`\n\nOptional dataset fields:\n\n- `messages`\n- `words`\n\n`audio` is a Hugging Face [Audio](https://huggingface.co/docs/datasets/about_dataset_features#audio-feature)\nfeature. The harness loads it with the column cast to `Audio(decode=False)` (so\nno `torchcodec` runtime decoder is required) and decodes the raw bytes itself\nwith `soundfile`. Rows may therefore arrive either encoded as `{\"bytes\": ...}`\nor already decoded as `{\"array\": ..., \"sampling_rate\": ...}`; the harness handles\nboth.\n\n`silence_spans` is a list of spans with `start` and `end` times in seconds. Prediction\ngeneration skips spans shorter than `0.1s`. 
Every generated span is labeled\n`hold` unless it is the final span in `silence_spans`, which is labeled `eot`.\n\nIf `messages` is present, it is copied into each causal adapter input. If\n`words` is present, the harness appends a current-turn user message containing\nwords whose `end` time is at or before `timestamp - transcript_lag`. When both\nfields are absent, adapters receive an empty `messages` list.\n\nThe `language` column is always required. The harness does not interpret the\ndataset config name after loading; it writes artifacts under one child directory\nper observed row language.\n\nFrom the repo root:\n\n```\npython -m pip install -e .\n```\n\nFor local development and tests:\n\n```\npython -m pip install -e \".[dev]\"\n```\n\nRuntime dependencies cover the core harness, Hugging Face dataset I/O, the\nModal runner, plotting, and the Deepgram streaming client. `requirements.txt`\nmirrors those runtime dependencies for environments that prefer requirements\nfiles. The local LiveKit, Smart Turn, and UltraVAD model adapters import heavier\nmodel runtimes lazily, such as `livekit-local-inference`, `transformers`,\n`onnxruntime`, `torch`, and `torchaudio`. Install those separately for local\nmodel runs, or use the Modal runner presets, which build images with the needed\nmodel dependencies.\n\nThe CLI and Modal runner load auth from `eot_harness/.env` with\n`python-dotenv`. Copy `eot_harness/.env.example` to `eot_harness/.env` and use\nthe canonical key names:\n\n```\nHF_TOKEN=...\nLIVEKIT_API_KEY=...\nLIVEKIT_API_SECRET=...\n# Optional: override the inference gateway (e.g. 
a local server).\n# LIVEKIT_INFERENCE_URL=http://localhost:8080/v1\nDEEPGRAM_API_KEY=...\nASSEMBLYAI_API_KEY=...\nSONIOX_API_KEY=...\nXAI_API_KEY=...\nSPEECHMATICS_API_KEY=...\nOPENAI_API_KEY=...\n```\n\n**Batch Prediction**\n\nRun a pointwise batch adapter:\n\n```\neot-harness predict \\\n  --path livekit/eot-bench-data \\\n  --name all \\\n  --split validation \\\n  --adapter eot_harness.livekit_turn_detector_mini_adapter:LiveKitTurnDetectorMiniAdapter \\\n  --output-dir output\n```\n\n`--repo-id` and `--subset` remain accepted aliases for `--path` and `--name`.\n\nUseful options:\n\n- `--min-silence-span`: defaults to `0.1` seconds and defines the dataset span set for prediction.\n- `--batch-size`: defaults to `128`.\n- `--inference-interval`: defaults to `0.1` seconds.\n- `--transcript-lag`: defaults to `0.5` seconds.\n- `--overwrite`: replace an existing model run with the same model options.\n\nBatch prediction writes a span-set parent directory under `--output-dir`:\n\n```\noutput/livekit__eot-bench-data__validation__min_silence_100ms/\n  en/\n    span_set.parquet\n    span_set_manifest.json\n    live_kit_turn_detector_mini_adapter__<options-hash>/\n      predictions.parquet\n      manifest.json\n  de/\n    span_set.parquet\n    span_set_manifest.json\n    live_kit_turn_detector_mini_adapter__<options-hash>/\n      predictions.parquet\n      manifest.json\n```\n\nThe span-set root directory is determined by `path`, `split`, and\n`min_silence_span`. The dataset `name` is only passed through to\n`load_dataset`; row `language` values determine child directories. The model\ndirectory is determined by the adapter name and a hash of prediction-affecting\nmodel options. The harness validates that each model run's predictions contain\nexactly the parent language span set, defined as the unique `(id, span_index)`\npairs.\n\nAdapters expose `supports_language(lang_code)`. 
When the loaded dataset has\nmultiple languages, unsupported languages are skipped before inference. With\n`--overwrite` unset, complete existing language/model artifacts are skipped too,\nso rerunning `--name all` fills only missing supported languages.\n\nPrediction commands log progress to stderr every 30 seconds by default,\nincluding elapsed time, throughput, and ETA. Use `--progress-interval 0` to\ndisable progress logs.\n\nEach model run writes:\n\n- `predictions.parquet`\n- `manifest.json`\n\n`predictions.parquet` has one row per scored timestamp, with these columns:\n\n- `id`\n- `language`\n- `span_index`\n- `timestamp`\n- `silence_dur`\n- `p_eot`\n- `label`\n\n**Streaming Prediction**\n\nRun a streaming/API adapter:\n\n```\neot-harness predict-streaming \\\n  --path livekit/eot-bench-data \\\n  --name all \\\n  --split validation \\\n  --adapter eot_harness.livekit_turn_detector_adapter:LiveKitTurnDetectorAdapter \\\n  --output-dir output\n```\n\nThe LiveKit cloud audio adapter streams each turn to the LiveKit Turn Detector v1\n(`turn-detector-v1`) model over the agent-gateway EOT websocket protocol. For every scored silence-span grid\npoint it feeds audio up to that timestamp and issues an explicit inference\nrequest, so each `p_eot` reflects exactly the causal audio the harness exposes.\nIt requires `LIVEKIT_API_KEY` and `LIVEKIT_API_SECRET`; set `LIVEKIT_INFERENCE_URL`\nto target a non-default gateway (e.g. `http://localhost:8080/v1` for a local\nserver).\n\nThe streaming command loads full dataset rows and lets the adapter produce\nprediction rows for each turn. It writes the same per-language span-set/model-run\nlayout as batch prediction and skips complete existing language artifacts when\n`--overwrite` is unset. 
It supports:\n\n- `--concurrency`: defaults to the adapter's `concurrency` attribute, or `1`.\n- `--model`: adapter model override when supported.\n- `--chunk-ms`: streaming chunk size override when supported.\n- `--eot-threshold`: Deepgram Flux EoT threshold override.\n- `--limit`: only score the first N dataset rows, useful for API smoke tests.\n- `--overwrite`: replace existing language/model run artifacts.\n- `--progress-interval`: seconds between stderr progress logs; use `0` to disable.\n- `--skip-unsupported-languages`: backward-compatible language skipping flag; unsupported rows are filtered before API calls when the adapter exposes `supports_language(lang_code)`.\n- `--skip-errors`: record failed streaming rows in `skipped.parquet` instead of aborting the whole run.\n\nStreaming API adapters require their provider-specific credentials in\n`eot_harness/.env`, such as `LIVEKIT_API_KEY`/`LIVEKIT_API_SECRET`,\n`DEEPGRAM_API_KEY`, `ASSEMBLYAI_API_KEY`,\n`SONIOX_API_KEY`, `XAI_API_KEY`, `SPEECHMATICS_API_KEY`, or `OPENAI_API_KEY`.\nThe AssemblyAI adapter also accepts `ASSEMBLY_API_KEY` and `ASSEMBLY_AI_KEY` as\nlocal aliases.\n\nEach streaming model run writes:\n\n- `predictions.parquet`\n- `events.parquet`\n- `manifest.json`\n- `summary.json`\n- `skipped.parquet`, only when rows are skipped\n\n**Metrics**\n\nCompute metrics from any conforming `predictions.parquet`:\n\n```\neot-harness compute-metrics \\\n  --predictions output/livekit__eot-bench-data__validation__min_silence_100ms/en/live_kit_turn_detector_adapter__OPTIONS_HASH/predictions.parquet \\\n  --score-point 0.2 \\\n  --output-dir output/livekit__eot-bench-data__validation__min_silence_100ms/en/live_kit_turn_detector_adapter__OPTIONS_HASH/metrics\n```\n\nMetrics output:\n\n- `tradeoff.parquet`\n- `summary.json`\n\nMetric defaults:\n\n- `--score-point` defaults to the prediction manifest's `score_point` when present. 
If neither the flag nor a manifest value is available, scalar metrics use each span's max score and policy metrics use the first threshold crossing in each span.\n- `--min-hold-span-duration` defaults to `0.2` seconds.\n- `--max-hold-span-duration` defaults to `5.0` seconds.\n\nOnly hold spans are filtered by the min/max hold-span duration. True EoT spans\nare kept and must have a score. When `--score-point` is set, that score point\nmust exist on the prediction grid for every span that survives to that duration;\nfor example, `--score-point 0.2` is compatible with the default `0.1s`\ninference interval.\n\nThe per-run metric summary includes scalar classification diagnostics (`auc`,\n`ap`) and a joint threshold/action-delay/timeout sweep with a VAD-only baseline.\nUse the sweep-derived operating points for model comparisons; `auc` and `ap`\nare diagnostic only and are intentionally omitted from comparison reports.\n\n**Model Comparison**\n\nBuild a comparison report from the per-model metric artifacts under a span-set directory:\n\n```\neot-harness compare-models \\\n  output/livekit__eot-bench-data__validation__min_silence_100ms/en\n```\n\n`compare-models` does not recompute metrics. It discovers model run directories\nwith `metrics/tradeoff.parquet` and `metrics/summary.json`, validates that their\nmin/max hold-span duration settings match the command options, orders models by\nbest mean latency under the 5% false-cutoff budget, and uses `display_name` from\neach prediction manifest when present. 
It writes:\n\n```\ncomparison/\n  report.md\n  pareto_frontier.png\n  cutoff_rate_at_latency_budget_300_600ms.png\n  latency_at_cutoff_budget_5_10pct.png\n```\n\nThe report puts the Pareto frontier first, followed by operating-point bar charts and tables for:\n\n- best cutoff rate at 0.3s and 0.6s latency budgets\n- best mean latency at 5% and 10% false-cutoff budgets\n\nEach operating-point table includes the VAD baseline and bolds the best value in\neach metric column. The final section contains the full operating-point table.\nThe Markdown report uses GitHub-renderable image references and tables.\n`compare-models` does not write parquet or JSON artifacts into `comparison/`;\nregenerating the report removes stale comparison-side parquet/JSON files from\nolder harness versions.\n\n**Language Comparison**\n\nBuild operating-point heatmaps across every language directory under a span-set root:\n\n```\neot-harness compare-languages \\\n  output/livekit__eot-bench-data__validation__min_silence_100ms\n```\n\n`compare-languages` reads existing per-language model metrics and does not\nrecompute predictions or metrics. 
It writes the same policy operating points\nused by `compare-models` into one report:\n\n```\nlanguage_comparison/\n  summary.json\n  metrics.parquet\n  heatmap_best_cutoff_rate_at_0_3s_latency.png\n  heatmap_best_cutoff_rate_at_0_6s_latency.png\n  heatmap_best_mean_latency_at_5pct_cutoff.png\n  heatmap_best_mean_latency_at_10pct_cutoff.png\n  report.md\n```\n\nThe heatmaps use the fully spelled-out operating-point names:\n\n- Best cutoff rate @ 0.3s latency\n- Best cutoff rate @ 0.6s latency\n- Best mean latency @ 5% cutoff\n- Best mean latency @ 10% cutoff\n\nWhen a language directory includes `span_set.parquet`, the language comparison\nalso computes a fine-grid VAD baseline so it can appear beside learned and API\nmodels in the heatmaps and metrics table.\n\n**Adapter Contracts**\n\nAdapter references may be written as `module:attribute` or `module.attribute`.\nIf the referenced attribute is a class, the harness instantiates it with no\narguments.\n\nBatch adapters must define:\n\n```\nclass MyAdapter:\n    adapter_id = \"my-model\"\n    score_point = 0.2\n\n    def predict_batch(self, batch):\n        return [0.0 for _ in batch]\n```\n\nEach `batch` item contains:\n\n- `audio`: the causal audio prefix for the prediction timestamp.\n- `messages`: the causal message history after transcript-lag handling, or an empty list when the dataset has no text fields.\n\n`predict_batch` must return one `p_eot` score per input.\n\nStreaming adapters must define:\n\n```\nclass MyStreamingAdapter:\n    adapter_id = \"my-streaming-model\"\n    concurrency = 4\n\n    async def predict_turn(self, row, *, inference_interval):\n        return {\n            \"id\": row[\"id\"],\n            \"audio_sec\": 0.0,\n            \"events\": [],\n            \"prediction_rows\": [],\n        }\n```\n\n`prediction_rows` must use the same required score columns as batch\n`predictions.parquet`: `id`, `span_index`, `timestamp`, `silence_dur`, 
`p_eot`,\nand `label`. The harness fills `language` from the dataset row when a streaming\nadapter omits it.\nStreaming adapters can expose `supports_language(lang_code)` and should use\n`row[\"language\"]` for provider-specific language hints or request parameters.\nAdapters may return `{\"skipped\": True, \"id\": ..., \"reason\": ...}` for skipped\nturns.\n\nBuilt-in adapter examples:\n\n- `eot_harness.livekit_turn_detector_adapter:LiveKitTurnDetectorAdapter`\n- `eot_harness.livekit_turn_detector_mini_adapter:LiveKitTurnDetectorMiniAdapter`\n- `eot_harness.smart_turn_adapter:SmartTurnAudioAdapter`\n- `eot_harness.ultravad_adapter:UltraVADAdapter`\n- `eot_harness.deepgram_flux_adapter:DeepgramFluxStreamingAdapter`\n- `eot_harness.assemblyai_adapter:AssemblyAIStreamingAdapter`\n- `eot_harness.soniox_adapter:SonioxStreamingAdapter`\n- `eot_harness.openai_realtime_adapter:OpenAIRealtime2Adapter`\n\nStreaming STT adapters produce `p_eot` from the provider's native endpointing\nsurface. Deepgram Flux and AssemblyAI expose confidence-style scores. 
Soniox\nand OpenAI Realtime semantic VAD currently map endpoint events to binary scores:\n`0.0` before the provider endpoint event has fired and `1.0` after it has fired.\nThe AssemblyAI adapter defaults to `universal-streaming-multilingual` with\n`min_turn_silence=100`, `max_turn_silence=3000`, and\n`end_of_turn_confidence_threshold=0.1` so the harness receives probability-valued\n`end_of_turn_confidence` events across AssemblyAI's supported dataset languages.\n\n`LiveKitTurnDetectorAdapter` is a streaming adapter that scores each turn\nwith the LiveKit Turn Detector v1 (`turn-detector-v1`) cloud model over the\nagent-gateway EOT websocket;\nrun it with `predict-streaming` and `LIVEKIT_API_KEY`/`LIVEKIT_API_SECRET`.\n\nUltraVAD currently supports `--batch-size 1` only.\n\n`LiveKitTurnDetectorMiniAdapter` parallelizes calls to the public local-inference\ninterface with up to 8 worker threads by default. Set\n`LIVEKIT_TURN_DETECTOR_MINI_WORKERS=1` for single-threaded runs, or another positive\ninteger to tune local throughput.\n\nThe LiveKit text adapters in `eot_harness.livekit_text_adapter` are deprecated and\nkept only for old comparisons.\n\nText fields are model-specific requirements, not global schema requirements.\nUltraVAD consumes previous assistant text when `messages` is present. 
Deprecated\nLiveKit text adapters need transcript text from `words` and/or `messages`.\n\n**Modal Batch Prediction**\n\nRun Modal-backed batch prediction jobs with the package-owned entrypoint:\n\n```\nmodal run eot_harness.modal_runner::run_predict \\\n  --config-json '{\"path\":\"livekit/eot-bench-data\",\"name\":\"all\",\"split\":\"validation\",\"adapter\":\"eot_harness.livekit_turn_detector_mini_adapter:LiveKitTurnDetectorMiniAdapter\",\"modal_preset\":\"audio\"}'\n```\n\nThe Modal runner writes the same `predictions.parquet` and `manifest.json` as\nlocal `predict`, including the same span-set/model-run directory layout under\n`--output-dir`, which defaults to `output`. Supported presets are `default`,\n`audio`, and `ultravad`; choose one with `--preset` or by adding\n`\"modal_preset\": \"audio\"` to the config JSON.\n\nThe Modal runner currently calls the batch `predict` path, not\n`predict-streaming`.", "url": "https://wpnews.pro/news/eot-bench-open-benchmark-suite-for-end-of-turn-detection-in-voice-ai", "canonical_source": "https://github.com/livekit/eot-bench", "published_at": "2026-06-18 17:14:14+00:00", "updated_at": "2026-06-18 17:31:12.562211+00:00", "lang": "en", "topics": ["artificial-intelligence", "ai-products", "ai-research", "natural-language-processing"], "entities": ["LiveKit", "LiveKit Turn Detector v1", "Deepgram Flux", "ultraVAD", "SmartTurn v3.2", "AssemblyAI", "Soniox", "OpenAI GPT Realtime 2"], "alternates": {"html": "https://wpnews.pro/news/eot-bench-open-benchmark-suite-for-end-of-turn-detection-in-voice-ai", "markdown": "https://wpnews.pro/news/eot-bench-open-benchmark-suite-for-end-of-turn-detection-in-voice-ai.md", "text": "https://wpnews.pro/news/eot-bench-open-benchmark-suite-for-end-of-turn-detection-in-voice-ai.txt", "jsonld": "https://wpnews.pro/news/eot-bench-open-benchmark-suite-for-end-of-turn-detection-in-voice-ai.jsonld"}}