A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation

Researchers introduced a practical evaluation method for long-form simultaneous speech-to-speech translation (SimulS2ST), using ASR and forced alignment to compute sentence-level latency and quality metrics. Experiments showed current systems suffer from substantial latency accumulation on long speech.

arXiv:2606.15059v1. Abstract: Simultaneous speech-to-speech translation (SimulS2ST) enables real-time cross-lingual communication, but existing evaluation has focused largely on short or pre-segmented speech rather than long-form, continuous input. Prior approaches are difficult to reproduce and make assumptions that do not hold for end-to-end systems. We present a practical evaluation method for long-form SimulS2ST. Given source speech, pre-segmented source transcripts, and reference translations, we run automatic speech recognition (ASR) and forced alignment on the generated target speech to recover token-level timestamps, then apply a sentence-embedding-based aligner to match the target text to its corresponding source sentences. This enables sentence-level computation of latency and quality metrics, including YAAL and xCOMET, which are then aggregated into final system-level scores. Experiments on representative SimulS2ST systems show that the method is effective in practice and reveal that current systems suffer from substantial latency accumulation on long speech.
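
To make the sentence-level step concrete, here is a minimal Python sketch of how per-sentence latency could be computed once target tokens carry timestamps and have been assigned to source sentences. It is not the paper's implementation: the lag used here (last target token time minus source sentence end time) is a simplified stand-in rather than YAAL, the names `SourceSentence`, `TargetToken`, and `sentence_lags` are hypothetical, and the timestamps are made-up illustrative values, not experimental results.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SourceSentence:
    start: float  # seconds into the source audio
    end: float


@dataclass
class TargetToken:
    text: str
    emit_time: float  # seconds; assumed to come from ASR + forced alignment
    sent_idx: int     # source sentence index; assumed to come from the aligner


def sentence_lags(sources, tokens):
    """Return one lag per source sentence (None if no token was aligned to it).

    Lag = time of the last aligned target token minus the source sentence end.
    This is a simplified proxy for illustration, not the YAAL metric.
    """
    last_emit = {}
    for tok in tokens:
        prev = last_emit.get(tok.sent_idx, float("-inf"))
        last_emit[tok.sent_idx] = max(prev, tok.emit_time)
    return [
        last_emit[i] - src.end if i in last_emit else None
        for i, src in enumerate(sources)
    ]


# Toy input: three consecutive source sentences and their aligned target tokens.
sources = [SourceSentence(0.0, 4.0), SourceSentence(4.0, 9.0), SourceSentence(9.0, 13.0)]
tokens = [
    TargetToken("a", 2.5, 0), TargetToken("b", 5.0, 0),
    TargetToken("c", 8.0, 1), TargetToken("d", 11.5, 1),
    TargetToken("e", 14.0, 2), TargetToken("f", 17.0, 2),
]

lags = sentence_lags(sources, tokens)
# Aggregate sentence-level values into a single system-level score.
system_lag = mean(lag for lag in lags if lag is not None)
print(lags, system_lag)
```

In this toy input the per-sentence lag grows from one sentence to the next, which is the kind of pattern the abstract calls latency accumulation; a single system-level average would hide that trend, which is one reason to keep the sentence-level values alongside the aggregate.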