cd /news/machine-learning/data-scale-not-latency-shapes-cross-… · home topics machine-learning article
[ARTICLE · art-37255] src=arxiv.org ↗ pub= topic=machine-learning verified=true sentiment=· neutral

Data Scale, Not Latency, Shapes Cross-Lingual Encoder Transfer in Streaming ASR

A new study from arXiv shows that multilingual encoder initialization benefits streaming automatic speech recognition (ASR) only at low target-language data scales, with the advantage decaying as data increases. Across eight European languages, the word error rate gap between multilingual and English-only initialization shrinks from +4.21 percentage points at 100 hours to near zero at 2500 hours, and this pattern holds across different streaming latency tiers. The findings suggest that latency and quantization decisions can be made independently of the initialization choice.

read1 min views2 publishedJun 24, 2026

arXiv:2606.24169v1 Announce Type: new Abstract: Adapting a streaming speech recognition model to a new language requires choosing between two plausible warm starts: a multilingual (ML) encoder or an English-only (EN) encoder. The common intuition is that the multilingual encoder should help most at low data, but it is unclear how long that advantage persists, whether tight streaming latency amplifies it, and whether it survives deployment quantization. We answer these questions with a controlled sweep of a 0.6 B-parameter cache-aware FastConformer transducer across eight European languages, up to five target-language data scales (100 h to 2500 h), three streaming tiers plus offline decoding, and up to four public test sets. The main result is that multilingual initialization is a data-limited advantage, not a latency-limited one. On FLEURS at 160 ms, the mean EN-ML word error rate (WER) gap falls from +4.21 percentage points (pp) at 100 h to +0.20 pp at 2500 h; a power-law fit summarizes this decay, with each doubling of target-language data roughly halving the remaining advantage. Across the three streaming tiers, the across-language mean EN-ML gap is approximately stable at each scale from 100 to 1000 h, and is near zero by 2500 h. Finally, 4-bit weight-only encoder quantization at the matched 560 ms streaming tier reduces the encoder footprint by about 3x, with an average FLEURS WER increase of about 0.5 pp. The resulting guideline is simple: use multilingual initialization in low-data regimes, treat the choice as effectively irrelevant at large data, and make latency and quantization decisions independently.

── more in #machine-learning 4 stories · sorted by recency
── more on @arxiv 3 stories trending now
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main
Live at https://your-agent.zahid.host
Get free account → Pricing
from €0/mo · no card required
LIVE [news/data-scale-not-laten…] indexed:0 read:1min 2026-06-24 ·