AURA-ST: Acoustic-Unconstrained Residual Architecture for Speech Translation

Researchers presented AURA-ST, a three-stage modular pipeline for low-resource speech-to-text translation, at IWSLT 2026 (African-Celtic Track 1). Instead of using cross-attention between audio and text, the system feeds projected acoustic representations to a frozen large language model as a token prefix. For Hausa, Igbo, and Yoruba translation, it reports a best proxy teacher-forced SacreBLEU of 91.29 on the validation set.
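The prefix idea can be illustrated with a minimal sketch: acoustic frames are mapped by a linear layer into the LLM embedding width and placed in front of the text token embeddings, so the model sees one continuous sequence. The function names, toy dimensions, and weights below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def project(frames: Matrix, weight: Matrix, bias: Vector) -> Matrix:
    """Linearly project each acoustic frame into the LLM embedding space.

    frames: T x d_audio, weight: d_audio x d_model, bias: d_model.
    """
    d_model = len(bias)
    return [
        [sum(f[i] * weight[i][j] for i in range(len(f))) + bias[j]
         for j in range(d_model)]
        for f in frames
    ]


def build_prefixed_input(audio_frames: Matrix, weight: Matrix, bias: Vector,
                         text_embeddings: Matrix) -> Matrix:
    """Prepend the projected audio frames to the text token embeddings."""
    prefix = project(audio_frames, weight, bias)
    return prefix + text_embeddings


# Toy sizes (assumed): 3 audio frames of width 2, LLM width 4, 2 text tokens.
audio_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weight = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
bias = [0.0, 0.0, 0.0, 0.0]
text_embeddings = [[9.0, 9.0, 9.0, 9.0], [8.0, 8.0, 8.0, 8.0]]

inputs = build_prefixed_input(audio_frames, weight, bias, text_embeddings)
print(len(inputs), len(inputs[0]))  # sequence length, embedding width
```

In this sketch the audio prefix and the text tokens share a single sequence axis, which is why no separate cross-attention module is needed: the language model's own self-attention covers both.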

Authors: Barathi Ganesh HB, Michal Ptaszynski, Jairam R, Reshma Unnikrishnan

Abstract

We present AURA-ST, a three-stage modular pipeline for low-resource speech-to-text translation submitted to the IWSLT 2026 African-Celtic Track 1. The architecture bypasses traditional cross-attention between audio and text modalities by treating projected acoustic representations as a native token prefix to a frozen large language model. A dual-stream encoder captures linguistic and paralinguistic features via jointly trained semantic and paralinguistic encoders. A convolutional subsampler then bridges the modality gap through a 4x temporal compression and a linear projection into the LLM embedding space. Finally, an MLP-targeted Low-Rank Adaptation (LoRA) adapter fine-tunes the frozen Gemma-4-E2B backbone for translation without catastrophic forgetting of base language model knowledge. We further identify and resolve an incompatibility between standard PEFT attention-level adapter injection and the Gemma-4 Per-Layer Embedding architecture, which tends to cause gradient isolation. Trained on the IWSLT 2026 Track 1 data covering Hausa, Igbo, and Yoruba, the final system achieves a best proxy teacher-forced SacreBLEU of 91.29 on the validation set at Phase 3, with Phase 1 speech encoder validation loss converging to 0.651.

- Anthology ID: 2026.iwslt-1.28
- Volume: Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026)
- Month: July
- Year: 2026
- Address: San Diego, USA (in-person and online)
- Editors: Elizabeth Salesky, Antonios Anastasopoulos, Matteo Negri, Marcello Federico
- Venues: IWSLT | WS
- SIG: SIGSLT
- Publisher: Association for Computational Linguistics
- Pages: 247–254
- URL: https://aclanthology.org/2026.iwslt-1.28/
- Cite (ACL): Barathi Ganesh HB, Michal Ptaszynski, Jairam R, and Reshma Unnikrishnan. 2026. AURA-ST: Acoustic-Unconstrained Residual Architecture for Speech Translation. In Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026), pages 247–254, San Diego, USA (in-person and online). Association for Computational Linguistics.
- Cite (Informal): AURA-ST: Acoustic-Unconstrained Residual Architecture for Speech Translation (HB et al., IWSLT 2026)
- PDF: https://aclanthology.org/2026.iwslt-1.28.pdf
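
The abstract states only that the convolutional subsampler applies a 4x temporal compression. One common way to get that factor in speech encoders is to stack two stride-2 1-D convolutions; the sketch below assumes that layout (kernel 3, stride 2, padding 1), which the source does not confirm, and simply works out how many frames would reach the LLM as prefix tokens.

```python
def conv1d_output_length(length: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a 1-D convolution with dilation 1."""
    return (length + 2 * padding - kernel) // stride + 1


def subsampled_length(num_frames: int, num_layers: int = 2, kernel: int = 3,
                      stride: int = 2, padding: int = 1) -> int:
    """Frames remaining after a stack of strided convolutions.

    Defaults (two layers, stride 2 each) are an assumption that yields
    roughly 4x compression; they are not taken from the paper.
    """
    for _ in range(num_layers):
        num_frames = conv1d_output_length(num_frames, kernel, stride, padding)
    return num_frames


for n in (100, 400, 1500):
    print(n, "->", subsampled_length(n))
```

With these assumed settings, 100 encoder frames become 25 prefix positions, so the compression directly shortens the sequence the language model has to attend over.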