EchoDistill: Alignment Noisy-to-Clean Self-Distillation for Robust Audio LLMs

Researchers have developed EchoDistill, a self-distillation framework that makes Audio Large Language Models more robust to real-world noise by aligning a noisy-audio student's responses with clean semantic references from a frozen clean-audio teacher. The method uses group-relative policy optimization (GRPO) with token-level consistency with the teacher as a reward bonus, and it reports an average improvement of 4.18% in GSR over the strongest baseline under strong noise, with no additional inference cost. The approach targets the vulnerability of audio LLMs to noise, which often induces semantic drift and hallucinations.

arXiv:2605.23954v1 Abstract: Audio Large Language Models (ALLMs) are highly vulnerable to real-world noise, which often induces severe semantic drift and hallucinations. Existing robustness methods primarily rely on waveform-level acoustic enhancement, answer-level supervision, or the internal suppression of noise representations. To address these issues, we propose EchoDistill, an alignment-based noisy-to-clean self-distillation framework. EchoDistill leverages a frozen clean-audio teacher to provide semantic references for an inference-time noisy-audio student. Specifically, the student samples candidate responses under noisy conditions to expose its test-time behavior. These trajectories are then optimized via group-relative policy optimization (GRPO), where token-level consistency with the teacher acts as a reward bonus. By aligning the noisy student's candidate responses with clean semantic evidence and applying audio-aware reward shaping, our method encourages reasoning trajectories that are both correct and genuinely acoustically grounded. EchoDistill significantly improves the semantic reliability and task performance of Audio LLMs under complex noise without introducing any additional inference cost. Extensive experiments show that: (I) compared with the strongest baseline, EchoDistill achieves an average improvement of 4.18% in GSR under strong noise; (II) ablation results on Qwen-Omni further show that EchoDistill improves over the GRPO-only variant by 3.02% in Acc, 3.89% in Noisy, and 4.53% in GSR on average. Our code is available at https://anonymous.4open.science/r/echodistill-10DE.
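
The reward structure described in the abstract (a group of sampled noisy-audio responses, each scored by a task reward plus a teacher-consistency bonus, then normalized within the group) can be illustrated with a rough sketch. This is not the authors' implementation. The function names, the `bonus_weight` value, and the example strings are hypothetical, and token-overlap F1 is used only as a stand-in because the abstract does not define how token-level consistency is computed. The within-group mean and standard deviation normalization follows the commonly used GRPO advantage formulation.

```python
from collections import Counter
from statistics import mean, pstdev


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two strings.

    Assumption: a simple stand-in for the paper's token-level consistency.
    """
    cand, ref = candidate.split(), reference.split()
    if not cand or not ref:
        return 0.0
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def group_relative_advantages(candidates, teacher_reference, task_rewards,
                              bonus_weight=0.5, eps=1e-6):
    """Add a consistency bonus to each task reward, then normalize in-group.

    bonus_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    rewards = [
        r + bonus_weight * token_f1(c, teacher_reference)
        for c, r in zip(candidates, task_rewards)
    ]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical example: one clean-audio teacher reference and three
# responses sampled by the student from the noisy version of the same audio.
teacher_reference = "a dog is barking"
candidates = [
    "a dog is barking",
    "a dog is barking loudly outside",
    "someone is playing a piano",
]
task_rewards = [1.0, 1.0, 0.0]  # assumed answer-level correctness scores

advantages = group_relative_advantages(candidates, teacher_reference, task_rewards)
for text, adv in zip(candidates, advantages):
    print(f"{adv:+.3f}  {text}")
```

In this toy setup, the first two candidates receive the same task reward, so the consistency bonus is what separates them: the response that matches the clean reference most closely gets the largest advantage. In a real training loop these advantages would weight the policy-gradient update for the student, while the teacher stays frozen and is not needed at inference time.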