Direct Preference Optimization for English-Mandarin Code-Switching Speech Recognition in Audio LLMs

Researchers have identified three systematic failure modes in Audio LLMs when transcribing English-Mandarin code-switching speech: language omission, translation-instead-of-transcription, and hallucination. They apply Direct Preference Optimization (DPO) to three Audio LLMs, using 100,000 preference pairs (570 hours of audio) in which the chosen response preserves the mixed-language content and the rejected response mimics one of the failure patterns. The aligned models reduce mixed error rate (MER) by up to 89.6% in-distribution and 20.0% out-of-distribution. The findings suggest that DPO can effectively elicit correct code-switching transcription behavior from multilingual Audio LLMs.
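To make the reported metric concrete, here is a minimal sketch of a mixed error rate computation. It assumes the common convention of scoring Mandarin per character and English per word; the paper's exact tokenization and text normalization are not stated here, so the regular expression and the function names `mixed_tokens`, `edit_distance`, and `mer` are illustrative assumptions.

```python
import re

# Assumption: one token per CJK character, one token per English word.
_TOKEN = re.compile(r"[\u4e00-\u9fff]|[a-z']+")


def mixed_tokens(text):
    """Split mixed English-Mandarin text into scoring units."""
    return _TOKEN.findall(text.lower())


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def mer(reference, hypothesis):
    """Edit errors divided by the number of reference tokens."""
    ref = mixed_tokens(reference)
    hyp = mixed_tokens(hypothesis)
    if not ref:
        raise ValueError("reference has no scorable tokens")
    return edit_distance(ref, hyp) / len(ref)


# Made-up example sentences, not taken from the paper's data.
reference = "我今天要去 meeting"
translated = "我今天要去开会"   # translation-instead-of-transcription
omitted = "meeting"            # language omission

print(mer(reference, reference))
print(mer(reference, translated))
print(mer(reference, omitted))
```

Under this convention, a hypothesis that translates the English word into Mandarin is penalized even though its meaning is preserved, which is why translation-instead-of-transcription shows up as transcription error.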

arXiv:2605.23975v1. Abstract: Audio large language models (Audio LLMs) exhibit systematic failures in transcribing code-switching speech despite strong multilingual capabilities. Focusing on English-Mandarin, we identify three failure modes: language omission, translation-instead-of-transcription, and hallucination. We apply Direct Preference Optimization (DPO) to align models, constructing preference pairs in which chosen responses preserve mixed-language content while rejected responses mimic failure patterns. Training three Audio LLMs on 100K pairs (570 hours), we observe consistent behavioral shifts: models learn to preserve language composition rather than translating when prompted for transcription. This alignment yields MER reductions up to 89.6% in-distribution and 20.0% out-of-distribution. Our findings suggest DPO can effectively elicit correct code-switching transcription behavior from multilingual Audio LLMs.
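The preference-pair setup described in the abstract can be sketched with the standard DPO objective, which minimizes -log sigmoid(beta * [(log pi(chosen) - log pi_ref(chosen)) - (log pi(rejected) - log pi_ref(rejected))]). The snippet below is a minimal illustration, not the authors' training code: the `PreferencePair` fields, the file name, the prompt wording, the example transcripts, the log-probability values, and `beta=0.1` are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One DPO training example (field names are hypothetical)."""
    audio_path: str   # code-switching utterance
    prompt: str       # transcription instruction
    chosen: str       # preserves the mixed-language content
    rejected: str     # mimics one of the failure patterns


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Illustrative pair: the rejected response translates the English word.
pair = PreferencePair(
    audio_path="utt_0001.wav",  # hypothetical file
    prompt="Transcribe the speech.",
    chosen="我今天要去 meeting",
    rejected="我今天要去开会",
)

# Made-up log-probabilities, standing in for model scores of each response.
before = dpo_loss(-12.0, -9.0, -12.0, -9.0)  # policy equals reference
after = dpo_loss(-8.0, -15.0, -12.0, -9.0)   # policy now favors chosen
print(before, after)
```

When the policy matches the reference model the margin is zero and the loss equals log 2 (about 0.693); the loss falls as the policy assigns relatively more probability to the chosen transcript than to the rejected one.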