Wav2Vec2 / WavLM audio classifier stuck at chance (33%) when training only the head

A developer reports that a Wav2Vec2/WavLM audio classifier for distinguishing Normal, Lateral, and Interdental sibilants is stuck at 33% accuracy when only the classification head is trained. The reply attributes the problem to a combination of excessive padding, missing attention masks, and a frozen base model, rather than to class imbalance alone.

Hmm… this looks like one of those problems that is not unsolvable, but is deeper than it first appears.

Direct answers to your four suspected points:

- Linear probing / head-only training: I would not expect head-only training to be reliable for this task. It is worth testing as a baseline, but for Normal vs Lateral vs Interdental /s,z/, I would at least try freezing only the feature encoder, or unfreezing the last few transformer layers.
- Learning rate: 1e-3 may be reasonable for a newly initialized head, but it is probably too high if you unfreeze the encoder. If you fine-tune encoder layers, I would use separate parameter groups: lower LR for the backbone, higher LR for the classifier head.
- Padding / attention mask: yes, I would treat the current fixed 1.0s padding as a major suspect. The issue is not only “should I pass an attention mask?”, because Wav2Vec2 attention-mask behavior is checkpoint-dependent. The bigger issue is that the actual fricative may be a small fraction of the pooled sequence.
- Class imbalance: I would handle it, but not first. Since the largest class is 580/1057, an always-Lateral baseline would be around 55%, so a ~33% result suggests something deeper than ordinary class imbalance.

So my first debugging order would be:

1. tiny-subset overfit
2. label / fold / trainable-parameter sanity checks
3. padding + pooling ablation
4. partial unfreezing / layer selection
5. acoustic baseline
6. then class weighting or sampling

Your class counts are:

| class | count |
|---|---|
| Lateral | 580 |
| Normal | 243 |
| Interdental | 234 |
| total | 1057 |

If the model truly predicted only the majority class on the whole distribution, the accuracy should be closer to 55%, not 33%. Fold-level distributions can change this, of course, but the shape still makes me suspicious of something else. Class imbalance may still hurt macro-F1 and minority recall, but I would not make it the first explanation.

The task itself does not look impossible.
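The majority-class argument can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the class counts quoted above; the variable names are illustrative, and it computes the pooled baseline only, so per-fold numbers will differ.

```python
# Class counts as quoted in the post.
counts = {"Lateral": 580, "Normal": 243, "Interdental": 234}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)

# Accuracy of a classifier that always predicts the most frequent class.
majority_baseline = counts[majority_class] / total

# Accuracy expected from uniform random guessing over the three classes.
uniform_chance = 1 / len(counts)

print(f"total clips:       {total}")
print(f"majority class:    {majority_class}")
print(f"majority baseline: {majority_baseline:.1%}")
print(f"uniform chance:    {uniform_chance:.1%}")
```

A result near 33% sits at uniform chance rather than at the roughly 55% majority baseline, which is why class imbalance alone is an unlikely first explanation.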
Similar sibilant / sigmatism-style classification problems have been studied before, often with MFCCs, log-Mel features, spectral-band energy, CNNs, or other acoustic features. So I would not conclude from the 33% result that the labels are necessarily impossible or that Wav2Vec2/WavLM are useless here.

But the current setup combines several hard conditions:

| issue | why it matters |
|---|---|
| median clip is ~0.16s | the useful fricative signal may only be a few feature frames |
| all clips padded/truncated to 1.0s | the pooled representation may be dominated by padding / near-silence behavior |
| no attention mask | default sequence pooling may average over frames that are not really useful |
| freeze base model | only the classification head adapts, not the SSL representation |
| default pooled sequence classifier | a very local acoustic contrast becomes one global vector |
| 3-way subtle articulation labels | Normal / Lateral / Interdental may require fine spectral/phonetic cues |

So I would treat the current result as a sign that the pipeline needs ablation, not as a final verdict on the task.

My best guess is that this is not one single bug. It is a stack of smaller issues pointing in the same direction. So I would phrase the diagnosis like this:

This still looks worth pursuing, but I would not expect class weighting alone to fix it. First prove that the labels are learnable with a tiny subset and an acoustic baseline. Then change the SSL extraction path: less padding, better pooling, layer selection, and partial fine-tuning.
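To make the "few feature frames" point concrete, the sketch below estimates how many encoder frames a clip produces. It assumes 16 kHz audio and the convolutional kernel/stride layout of the default Wav2Vec2 base configuration (kernels 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2). Neither is stated in the post, so check your checkpoint's config before relying on the numbers. The helper name `num_feature_frames` is hypothetical.

```python
# Assumed values (not stated in the post): 16 kHz input and the conv
# feature-encoder layout of the default Wav2Vec2 base configuration.
SAMPLE_RATE = 16_000
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_feature_frames(num_samples: int) -> int:
    """Frames left after the strided, unpadded conv stack (hypothetical helper)."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Standard output-length formula for a 1-D convolution without padding.
        length = (length - kernel) // stride + 1
    return length


median_clip_s = 0.16   # median clip duration quoted in the post
padded_clip_s = 1.0    # fixed padded/truncated length quoted in the post

speech_frames = num_feature_frames(round(median_clip_s * SAMPLE_RATE))
total_frames = num_feature_frames(round(padded_clip_s * SAMPLE_RATE))

print(f"frames from the median clip: {speech_frames}")
print(f"frames after padding to 1 s: {total_frames}")
print(f"share of frames with signal: {speech_frames / total_frames:.0%}")
```

Under these assumptions a 0.16 s clip yields 7 frames while the 1.0 s padded input yields 49, so an unmasked mean over time would be computed mostly from padded positions.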
- Wav2Vec2 docs / attention-mask note: https://huggingface.co/docs/transformers/en/model_doc/wav2vec2
- WavLM sequence classification docs: https://huggingface.co/docs/transformers/model_doc/wavlm
- HF audio classification guide: https://huggingface.co/docs/transformers/en/tasks/audio_classification
- Wav2Vec2 post-convolution attention-mask issue: "Return updated attention mask from Wav2Vec 2.0", Issue #25307, huggingface/transformers: https://github.com/huggingface/transformers/issues/25307
- SUPERB benchmark / frozen SSL evaluation context: https://sls.csail.mit.edu/publications/2021/JeffLai_Interspeech_2021.pdf
- Sigmatism detection paper: https://www.cstr.ed.ac.uk/downloads/publications/2012/Cassia_WOCCI12.pdf
- Sibilant consonant classification with neural networks: https://dspace.rcaap.pt/entities/publication/c0cc15ad-cf57-4a4c-bfd4-423d2e2894f5
- Wav2Vec2 layer-wise / suprasegmental analysis: "A layer-wise analysis of Mandarin and English suprasegmentals in SSL speech models" (arXiv 2408.13678): https://arxiv.org/abs/2408.13678
- Wav2Vec2 phonetic / tonal / speaker information analysis: "Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models" (arXiv 2506.10855): https://arxiv.org/abs/2506.10855
- Accent identification fine-tuning and phoneme/prosody probing: "What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model" (arXiv 2306.06524): https://arxiv.org/abs/2306.06524