{"slug": "wav2vec2-wavlm-audio-classifier-stuck-at-chance-33-only-training-the-head", "title": "Wav2vec2 / WavLM audio classifier stuck at chance (33%) — only training the head", "summary": "A developer reports that a Wav2Vec2/WavLM audio classifier for distinguishing Normal, Lateral, and Interdental sibilants is stuck at 33% accuracy when only training the classification head. The issue is attributed to a combination of excessive padding, lack of attention masks, and a fully frozen base model, rather than class imbalance alone.", "body_md": "Hmm… it looks like this may be one of those problems that is not unsolvable, but is deeper than it first appears:\n\nDirect answers to your four suspected points:\n\n**Linear probing / head-only training:** I would not expect head-only training to be reliable for this task. It is worth testing as a baseline, but for Normal vs Lateral vs Interdental /s,z/, I would try at least `freeze_feature_encoder()` only, or unfreezing the last few transformer layers.\n\n**Learning rate:** `1e-3` may be reasonable for a newly initialized head, but it is probably too high if you unfreeze the encoder. If you fine-tune encoder layers, I would use separate parameter groups: lower LR for the backbone, higher LR for the classifier head.\n\n**Padding / attention mask:** yes, I would treat the current fixed 1.0s padding as a major suspect. The issue is not only “should I pass `attention_mask`?”, because Wav2Vec2 attention-mask behavior is checkpoint-dependent. The bigger issue is that the actual fricative may be a small fraction of the pooled sequence.\n\n**Class imbalance:** I would handle it, but not first. 
Since the largest class is 580/1057, an always-Lateral baseline would be around 55%, so a ~33% result suggests something deeper than ordinary class imbalance.\n\nSo my first debugging order would be:\n\n```\ntiny-subset overfit\n→ label / fold / trainable-parameter sanity checks\n→ padding + pooling ablation\n→ partial unfreezing / layer selection\n→ acoustic baseline\n→ then class weighting or sampling\n```\n\nYour class counts are:\n\n| class | count |\n|---|---|\n| Lateral | 580 |\n| Normal | 243 |\n| Interdental | 234 |\n| total | 1057 |\n\nIf the model truly predicted only the majority class on the whole distribution, the accuracy should be closer to 55%, not 33%. Fold-level distributions can change this, of course, but the shape still makes me suspicious of something else.\n\nClass imbalance may still hurt macro-F1 and minority recall, but I would not make it the first explanation.\n\nThe task itself does not look impossible. Similar sibilant / sigmatism-style classification problems have been studied before, often with MFCCs, log-Mel features, spectral-band energy, CNNs, or other acoustic features. 
So I would not conclude from the 33% result that the labels are necessarily unlearnable or that Wav2Vec2/WavLM are useless here.\n\nBut the current setup combines several hard conditions:\n\n| issue | why it matters |\n|---|---|\n| median clip is ~0.16s | the useful fricative signal may only be a few feature frames |\n| all clips padded/truncated to 1.0s | the pooled representation may be dominated by padding / near-silence behavior |\n| no `attention_mask` | default sequence pooling may average over frames that are not really useful |\n| `freeze_base_model()` | only the classification head adapts, not the SSL representation |\n| default pooled sequence classifier | a very local acoustic contrast becomes one global vector |\n| 3-way subtle articulation labels | Normal / Lateral / Interdental may require fine spectral/phonetic cues |\n\nSo I would treat the current result as a sign that the **pipeline needs ablation**, not as a final verdict on the task.\n\nMy best guess is that this is not one single bug. It is a stack of smaller issues pointing in the same direction.\n\nSo I would phrase the diagnosis like this:\n\nThis still looks worth pursuing, but I would not expect class weighting alone to fix it. First prove that the labels are learnable with a tiny subset and an acoustic baseline. 
Then change the SSL extraction path: less padding, better pooling, layer selection, and partial fine-tuning.\n\nWav2Vec2 docs / attention-mask note:\n\n[Wav2Vec2 · Hugging Face](https://huggingface.co/docs/transformers/en/model_doc/wav2vec2)\n\nWavLM sequence classification docs:\n\n[WavLM · Hugging Face](https://huggingface.co/docs/transformers/model_doc/wavlm)\n\nHF audio classification guide:\n\n[Audio classification · Hugging Face](https://huggingface.co/docs/transformers/en/tasks/audio_classification)\n\nWav2Vec2 post-convolution attention-mask issue:\n\n[Return updated attention mask from Wav2Vec 2.0 · Issue #25307 · huggingface/transformers · GitHub](https://github.com/huggingface/transformers/issues/25307)\n\nSUPERB benchmark / frozen SSL evaluation context:\n\n[https://sls.csail.mit.edu/publications/2021/JeffLai_Interspeech_2021.pdf](https://sls.csail.mit.edu/publications/2021/JeffLai_Interspeech_2021.pdf)\n\nSigmatism detection paper:\n\n[https://www.cstr.ed.ac.uk/downloads/publications/2012/Cassia_WOCCI12.pdf](https://www.cstr.ed.ac.uk/downloads/publications/2012/Cassia_WOCCI12.pdf)\n\nSibilant consonant classification with neural networks:\n\n[https://dspace.rcaap.pt/entities/publication/c0cc15ad-cf57-4a4c-bfd4-423d2e2894f5](https://dspace.rcaap.pt/entities/publication/c0cc15ad-cf57-4a4c-bfd4-423d2e2894f5)\n\nWav2Vec2 layer-wise / suprasegmental analysis:\n\n[[2408.13678] A layer-wise analysis of Mandarin and English suprasegmentals in SSL speech models](https://arxiv.org/abs/2408.13678)\n\nWav2Vec2 phonetic / tonal / speaker information analysis:\n\n[[2506.10855] Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models](https://arxiv.org/abs/2506.10855)\n\nAccent identification fine-tuning and phoneme/prosody probing:\n\n[[2306.06524] What Can an Accent Identifier Learn? 
Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model](https://arxiv.org/abs/2306.06524)", "url": "https://wpnews.pro/news/wav2vec2-wavlm-audio-classifier-stuck-at-chance-33-only-training-the-head", "canonical_source": "https://discuss.huggingface.co/t/wav2vec2-wavlm-audio-classifier-stuck-at-chance-33-only-training-the-head/177119#post_2", "published_at": "2026-06-24 08:49:11+00:00", "updated_at": "2026-06-24 08:49:40.272954+00:00", "lang": "en", "topics": ["machine-learning"], "entities": ["Wav2Vec2", "WavLM", "Hugging Face"], "alternates": {"html": "https://wpnews.pro/news/wav2vec2-wavlm-audio-classifier-stuck-at-chance-33-only-training-the-head", "markdown": "https://wpnews.pro/news/wav2vec2-wavlm-audio-classifier-stuck-at-chance-33-only-training-the-head.md", "text": "https://wpnews.pro/news/wav2vec2-wavlm-audio-classifier-stuck-at-chance-33-only-training-the-head.txt", "jsonld": "https://wpnews.pro/news/wav2vec2-wavlm-audio-classifier-stuck-at-chance-33-only-training-the-head.jsonld"}}