[ARTICLE · art-37473] src=discuss.huggingface.co ↗ pub= topic=machine-learning verified=true sentiment=· neutral

Wav2vec2 / WavLM audio classifier stuck at chance (33%) — only training the head

A developer reports that a Wav2Vec2/WavLM audio classifier for distinguishing Normal, Lateral, and Interdental sibilants is stuck at 33% accuracy when only training the classification head. The issue is attributed to a combination of excessive padding, lack of attention masks, and frozen feature encoders, rather than class imbalance alone.

read 4 min · views 1 · published Jun 24, 2026

Hmm… it looks like this may be one of those problems that is not unsolvable, but is deeper than it first appears.

Direct answers to your four suspected points:

Linear probing / head-only training: I would not expect head-only training to be reliable for this task. It is worth testing as a baseline, but for Normal vs Lateral vs Interdental /s,z/, I would at least try freeze_feature_encoder() only (which freezes just the convolutional front end and leaves the transformer trainable), or unfreezing the last few transformer layers.

Learning rate: 1e-3 may be reasonable for a newly initialized head, but it is probably too high if you unfreeze the encoder. If you fine-tune encoder layers, I would use separate parameter groups: lower LR for the backbone, higher LR for the classifier head.

Padding / attention mask: yes, I would treat the current fixed 1.0s padding as a major suspect. The issue is not only “should I pass attention_mask?”, because Wav2Vec2 attention-mask behavior is checkpoint-dependent. The bigger issue is that the actual fricative may be a small fraction of the pooled sequence.

Class imbalance: I would handle it, but not first. Since the largest class is 580/1057, an always-Lateral baseline would be around 55%, so a ~33% result suggests something deeper than ordinary class imbalance.
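For the learning-rate point, here is a minimal sketch of separate parameter groups. The helper name build_param_groups, the 1e-5 backbone LR, and the "projector" / "classifier" name prefixes are my assumptions (the prefixes match how I understand the Hugging Face sequence-classification heads to be named, so check them against model.named_parameters() for your checkpoint). It only sorts parameters by name, so it works on anything that yields (name, parameter) pairs.

```python
def build_param_groups(named_params, head_prefixes=("projector", "classifier"),
                       backbone_lr=1e-5, head_lr=1e-3):
    """Split trainable parameters into backbone and head groups with separate LRs.

    `named_params` is any iterable of (name, parameter) pairs, for example
    model.named_parameters(). The prefixes and LR values are assumptions.
    """
    backbone, head = [], []
    for name, param in named_params:
        # Skip frozen parameters so the optimizer only sees what should train.
        if not getattr(param, "requires_grad", True):
            continue
        if name.startswith(head_prefixes):
            head.append(param)
        else:
            backbone.append(param)

    groups = []
    if backbone:
        groups.append({"params": backbone, "lr": backbone_lr})
    if head:
        groups.append({"params": head, "lr": head_lr})
    return groups
```

The returned list has the shape that PyTorch optimizers accept, so something like torch.optim.AdamW(build_param_groups(model.named_parameters())) should work. With everything except the head frozen, it returns a single group, which is also a quick way to confirm what is actually trainable.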

So my first debugging order would be:

tiny-subset overfit
→ label / fold / trainable-parameter sanity checks
→ padding + pooling ablation
→ partial unfreezing / layer selection
→ acoustic baseline
→ then class weighting or sampling
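For the first step in that order, a rough sketch of how I would pick the tiny subset: a handful of clips per class, fixed by a seed so the run is repeatable. The function name tiny_balanced_subset and the choice of 8 clips per class are assumptions, not anything from your code.

```python
import random
from collections import defaultdict


def tiny_balanced_subset(labels, per_class=8, seed=0):
    """Return sorted indices of up to `per_class` examples from each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    chosen = []
    for label in sorted(by_class):
        idxs = by_class[label]
        # Take fewer than `per_class` if the class is smaller than that.
        chosen.extend(rng.sample(idxs, min(per_class, len(idxs))))
    return sorted(chosen)
```

Train and evaluate on the same 24 or so clips. If the model cannot drive training accuracy close to 100% there, I would look for a pipeline problem (labels, preprocessing, frozen parameters) before touching anything else.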

Your class counts are:

class | count
Lateral | 580
Normal | 243
Interdental | 234
total | 1057

If the model truly predicted only the majority class on the whole distribution, the accuracy should be closer to 55%, not 33%. Fold-level distributions can change this, of course, but a result this far below the majority baseline still makes me suspect something else.

Class imbalance may still hurt macro-F1 and minority recall, but I would not make it the first explanation.
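The arithmetic behind those numbers, as a small sketch using the counts above. The inverse-frequency weights use the common "balanced" heuristic, total / (n_classes * count). That is my choice for illustration, for when you do get to class weighting, and not something your setup requires.

```python
counts = {"Lateral": 580, "Normal": 243, "Interdental": 234}
total = sum(counts.values())

# Accuracy of a model that always predicts the largest class.
majority_accuracy = max(counts.values()) / total
# Accuracy expected from uniform random guessing over three classes.
uniform_chance = 1 / len(counts)

# "Balanced" inverse-frequency weights: total / (n_classes * class_count).
class_weights = {name: total / (len(counts) * n) for name, n in counts.items()}

print(f"majority-class accuracy: {majority_accuracy:.3f}")  # 0.549
print(f"uniform chance:          {uniform_chance:.3f}")     # 0.333
for name, weight in class_weights.items():
    print(f"weight[{name}] = {weight:.3f}")  # 0.607, 1.450, 1.506
```

A model sitting at 0.333 rather than 0.549 looks like uniform guessing over three classes, not collapse onto Lateral.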

The task itself does not look impossible. Similar sibilant / sigmatism-style classification problems have been studied before, often with MFCCs, log-Mel features, spectral-band energy, CNNs, or other acoustic features. So I would not conclude from the 33% result that the labels are necessarily impossible or that Wav2Vec2/WavLM are useless here.

But the current setup combines several hard conditions:

issue | why it matters
median clip is ~0.16s | the useful fricative signal may only be a few feature frames
all clips padded/truncated to 1.0s | the pooled representation may be dominated by padding / near-silence behavior
no attention_mask | default sequence pooling may average over frames that are not really useful
freeze_base_model() | only the classification head adapts, not the SSL representation
default pooled sequence classifier | a very local acoustic contrast becomes one global vector
3-way subtle articulation labels | Normal / Lateral / Interdental may require fine spectral/phonetic cues

So I would treat the current result as a sign that the pipeline needs ablation, not as a final verdict on the task.
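To put a number on "a few feature frames", here is a sketch that pushes a sample count through the convolutional front end. It assumes 16 kHz audio and the default Wav2Vec2 conv configuration (kernels 10,3,3,3,3,2,2 and strides 5,2,2,2,2,2,2). I believe the base WavLM checkpoints use the same values, but check conv_kernel and conv_stride in your model config.

```python
# Assumed defaults; verify against config.conv_kernel / config.conv_stride.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000  # assumption: clips are resampled to 16 kHz


def num_feature_frames(num_samples, kernels=CONV_KERNELS, strides=CONV_STRIDES):
    """Number of encoder frames produced for a waveform of `num_samples`."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
    return length


speech_frames = num_feature_frames(round(0.16 * SAMPLE_RATE))
padded_frames = num_feature_frames(round(1.0 * SAMPLE_RATE))
print(speech_frames, padded_frames)  # 7 49
```

Under those assumptions a median 0.16s clip covers about 7 of the 49 frames in a 1.0s input, so an unmasked mean over the whole sequence is roughly 86% padding-derived frames.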

My best guess is that this is not one single bug. It is a stack of smaller issues, like the ones in the table above, pointing in the same direction.

So I would phrase the diagnosis like this:

This still looks worth pursuing, but I would not expect class weighting alone to fix it. First prove that the labels are learnable with a tiny subset and an acoustic baseline. Then change the SSL extraction path: less padding, better pooling, layer selection, and partial fine-tuning.
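As an illustration of the "better pooling" part, a toy sketch of masked mean pooling in plain Python. The function name masked_mean_pool is hypothetical, and the padded frames are set to zero only to keep the arithmetic readable (real encoder outputs at padded positions are generally not zero). On tensors the same idea is a masked sum divided by the number of valid frames.

```python
def masked_mean_pool(frames, valid):
    """Average only the frame vectors whose `valid` flag is true."""
    kept = [frame for frame, keep in zip(frames, valid) if keep]
    if not kept:
        raise ValueError("no valid frames to pool")
    dim = len(kept[0])
    return [sum(frame[d] for frame in kept) / len(kept) for d in range(dim)]


# Two "speech" frames followed by six padding frames (toy values).
frames = [[1.0, 3.0], [3.0, 5.0]] + [[0.0, 0.0]] * 6
valid = [True, True] + [False] * 6

plain = masked_mean_pool(frames, [True] * len(frames))  # mean over all 8 frames
masked = masked_mean_pool(frames, valid)                # mean over the 2 valid frames
print(plain)   # [0.5, 1.0]
print(masked)  # [2.0, 4.0]
```

The unmasked mean is pulled toward the padding frames, while the masked mean keeps the statistics of the frames that actually contain the fricative. The valid flags would come from the true clip lengths mapped to frame counts.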

Wav2Vec2 docs / attention-mask note:

Wav2Vec2 · Hugging Face

WavLM sequence classification docs:

WavLM · Hugging Face

HF audio classification guide:

Audio classification · Hugging Face

Wav2Vec2 post-convolution attention-mask issue:

Return updated attention mask from Wav2Vec 2.0 · Issue #25307 · huggingface/transformers · GitHub

SUPERB benchmark / frozen SSL evaluation context:

https://sls.csail.mit.edu/publications/2021/JeffLai_Interspeech_2021.pdf

Sigmatism detection paper:

https://www.cstr.ed.ac.uk/downloads/publications/2012/Cassia_WOCCI12.pdf

Sibilant consonant classification with neural networks:

https://dspace.rcaap.pt/entities/publication/c0cc15ad-cf57-4a4c-bfd4-423d2e2894f5

Wav2Vec2 layer-wise / suprasegmental analysis:

[2408.13678] A layer-wise analysis of Mandarin and English suprasegmentals in SSL speech models

Wav2Vec2 phonetic / tonal / speaker information analysis:

[2506.10855] Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models

Accent identification fine-tuning and phoneme/prosody probing:

[2306.06524] What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model
