Wav2vec2 / WavLM audio classifier stuck at chance (33%) — only training the head

[ARTICLE · art-37473] src=discuss.huggingface.co ↗ pub=2026-06-24T08:49Z topic=machine-learning verified=true sentiment=· neutral

A developer reports that a Wav2Vec2/WavLM audio classifier for distinguishing Normal, Lateral, and Interdental sibilants is stuck at 33% accuracy when training only the classification head. The reply attributes the issue to a combination of excessive padding, a missing attention mask, and a fully frozen base model, rather than to class imbalance alone.

Hmm… it looks like this may be one of those problems that is not unsolvable, but is deeper than it first appears.

Direct answers to your four suspected points:

Linear probing / head-only training: I would not expect head-only training to be reliable for this task. It is worth testing as a baseline, but for Normal vs Lateral vs Interdental /s,z/, I would at least try freezing only the feature encoder with `freeze_feature_encoder()`, or unfreezing the last few transformer layers.

Learning rate: `1e-3` may be reasonable for a newly initialized head, but it is probably too high if you unfreeze the encoder. If you fine-tune encoder layers, I would use separate parameter groups: lower LR for the backbone, higher LR for the classifier head.

Padding / attention mask: yes, I would treat the current fixed 1.0s padding as a major suspect. The issue is not only “should I pass `attention_mask`?”, because Wav2Vec2 attention-mask behavior is checkpoint-dependent. The bigger issue is that the actual fricative may be a small fraction of the pooled sequence.

Class imbalance: I would handle it, but not first. Since the largest class is 580/1057, an always-Lateral baseline would be around 55%, so a ~33% result suggests something deeper than ordinary class imbalance.
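A minimal sketch of the separate-parameter-group idea in PyTorch. `ToyClassifier` is a hypothetical stand-in, not Wav2Vec2 or WavLM, and the backbone learning rate of `1e-5` is an assumed starting value that is not from the thread; only the `1e-3` head rate is. With a real Hugging Face model the same pattern would apply to whichever layers you leave trainable.

```python
import torch
from torch import nn


class ToyClassifier(nn.Module):
    """Hypothetical stand-in for an SSL backbone plus a classification head."""

    def __init__(self, hidden: int = 16, num_labels: int = 3):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(hidden, hidden), nn.GELU(), nn.Linear(hidden, hidden)
        )
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, x):
        return self.classifier(self.backbone(x))


model = ToyClassifier()

# Partial unfreezing: keep the first backbone layer frozen, train the last one.
for param in model.backbone[0].parameters():
    param.requires_grad = False

backbone_params = [p for p in model.backbone.parameters() if p.requires_grad]
head_params = list(model.classifier.parameters())

# Two parameter groups: small LR for the backbone, larger LR for the new head.
optimizer = torch.optim.AdamW(
    [
        {"params": backbone_params, "lr": 1e-5},  # assumed value, tune it
        {"params": head_params, "lr": 1e-3},      # head LR from the thread
    ]
)

# Sanity check: confirm how many parameters will actually receive gradients.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable} / {total}")
```

Printing the trainable-parameter count is a cheap check that the freezing you intended is the freezing you got.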

So my first debugging order would be:

tiny-subset overfit
→ label / fold / trainable-parameter sanity checks
→ padding + pooling ablation
→ partial unfreezing / layer selection
→ acoustic baseline
→ then class weighting or sampling
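The first step, the tiny-subset overfit, can be sketched roughly as follows. This is an illustration with NumPy and random stand-in features, not the real pipeline: the 12 clips, the 64-dimensional embeddings, and the learning rate are all assumptions. The point is only the shape of the test: a head that cannot memorize a dozen examples points to a bug in labels, features, or the training loop rather than to a hard task.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for pooled embeddings: 12 clips, 64 dims, 3 classes.
n_clips, dim, n_classes = 12, 64, 3
features = rng.normal(size=(n_clips, dim))
labels = np.arange(n_clips) % n_classes  # 4 clips per class
one_hot = np.eye(n_classes)[labels]

weights = np.zeros((dim, n_classes))
bias = np.zeros(n_classes)


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


# Plain gradient descent on softmax cross-entropy (a linear head only).
learning_rate = 0.1  # assumed value for this toy problem
for _ in range(500):
    probs = softmax(features @ weights + bias)
    grad_logits = (probs - one_hot) / n_clips
    weights -= learning_rate * features.T @ grad_logits
    bias -= learning_rate * grad_logits.sum(axis=0)

probs = softmax(features @ weights + bias)
train_acc = float((probs.argmax(axis=1) == labels).mean())
train_loss = float(-np.log(probs[np.arange(n_clips), labels]).mean())
print(f"train accuracy: {train_acc:.2f}, train loss: {train_loss:.4f}")
```

In the real setup, you would swap the random features for embeddings of 10 to 20 actual clips and expect training accuracy to reach 100%.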

Your class counts are:

class	count
Lateral	580
Normal	243
Interdental	234
total	1057

If the model truly predicted only the majority class on the whole distribution, the accuracy should be closer to 55%, not 33%. Fold-level distributions can change this, of course, but the shape still makes me suspicious of something else.

Class imbalance may still hurt macro-F1 and minority recall, but I would not make it the first explanation.
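The arithmetic behind that comparison, as a quick check. The class counts are the ones from the table above; the variable names are just for illustration.

```python
counts = {"Lateral": 580, "Normal": 243, "Interdental": 234}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a model that always predicts the largest class.
majority_acc = counts[majority_label] / total

# Accuracy of guessing uniformly at random among the three labels.
uniform_chance = 1 / len(counts)

print(f"total clips: {total}")
print(f"always-{majority_label} accuracy: {majority_acc:.1%}")  # 54.9%
print(f"uniform random guess accuracy: {uniform_chance:.1%}")   # 33.3%
```

An observed ~33% sits at the uniform-guess level rather than the majority-class level, which is why it reads as uninformative outputs rather than a collapse onto Lateral.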

The task itself does not look impossible. Similar sibilant / sigmatism-style classification problems have been studied before, often with MFCCs, log-Mel features, spectral-band energy, CNNs, or other acoustic features. So I would not conclude from the 33% result that the labels are necessarily impossible or that Wav2Vec2/WavLM are useless here.
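As a rough idea of what a spectral-band-energy baseline feature could look like, here is a NumPy-only sketch. The 16 kHz sample rate and the four band edges are assumptions chosen for illustration, not values from the thread or from the cited papers, and the test signal is a synthetic tone rather than a real fricative.

```python
import numpy as np

SAMPLE_RATE = 16_000                         # assumed sample rate
BAND_EDGES_HZ = (0, 2000, 4000, 6000, 8000)  # hypothetical band layout


def band_energies(waveform, sample_rate=SAMPLE_RATE, edges=BAND_EDGES_HZ):
    """Return the fraction of spectral energy falling in each frequency band."""
    windowed = waveform * np.hanning(len(waveform))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(waveform), d=1.0 / sample_rate)
    energies = np.array(
        [power[(freqs >= lo) & (freqs < hi)].sum()
         for lo, hi in zip(edges[:-1], edges[1:])]
    )
    return energies / energies.sum()


# Synthetic 0.16 s clip (2560 samples at 16 kHz): a pure 5 kHz tone.
num_samples = 2560
t = np.arange(num_samples) / SAMPLE_RATE
clip = np.sin(2 * np.pi * 5000 * t)

features = band_energies(clip)
print(features.round(3))  # energy concentrates in the 4000-6000 Hz band
```

Feeding features like these (or MFCC / log-Mel features) to a simple classifier gives a reference point: if even that beats 33%, the labels are learnable and the SSL pipeline is the problem.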

But the current setup combines several hard conditions:

issue	why it matters
median clip is ~0.16s	the useful fricative signal may only be a few feature frames
all clips padded/truncated to 1.0s	the pooled representation may be dominated by padding / near-silence behavior
no `attention_mask`	default sequence pooling may average over frames that are not really useful
`freeze_base_model()`	only the classification head adapts, not the SSL representation
default pooled sequence classifier	a very local acoustic contrast becomes one global vector
3-way subtle articulation labels	Normal / Lateral / Interdental may require fine spectral/phonetic cues
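To put a number on "a few feature frames", here is a small calculation of how many encoder frames a clip produces. It assumes 16 kHz audio and the default Wav2Vec2 convolutional feature-extractor layout in `transformers` (kernels 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2); a checkpoint with a different config would give different counts.

```python
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)  # default Wav2Vec2 conv kernel sizes
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)   # default Wav2Vec2 conv strides
SAMPLE_RATE = 16_000                   # assumed sample rate


def num_feature_frames(num_samples: int) -> int:
    """Number of frames left after the unpadded conv stack."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        length = (length - kernel) // stride + 1
    return length


clip_frames = num_feature_frames(round(0.16 * SAMPLE_RATE))   # median clip
padded_frames = num_feature_frames(round(1.0 * SAMPLE_RATE))  # padded input

print(f"0.16 s clip  -> {clip_frames} frames")    # 7
print(f"1.0 s padded -> {padded_frames} frames")  # 49
print(f"share of real frames: {clip_frames / padded_frames:.1%}")  # 14.3%
```

Under these assumptions a median clip contributes about 7 of 49 frames, so an unmasked mean over the whole sequence is mostly an average of padding.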

So I would treat the current result as a sign that the pipeline needs ablation, not as a final verdict on the task.

My best guess is that this is not one single bug. It is a stack of smaller issues pointing in the same direction.

So I would phrase the diagnosis like this:

This still looks worth pursuing, but I would not expect class weighting alone to fix it. First prove that the labels are learnable with a tiny subset and an acoustic baseline. Then change the SSL extraction path: less padding, better pooling, layer selection, and partial fine-tuning.
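One possible reading of "better pooling" is a masked mean that ignores padded frames. The sketch below uses NumPy and toy values: the array shapes, the 7-of-49 split, and the assumption that padded frames are exactly zero are all illustrative (real padded frames are generally not zero after the encoder). The `masked_mean_pool` helper is hypothetical, not a library function.

```python
import numpy as np


def masked_mean_pool(hidden_states, frame_mask):
    """Mean over frames, counting only frames where frame_mask is 1.

    hidden_states: (batch, frames, dim)
    frame_mask:    (batch, frames), 1 for real frames and 0 for padding
    """
    mask = frame_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts


# Toy batch: 7 real frames with value 1.0, then 42 padded frames with 0.0.
hidden = np.zeros((1, 49, 4))
hidden[:, :7, :] = 1.0
mask = np.zeros((1, 49))
mask[:, :7] = 1

plain = hidden.mean(axis=1)              # diluted by padding
masked = masked_mean_pool(hidden, mask)  # uses real frames only

print("plain mean: ", plain[0])   # each value is 7/49
print("masked mean:", masked[0])  # each value is 1.0
```

With a real model, the mask has to be at frame resolution, not sample resolution, so it needs to be downsampled to match the encoder output length before pooling.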

Wav2Vec2 docs / attention-mask note:

Wav2Vec2 · Hugging Face

WavLM sequence classification docs:

WavLM · Hugging Face

HF audio classification guide:

Audio classification · Hugging Face

Wav2Vec2 post-convolution attention-mask issue:

Return updated attention mask from Wav2Vec 2.0 · Issue #25307 · huggingface/transformers · GitHub

SUPERB benchmark / frozen SSL evaluation context:

https://sls.csail.mit.edu/publications/2021/JeffLai_Interspeech_2021.pdf

Sigmatism detection paper:

https://www.cstr.ed.ac.uk/downloads/publications/2012/Cassia_WOCCI12.pdf

Sibilant consonant classification with neural networks:

https://dspace.rcaap.pt/entities/publication/c0cc15ad-cf57-4a4c-bfd4-423d2e2894f5

Wav2Vec2 layer-wise / suprasegmental analysis:

[2408.13678] A layer-wise analysis of Mandarin and English suprasegmentals in SSL speech models

Wav2Vec2 phonetic / tonal / speaker information analysis:

[2506.10855] Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models

Accent identification fine-tuning and phoneme/prosody probing:

[2306.06524] What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model
