# Wav2vec2 / WavLM audio classifier stuck at chance (33%) — only training the head

> Source: <https://discuss.huggingface.co/t/wav2vec2-wavlm-audio-classifier-stuck-at-chance-33-only-training-the-head/177119#post_2>
> Published: 2026-06-24 08:49:11+00:00

Hmm… it looks like this may be one of those problems that is not unsolvable, but is deeper than it first appears.

Direct answers to your four suspected points:

**Linear probing / head-only training:** I would not expect head-only training to be reliable for this task. It is worth testing as a baseline, but for Normal vs Lateral vs Interdental /s,z/, I would try at least `freeze_feature_encoder()` only, or unfreezing the last few transformer layers.
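A minimal sketch of the "unfreeze the last few transformer layers" option. It assumes the parameter names follow the usual Transformers layout (`...encoder.layers.<i>...` for transformer blocks, `projector` / `classifier` for the head); `unfreeze_last_layers` is a hypothetical helper, not a library function, so check the names against `model.named_parameters()` for your checkpoint.

```python
import re

# Assumption: transformer blocks are named "...encoder.layers.<index>....",
# as in the Wav2Vec2 / WavLM sequence-classification models.
LAYER_RE = re.compile(r"encoder\.layers\.(\d+)\.")


def unfreeze_last_layers(model, num_layers_total, num_unfrozen,
                         head_prefixes=("projector", "classifier")):
    """Freeze everything except the head and the last `num_unfrozen` blocks.

    Returns the names of the parameters left trainable, for a sanity check.
    """
    first_trainable = num_layers_total - num_unfrozen
    trainable = []
    for name, param in model.named_parameters():
        match = LAYER_RE.search(name)
        in_last_layers = match is not None and int(match.group(1)) >= first_trainable
        in_head = name.startswith(head_prefixes)
        param.requires_grad = in_last_layers or in_head
        if param.requires_grad:
            trainable.append(name)
    return trainable


# Usage (not run here):
# names = unfreeze_last_layers(model, model.config.num_hidden_layers, num_unfrozen=2)
# print(len(names), names[:5])
```

Printing the returned names doubles as the trainable-parameter sanity check: if the list only contains `projector.*` and `classifier.*`, the encoder is still fully frozen.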

**Learning rate:** `1e-3` may be reasonable for a newly initialized head, but it is probably too high if you unfreeze the encoder. If you fine-tune encoder layers, I would use separate parameter groups: lower LR for the backbone, higher LR for the classifier head.
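A rough sketch of the separate parameter groups. The helper name `build_param_groups`, the head prefixes, and the backbone rate of `1e-5` are illustrative assumptions, not values from your setup; only the `1e-3` head rate comes from your post.

```python
def build_param_groups(model, backbone_lr=1e-5, head_lr=1e-3,
                       head_prefixes=("projector", "classifier")):
    """Split trainable parameters into head and backbone groups.

    The returned list uses the per-parameter-group format that PyTorch
    optimizers accept: a list of dicts with "params" and "lr" keys.
    """
    head, backbone = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue  # frozen parameters do not belong in the optimizer
        if name.startswith(head_prefixes):
            head.append(param)
        else:
            backbone.append(param)
    groups = [{"params": head, "lr": head_lr}]
    if backbone:
        groups.append({"params": backbone, "lr": backbone_lr})
    return groups


# Usage (not run here):
# optimizer = torch.optim.AdamW(build_param_groups(model), weight_decay=0.01)
```

If you train through `Trainer`, the optimizer can be supplied via its `optimizers` argument so that the single `learning_rate` setting does not override the groups.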

**Padding / attention mask:** yes, I would treat the current fixed 1.0s padding as a major suspect. The issue is not only “should I pass `attention_mask`?”, because Wav2Vec2 attention-mask behavior is checkpoint-dependent. The bigger issue is that the actual fricative may be a small fraction of the pooled sequence.
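To make the pooling concern concrete, here is a toy NumPy sketch comparing a plain mean over all frames with a mean over real frames only. The shapes (49 frames, 7 of them real) and the all-zero "padding" are illustrative assumptions; real padded frames are not exactly zero, so this exaggerates the effect. `masked_mean_pool` is a hypothetical helper, and the mask must be at feature-frame rate, not at sample rate.

```python
import numpy as np


def masked_mean_pool(hidden_states, frame_mask):
    """Mean over real frames only.

    hidden_states: (batch, frames, dim)
    frame_mask:    (batch, frames), 1 for real frames and 0 for padding
    """
    mask = frame_mask[..., None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1.0, None)  # avoid division by zero
    return summed / counts


# Toy clip: 49 frames of dim 4, only the first 7 frames carry "signal".
hidden = np.zeros((1, 49, 4))
hidden[:, :7] = 1.0
mask = np.zeros((1, 49))
mask[:, :7] = 1

plain = hidden.mean(axis=1)             # signal diluted to 7/49 per dimension
masked = masked_mean_pool(hidden, mask)  # signal preserved at 1.0
print(plain[0, 0], masked[0, 0])
```

The same idea works in PyTorch on `last_hidden_state`; the point of the ablation is to see whether accuracy moves when the pooled vector stops including padded frames.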

**Class imbalance:** I would handle it, but not first. Since the largest class is 580/1057, an always-Lateral baseline would be around 55%, so a ~33% result suggests something deeper than ordinary class imbalance.

So my first debugging order would be:

```
tiny-subset overfit
→ label / fold / trainable-parameter sanity checks
→ padding + pooling ablation
→ partial unfreezing / layer selection
→ acoustic baseline
→ then class weighting or sampling
```
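For the first step, a small sketch of how the tiny subset could be picked. `tiny_balanced_subset` and the choice of 8 clips per class are assumptions for illustration; the idea is just to get a few examples of every class and check that training accuracy on them can reach roughly 100%.

```python
import random
from collections import defaultdict


def tiny_balanced_subset(labels, per_class=8, seed=0):
    """Return sorted indices of a small, class-balanced subset."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    chosen = []
    for label in sorted(by_class):
        pool = by_class[label]
        chosen.extend(rng.sample(pool, min(per_class, len(pool))))
    return sorted(chosen)


# Usage (not run here), assuming a `datasets.Dataset` with a "label" column:
# subset = train_ds.select(tiny_balanced_subset(train_ds["label"]))
```

If the model cannot memorize 24 clips with regularization off, the problem is upstream of class imbalance: labels, preprocessing, or nothing trainable actually receiving gradients.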

Your class counts are:

| class | count |
|---|---|
| Lateral | 580 |
| Normal | 243 |
| Interdental | 234 |
| total | 1057 |

If the model truly predicted only the majority class on the whole distribution, the accuracy should be closer to 55%, not 33%. Fold-level distributions can change this, of course, but the shape still makes me suspect that something else is going on.

Class imbalance may still hurt macro-F1 and minority recall, but I would not make it the first explanation.
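The arithmetic behind those numbers, as a short sketch using the class counts above. The helper names are hypothetical, and the weighting formula (total / (num_classes × class count)) is one common inverse-frequency choice, not the only one. `majority_baseline` can also be run on each fold's labels to see how far the fold-level baselines drift.

```python
from collections import Counter


def majority_baseline(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def balanced_class_weights(labels):
    """Inverse-frequency weights: total / (num_classes * class_count)."""
    counts = Counter(labels)
    total, num_classes = len(labels), len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


labels = ["Lateral"] * 580 + ["Normal"] * 243 + ["Interdental"] * 234
print(f"majority baseline: {majority_baseline(labels):.3f}")    # 0.549
print(f"uniform chance:    {1 / len(set(labels)):.3f}")         # 0.333
weights = balanced_class_weights(labels)
print({label: round(w, 3) for label, w in weights.items()})
```

Once the earlier checks pass, weights like these can be passed (in label-id order) to `torch.nn.CrossEntropyLoss(weight=...)`.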

The task itself does not look impossible. Similar sibilant / sigmatism-style classification problems have been studied before, often with MFCCs, log-Mel features, spectral-band energy, CNNs, or other acoustic features. So I would not conclude from the 33% result that the labels are necessarily impossible or that Wav2Vec2/WavLM are useless here.
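As a starting point for an acoustic baseline, here is a minimal NumPy sketch of spectral-band energy features for one clip. The band edges, the 16 kHz sample rate, and the function name `sibilant_features` are assumptions for illustration, not taken from the cited work; a proper baseline would more likely use MFCC or log-Mel features from an audio library.

```python
import numpy as np


def sibilant_features(wave, sr=16000,
                      bands=((0, 2000), (2000, 4000), (4000, 6000), (6000, 8000))):
    """Spectral centroid (Hz) plus the fraction of energy in each band."""
    wave = np.asarray(wave, dtype=np.float64)
    windowed = wave * np.hanning(len(wave))      # reduce spectral leakage
    power = np.abs(np.fft.rfft(windowed)) ** 2
    freqs = np.fft.rfftfreq(len(wave), d=1.0 / sr)
    total = power.sum() + 1e-12                  # guard against silent clips
    centroid = (freqs * power).sum() / total
    band_ratios = [power[(freqs >= lo) & (freqs < hi)].sum() / total
                   for lo, hi in bands]
    return np.array([centroid, *band_ratios])


# Toy check on a 0.16 s synthetic tone instead of real speech.
t = np.arange(int(0.16 * 16000)) / 16000
print(sibilant_features(np.sin(2 * np.pi * 6500 * t)))
```

Feeding such per-clip vectors to a simple linear classifier with the same folds gives a reference point: if it beats chance clearly, the labels are learnable and the SSL pipeline is the thing to fix.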

But the current setup combines several hard conditions:

| issue | why it matters |
|---|---|
| median clip is ~0.16s | the useful fricative signal may only be a few feature frames |
| all clips padded/truncated to 1.0s | the pooled representation may be dominated by padding / near-silence behavior |
| no `attention_mask` | default sequence pooling may average over frames that are not really useful |
| `freeze_base_model()` | only the classification head adapts, not the SSL representation |
| default pooled sequence classifier | a very local acoustic contrast becomes one global vector |
| 3-way subtle articulation labels | Normal / Lateral / Interdental may require fine spectral/phonetic cues |
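To put rough numbers on the first two rows, this sketch computes how many feature frames the convolutional encoder produces for a given clip length. It assumes 16 kHz audio and the default `conv_kernel` / `conv_stride` values of the base Wav2Vec2 configuration; check `model.config` if your checkpoint differs.

```python
# Assumption: default base-model conv stack (kernel sizes and strides).
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)


def num_feature_frames(num_samples, kernels=CONV_KERNEL, strides=CONV_STRIDE):
    """Output length of a stack of unpadded 1-D convolutions."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        length = (length - kernel) // stride + 1
    return length


sr = 16000
clip_frames = num_feature_frames(int(0.16 * sr))   # median clip
padded_frames = num_feature_frames(int(1.0 * sr))  # padded to 1.0 s
print(clip_frames, padded_frames)                  # 7 49
print(f"{clip_frames / padded_frames:.0%} of pooled frames are real audio")
```

Under these assumptions, a median clip contributes about 7 of the 49 frames that get mean-pooled, which is why padding and pooling come before class weighting in the debugging order.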

So I would treat the current result as a sign that the **pipeline needs ablation**, not as a final verdict on the task.

My best guess is that this is not one single bug. It is a stack of smaller issues pointing in the same direction, so I would phrase the diagnosis like this:

This still looks worth pursuing, but I would not expect class weighting alone to fix it. First prove that the labels are learnable with a tiny subset and an acoustic baseline. Then change the SSL extraction path: less padding, better pooling, layer selection, and partial fine-tuning.

Wav2Vec2 docs / attention-mask note:

[Wav2Vec2 · Hugging Face](https://huggingface.co/docs/transformers/en/model_doc/wav2vec2)

WavLM sequence classification docs:

[WavLM · Hugging Face](https://huggingface.co/docs/transformers/model_doc/wavlm)

HF audio classification guide:

[Audio classification · Hugging Face](https://huggingface.co/docs/transformers/en/tasks/audio_classification)

Wav2Vec2 post-convolution attention-mask issue:

[Return updated attention mask from Wav2Vec 2.0 · Issue #25307 · huggingface/transformers · GitHub](https://github.com/huggingface/transformers/issues/25307)

SUPERB benchmark / frozen SSL evaluation context:

[https://sls.csail.mit.edu/publications/2021/JeffLai_Interspeech_2021.pdf](https://sls.csail.mit.edu/publications/2021/JeffLai_Interspeech_2021.pdf)

Sigmatism detection paper:

[https://www.cstr.ed.ac.uk/downloads/publications/2012/Cassia_WOCCI12.pdf](https://www.cstr.ed.ac.uk/downloads/publications/2012/Cassia_WOCCI12.pdf)

Sibilant consonant classification with neural networks:

[https://dspace.rcaap.pt/entities/publication/c0cc15ad-cf57-4a4c-bfd4-423d2e2894f5](https://dspace.rcaap.pt/entities/publication/c0cc15ad-cf57-4a4c-bfd4-423d2e2894f5)

Wav2Vec2 layer-wise / suprasegmental analysis:

[[2408.13678] A layer-wise analysis of Mandarin and English suprasegmentals in SSL speech models](https://arxiv.org/abs/2408.13678)

Wav2Vec2 phonetic / tonal / speaker information analysis:

[[2506.10855] Analyzing the relationships between pretraining language, phonetic, tonal, and speaker information in self-supervised speech models](https://arxiv.org/abs/2506.10855)

Accent identification fine-tuning and phoneme/prosody probing:

[[2306.06524] What Can an Accent Identifier Learn? Probing Phonetic and Prosodic Information in a Wav2vec2-based Accent Identification Model](https://arxiv.org/abs/2306.06524)
