Hmm… it looks like this may be one of those problems that is not unsolvable, but is deeper than it first appears:
Direct answers to your four suspected points:
Linear probing / head-only training: I would not expect head-only training to be reliable for this task. It is worth testing as a baseline, but for Normal vs Lateral vs Interdental /s,z/, I would at least try freezing only the feature encoder with `freeze_feature_encoder()`, or unfreezing the last few transformer layers.
Learning rate: `1e-3` may be reasonable for a newly initialized head, but it is probably too high if you unfreeze the encoder. If you fine-tune encoder layers, I would use separate parameter groups: lower LR for the backbone, higher LR for the classifier head.
Padding / attention mask: yes, I would treat the current fixed 1.0s padding as a major suspect. The issue is not only “should I pass `attention_mask`?”, because Wav2Vec2 attention-mask behavior is checkpoint-dependent. The bigger issue is that the actual fricative may be a small fraction of the pooled sequence.
Class imbalance: I would handle it, but not first. Since the largest class is 580/1057, an always-Lateral baseline would be around 55%, so a ~33% result suggests something deeper than ordinary class imbalance.
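For the learning-rate point above, here is a minimal sketch of separate parameter groups in PyTorch. `TinyClassifier` is a hypothetical stand-in so the snippet runs without downloading a checkpoint, and the backbone LR of `1e-5` is my assumption, not a tuned value; only the `1e-3` head LR comes from your setup.

```python
import torch
from torch import nn


class TinyClassifier(nn.Module):
    """Hypothetical stand-in: `backbone` plays the pretrained encoder,
    `classifier` plays the newly initialized head."""

    def __init__(self, hidden=16, num_labels=3):
        super().__init__()
        self.backbone = nn.Linear(hidden, hidden)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, x):
        return self.classifier(torch.relu(self.backbone(x)))


def build_optimizer(model, backbone_lr=1e-5, head_lr=1e-3):
    # Split trainable parameters by name prefix; frozen ones are skipped.
    backbone, head = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        (head if name.startswith("classifier") else backbone).append(param)
    groups = [
        {"params": backbone, "lr": backbone_lr},  # assumed value
        {"params": head, "lr": head_lr},          # value from the original setup
    ]
    # Drop a group if everything in it is frozen.
    groups = [g for g in groups if g["params"]]
    return torch.optim.AdamW(groups)


model = TinyClassifier()
optimizer = build_optimizer(model)
group_lrs = [g["lr"] for g in optimizer.param_groups]
print(group_lrs)
```

With a real model, I would print `model.named_parameters()` first and adapt the prefix test to whatever the head modules are actually called, rather than trusting the `"classifier"` prefix used here.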
So my first debugging order would be:
tiny-subset overfit
→ label / fold / trainable-parameter sanity checks
→ padding + pooling ablation
→ partial unfreezing / layer selection
→ acoustic baseline
→ then class weighting or sampling
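For the first step in that order, a tiny-subset overfit test needs a small, class-balanced slice of the data. This is a rough sketch of how I would pick one; `tiny_stratified_subset` is a hypothetical helper and `per_class=10` is an arbitrary choice. The label list is synthetic, built only from your class counts.

```python
import random
from collections import Counter, defaultdict


def tiny_stratified_subset(labels, per_class=10, seed=0):
    """Return sorted indices with at most `per_class` examples of each label."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    chosen = []
    for label in sorted(by_label):
        indices = by_label[label]
        chosen.extend(rng.sample(indices, min(per_class, len(indices))))
    return sorted(chosen)


# Synthetic label list with the same class counts as the dataset.
labels = ["Lateral"] * 580 + ["Normal"] * 243 + ["Interdental"] * 234
subset = tiny_stratified_subset(labels, per_class=10)
subset_counts = Counter(labels[i] for i in subset)
print(len(subset), dict(subset_counts))
```

If the model cannot drive training accuracy close to 100% on a slice like this, I would look at labels, preprocessing, or trainable parameters before anything else.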
Your class counts are:
| class | count |
|---|---|
| Lateral | 580 |
| Normal | 243 |
| Interdental | 234 |
| total | 1057 |
If the model truly predicted only the majority class on the whole distribution, the accuracy should be closer to 55%, not 33%. Fold-level distributions can change this, of course, but the shape still makes me suspicious of something else:
Class imbalance may still hurt macro-F1 and minority recall, but I would not make it the first explanation.
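The arithmetic behind that can be checked in a few lines. The sketch below uses only the class counts from the table; the balanced-weight formula `total / (num_classes * count)` is one common convention that I am assuming here, not something taken from your code.

```python
counts = {"Lateral": 580, "Normal": 243, "Interdental": 234}


def majority_baseline(counts):
    """Accuracy of always predicting the most frequent class."""
    total = sum(counts.values())
    label = max(counts, key=counts.get)
    return label, counts[label] / total


def balanced_class_weights(counts):
    """Inverse-frequency weights: total / (num_classes * class_count)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


label, acc = majority_baseline(counts)
weights = balanced_class_weights(counts)

print(f"majority class: {label}, accuracy: {acc:.3f}")  # Lateral, 0.549
print(f"uniform 3-class chance: {1 / len(counts):.3f}")  # 0.333
for name, weight in weights.items():
    print(f"{name}: weight {weight:.3f}")
```

A result near 0.333 sits at uniform three-class chance rather than at the 0.549 majority baseline, which is why I would look for a pipeline problem before reaching for these weights.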
The task itself does not look impossible. Similar sibilant / sigmatism-style classification problems have been studied before, often with MFCCs, log-Mel features, spectral-band energy, CNNs, or other acoustic features. So I would not conclude from the 33% result that the labels are necessarily impossible or that Wav2Vec2/WavLM are useless here.
But the current setup combines several hard conditions:
| issue | why it matters |
|---|---|
| median clip is ~0.16s | the useful fricative signal may only be a few feature frames |
| all clips padded/truncated to 1.0s | the pooled representation may be dominated by padding / near-silence behavior |
| no `attention_mask` | default sequence pooling may average over frames that are not really useful |
| `freeze_base_model()` | only the classification head adapts, not the SSL representation |
| default pooled sequence classifier | a very local acoustic contrast becomes one global vector |
| 3-way subtle articulation labels | Normal / Lateral / Interdental may require fine spectral/phonetic cues |
So I would treat the current result as a sign that the pipeline needs ablation, not as a final verdict on the task.
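To make the "few feature frames" row concrete, here is a small calculation of how many encoder frames a clip produces. It assumes the default Wav2Vec2-base convolutional geometry (kernels and strides below) and 16 kHz audio; your checkpoint and sample rate may differ, so treat the numbers as illustrative.

```python
# Assumed: default Wav2Vec2-base feature-encoder kernels and strides.
CONV_KERNEL = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDE = (5, 2, 2, 2, 2, 2, 2)
SAMPLE_RATE = 16_000  # assumption: audio is resampled to 16 kHz


def num_feature_frames(num_samples):
    """Apply the standard conv output-length formula layer by layer."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNEL, CONV_STRIDE):
        length = (length - kernel) // stride + 1
    return length


clip_frames = num_feature_frames(round(0.16 * SAMPLE_RATE))   # median clip
padded_frames = num_feature_frames(round(1.0 * SAMPLE_RATE))  # padded input

print(clip_frames, padded_frames)  # 7 49
print(f"share of frames from real audio: {clip_frames / padded_frames:.0%}")
```

Under those assumptions, a median clip contributes about 7 of 49 frames, so an unmasked mean over the sequence is mostly computed from padding.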
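My best guess is that this is not one single bug. It is a stack of smaller issues pointing in the same direction: very short clips, heavy padding, pooling over frames that carry no fricative, and a frozen backbone.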
So I would phrase the diagnosis like this:
This still looks worth pursuing, but I would not expect class weighting alone to fix it. First prove that the labels are learnable with a tiny subset and an acoustic baseline. Then change the SSL extraction path: less padding, better pooling, layer selection, and partial fine-tuning.
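As one possible shape for the "better pooling" step, here is a sketch of mean pooling restricted to valid frames. It is written with NumPy so it runs without a model; `masked_mean_pool` is a hypothetical helper, and the toy tensor (ones for real frames, zeros for padding) is only there to show the dilution effect, since real padded frames are not exactly zero.

```python
import numpy as np


def masked_mean_pool(hidden, valid_frames):
    """Mean over the first `valid_frames[i]` frames of each sequence.

    hidden: array of shape (batch, frames, dim)
    valid_frames: per-example count of frames that cover real audio
    """
    _, frames, _ = hidden.shape
    valid = np.asarray(valid_frames)[:, None]        # (batch, 1)
    mask = np.arange(frames)[None, :] < valid        # (batch, frames)
    mask = mask[:, :, None].astype(hidden.dtype)     # (batch, frames, 1)
    summed = (hidden * mask).sum(axis=1)             # (batch, dim)
    counts = np.maximum(mask.sum(axis=1), 1.0)       # avoid division by zero
    return summed / counts


# Toy example: 7 informative frames followed by 42 padding frames.
hidden = np.zeros((1, 49, 4), dtype=np.float32)
hidden[0, :7] = 1.0

plain = hidden.mean(axis=1)             # unmasked mean over all 49 frames
masked = masked_mean_pool(hidden, [7])  # mean over the 7 valid frames only

print(plain[0, 0], masked[0, 0])
```

With a real model, the valid-frame counts have to be expressed in post-convolution frames, not raw samples, which is the point of the attention-mask issue linked below.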
Wav2Vec2 docs / attention-mask note:
WavLM sequence classification docs:
HF audio classification guide: Audio classification · Hugging Face
Wav2Vec2 post-convolution attention-mask issue: Return updated attention mask from Wav2Vec 2.0 · Issue #25307 · huggingface/transformers · GitHub
SUPERB benchmark / frozen SSL evaluation context: https://sls.csail.mit.edu/publications/2021/JeffLai_Interspeech_2021.pdf
Sigmatism detection paper: https://www.cstr.ed.ac.uk/downloads/publications/2012/Cassia_WOCCI12.pdf
Sibilant consonant classification with neural networks: https://dspace.rcaap.pt/entities/publication/c0cc15ad-cf57-4a4c-bfd4-423d2e2894f5
Wav2Vec2 layer-wise / suprasegmental analysis: [2408.13678] A layer-wise analysis of Mandarin and English suprasegmentals in SSL speech models
Wav2Vec2 phonetic / tonal / speaker information analysis:
Accent identification fine-tuning and phoneme/prosody probing: