Phonological Perception of Sign Language Models

Researchers evaluated the phonological perception of deep learning models for sign language recognition. Pose-based models proved sensitive to handshape contrasts, while pixel-based models better captured location changes. Pose-based models also learned representations that correlate with human perceptual similarity judgments (r~0.49). The study finds emergent phonological sensitivity in these models, but with architectural trade-offs, and suggests that current training paradigms are insufficient to scale the models beyond their architectural inductive biases.

arXiv:2606.28667v1 Announce Type: new

Abstract: Sign languages are compositional systems where meaning arises by combining sublexical phonological parameters, such as handshape, location, and movement. While deep learning models for Sign Language Recognition (SLR) have achieved increased performance on translation benchmarks, it remains unclear whether these models distinguish abstract phonological features or merely rely on low-level statistical correlations. This work evaluates the phonological perception of SLR models trained on American Sign Language (ASL) by probing phonological sensitivity using minimal pairs and evaluating representational alignment with human behavioral data. Our results reveal that SLR models exhibit emergent phonological sensitivity, but with clear architectural trade-offs: pose-based models are sensitive to handshape contrasts, while pixel-based models better capture location changes. Furthermore, pose-based models learn latent representations that correlate with human perceptual similarity judgments (r~0.49). These findings suggest that while SLR models exhibit emergent phonology, current training paradigms are insufficient to scale them beyond their architectural inductive biases.
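The abstract does not say how representational alignment with human judgments was computed. As a rough illustration only, one common way to obtain a correlation of this kind is to score each sign pair by the cosine similarity of the model's embeddings and then take the Pearson correlation with human similarity ratings for the same pairs. The sketch below assumes that setup; the function names, the toy embeddings, and the ratings are all hypothetical and are not taken from the paper.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


def alignment_with_humans(embeddings, pairs, human_scores):
    """Correlate model pair similarities with human similarity ratings.

    embeddings:   dict mapping a sign label to its embedding vector
    pairs:        list of (label_a, label_b) tuples
    human_scores: one human similarity rating per pair, in the same order
    """
    model_scores = [
        cosine_similarity(embeddings[a], embeddings[b]) for a, b in pairs
    ]
    return pearson_r(model_scores, human_scores)


# Toy, made-up data: three signs with 2-D embeddings (assumption, for
# illustration only; real SLR embeddings are much higher-dimensional).
embeddings = {
    "SIGN_A": [1.0, 0.0],
    "SIGN_B": [0.9, 0.1],
    "SIGN_C": [0.0, 1.0],
}
pairs = [("SIGN_A", "SIGN_B"), ("SIGN_A", "SIGN_C"), ("SIGN_B", "SIGN_C")]
human_scores = [0.9, 0.1, 0.2]  # hypothetical ratings, higher = more similar

r = alignment_with_humans(embeddings, pairs, human_scores)
print(f"model-human alignment r = {r:.2f}")
```

The same pair-scoring function could in principle be reused for a minimal-pair probe: if two signs differ in only one parameter (say, handshape), a model that is sensitive to that parameter should assign the pair a lower similarity than it assigns to two productions of the same sign.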