Deep Temporal Modeling and Ensemble Fusion for Multimodal Emotion Recognition from Physiological Signals

Researchers achieved 98.91% accuracy in multimodal emotion recognition from physiological signals on the WESAD dataset by combining LSTM, TCN, and Transformer models in a late-fusion ensemble. Among individual models, Transformers were the most accurate in multimodal settings, while TCNs performed best on wrist-only data, demonstrating the effectiveness of sensor fusion and ensemble strategies.

arXiv:2606.15026v1

Abstract: Recognizing stress and emotion from physiological signals is important for health monitoring and affective computing. In this work, we present a comprehensive evaluation of three deep learning architectures, Long Short-Term Memory (LSTM), Temporal Convolutional Networks (TCN), and Transformer, on the WESAD dataset for multimodal affect recognition using wrist and chest sensor signals. We perform ablation studies to assess the individual contribution of each modality by training models on wrist-only and chest-only inputs. For the multimodal setting, we employ early fusion at the sensor level by concatenating wrist and chest signals before feeding them into each model. In addition, we implement a late-fusion ensemble strategy that combines the predictions of all three architectures trained on multimodal input. Our results show that Transformer models consistently achieve the highest accuracy in multimodal settings, while TCN models perform best in the wrist-only configuration. The ensemble method yields the highest overall accuracy (98.91 ± 0.13%) and macro-F1 score (98.56 ± 0.17%). These findings demonstrate the effectiveness of sensor-level fusion and ensemble-based fusion in developing robust systems for physiological emotion recognition.
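The two fusion steps named in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not say how the wrist and chest streams are aligned or how the three models' predictions are combined, so everything here is an assumption: the function names `early_fusion` and `late_fusion` are hypothetical, the signals are assumed to be already windowed and resampled to a common length, and the ensemble is assumed to average class probabilities (optionally with weights).

```python
import numpy as np


def early_fusion(wrist, chest):
    """Sensor-level fusion: concatenate wrist and chest windows along the channel axis.

    wrist: array of shape (n_windows, n_timesteps, n_wrist_channels)
    chest: array of shape (n_windows, n_timesteps, n_chest_channels)
    Assumption: both streams were already resampled to the same window length.
    """
    if wrist.shape[:2] != chest.shape[:2]:
        raise ValueError("wrist and chest must share window count and window length")
    return np.concatenate([wrist, chest], axis=-1)


def late_fusion(prob_list, weights=None):
    """Decision-level fusion: weighted average of per-model class probabilities.

    prob_list: list of arrays, each of shape (n_windows, n_classes),
               e.g. softmax outputs of an LSTM, a TCN, and a Transformer.
    weights:   optional per-model weights; uniform if omitted (an assumption).
    Returns (predicted_labels, fused_probabilities).
    """
    probs = np.stack(prob_list, axis=0)  # (n_models, n_windows, n_classes)
    if weights is None:
        weights = np.ones(len(prob_list))
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so fused rows still sum to 1
    fused = np.tensordot(weights, probs, axes=1)  # (n_windows, n_classes)
    return fused.argmax(axis=1), fused


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Illustrative shapes only; channel counts are placeholders.
    wrist = rng.normal(size=(4, 60, 6))
    chest = rng.normal(size=(4, 60, 8))
    print(early_fusion(wrist, chest).shape)

    # Stand-ins for the softmax outputs of three trained models.
    fake_probs = [rng.dirichlet(np.ones(3), size=4) for _ in range(3)]
    labels, fused = late_fusion(fake_probs)
    print(labels, fused.sum(axis=1))
```

Averaging probabilities is only one possible late-fusion rule; majority voting or a learned meta-classifier would fit the same description, and the paper may use a different scheme.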