Archives of Acoustics, 49, 4, pp. 471–481, 2024
DOI: 10.24425/aoa.2024.148805

Fine-Grained Recognition of Fidgety-Type Emotions Using Multi-Scale One-Dimensional Residual Siamese Network

Jiu SUN
School of Information Technology, Yancheng Institute of Technology
China

Junxin ZHU
School of Information Technology, Yancheng Institute of Technology
China

Jun SHAO
School of Information Technology, Yancheng Institute of Technology
China

Fidgety speech emotion has important research value, and in recent years many deep learning models have proved effective for feature modeling. This paper studies practical speech emotion recognition and improves the recognition of fidgety-type emotion with a novel neural network model. First, we construct a large set of phonological features for modeling emotions. Second, we study how fidgety speech differs between various groups of speakers and, through the distribution of these features, examine the individual characteristics of fidgety emotion. Third, we propose a fine-grained emotion classification method that analyzes the subtle differences between emotional categories using Siamese neural networks. Within the network architecture we use multi-scale residual blocks, which alleviate the vanishing gradient problem and allow the network to learn more meaningful representations of the fidgety speech signal. Finally, experimental results show that the proposed method provides versatile modeling and identifies fidgety emotion well, giving it great value in practical applications.
Keywords: residual convolutional neural network; multi-scale neural network; fidgety speech emotion; fine-grained emotion classification; Siamese neural networks
Copyright © 2024 The Author(s). This work is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
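The abstract names the two architectural ideas but the implementation details appear only in the full text, so the following PyTorch sketch is an interpretation rather than the authors' code: the layer widths, kernel scales (3, 5, 7), input shape, and the contrastive pairing loss are all illustrative assumptions. It shows a one-dimensional residual block that convolves at several kernel scales in parallel, and a weight-sharing Siamese encoder whose embedding distance compares the emotions of two utterances.

```python
# Hypothetical sketch of a multi-scale 1-D residual Siamese network.
# Layer widths, kernel scales, and input shape are assumptions chosen
# for illustration; the paper's actual configuration is in the full text.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleResBlock1d(nn.Module):
    """Parallel 1-D convolutions at several kernel scales, merged and
    added to an identity skip connection."""

    def __init__(self, channels: int, scales=(3, 5, 7)):
        super().__init__()
        # Odd kernel sizes with padding k//2 keep the sequence length fixed.
        self.branches = nn.ModuleList(
            nn.Conv1d(channels, channels, k, padding=k // 2) for k in scales
        )
        self.merge = nn.Conv1d(channels * len(scales), channels, 1)
        self.norm = nn.BatchNorm1d(channels)

    def forward(self, x):
        multi = torch.cat([b(x) for b in self.branches], dim=1)
        return F.relu(self.norm(self.merge(multi)) + x)  # residual add


class SiameseEmotionNet(nn.Module):
    """Twin encoders with shared weights; the distance between the two
    embeddings measures how similar two utterances' emotions are."""

    def __init__(self, in_channels: int = 1, width: int = 32, depth: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(in_channels, width, 7, padding=3),
            *[MultiScaleResBlock1d(width) for _ in range(depth)],
            nn.AdaptiveAvgPool1d(1),  # pool over time -> fixed-size embedding
            nn.Flatten(),
        )

    def forward(self, a, b):
        # a, b: (batch, channels, time) feature sequences for a pair of clips.
        return F.pairwise_distance(self.encoder(a), self.encoder(b))


# Contrastive training step on a hypothetical pair of utterances;
# label 0 = same emotion category, 1 = different. Random tensors stand in
# for real frame-level speech features and only exercise the shapes.
if __name__ == "__main__":
    net = SiameseEmotionNet()
    a, b = torch.randn(4, 1, 16000), torch.randn(4, 1, 16000)
    label = torch.tensor([0.0, 1.0, 0.0, 1.0])
    dist = net(a, b)
    margin = 1.0
    loss = ((1 - label) * dist.pow(2)
            + label * F.relu(margin - dist).pow(2)).mean()
    loss.backward()
    print(loss.item())
```

The identity skip in the block is what alleviates vanishing gradients: the gradient of the residual addition flows back to earlier layers unattenuated, so stacking blocks to the depth needed for fine-grained distinctions remains trainable.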
