Analysis of Decision Fusion in Speech Detection

Authors

  • Tomasz MAKA Faculty of Computer Science and Information Technology, West Pomeranian University of Technology in Szczecin, Poland
  • Lukasz SMIETANKA Faculty of Computer Science and Information Technology, West Pomeranian University of Technology in Szczecin, Poland

Abstract

This article addresses the problem of detecting speech segments in an acoustic signal and analyzes potential decision fusion for a group of voice activity detectors (VADs). We designed ten new VADs using three different neural network architectures and three time-frequency signal representations. One of the proposed models achieves higher classification performance than competing solutions. We then used our VAD models to analyze decision fusion and improve the final classification decision, employing gradient-free and gradient-based optimizers with different objective functions. The analysis revealed the impact of individual classifiers on the final decision and the potential gains or losses resulting from VAD fusion. Compared with existing models, the proposed models achieved higher classification accuracy at the cost of increased memory requirements; the final choice of a specific model therefore depends on the constraints of the platform on which the VAD system is to be deployed.
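
As a concrete illustration of the fusion step described in the abstract, the sketch below implements weighted soft voting over per-frame VAD speech probabilities and tunes the weights with a simple gradient-free random search against the F1 score. This is a minimal sketch, not the authors' implementation: the paper's actual optimizers (e.g., the Gradient-Free-Optimizers library and Adam) and objective functions differ, and all names and the toy data here are hypothetical.

```python
import numpy as np

def fuse(probs, weights, threshold=0.5):
    """Weighted soft vote: convex combination of per-frame speech probabilities."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                       # normalize onto the probability simplex
    return (w @ probs >= threshold).astype(int)

def f1_score(pred, ref):
    """F1 measure of the speech class (van Rijsbergen, 1979)."""
    tp = np.sum((pred == 1) & (ref == 1))
    fp = np.sum((pred == 1) & (ref == 0))
    fn = np.sum((pred == 0) & (ref == 1))
    return 2.0 * tp / (2.0 * tp + fp + fn + 1e-12)

def tune_weights(probs, ref, n_iter=2000, seed=0):
    """Gradient-free weight search: sample the simplex, keep the best F1."""
    rng = np.random.default_rng(seed)
    n = probs.shape[0]
    best_w, best_f1 = np.full(n, 1.0 / n), -1.0
    for _ in range(n_iter):
        w = rng.dirichlet(np.ones(n))     # random candidate weight vector
        score = f1_score(fuse(probs, w), ref)
        if score > best_f1:
            best_w, best_f1 = w, score
    return best_w, best_f1

# Toy demo: three hypothetical VADs with different noise levels on 500 frames.
rng = np.random.default_rng(1)
ref = (rng.random(500) < 0.5).astype(int)                  # ground-truth speech mask
noise = rng.normal(0.0, [[0.2], [0.4], [0.8]], (3, 500))   # per-detector noise
probs = np.clip(ref + noise, 0.0, 1.0)                     # simulated VAD outputs
w, f1 = tune_weights(probs, ref)
print(f"fusion weights = {np.round(w, 3)}, F1 = {f1:.3f}")
```

In practice each row of `probs` would come from one trained VAD evaluated on a validation set; the tuned weights expose how much each detector contributes to the fused decision, which is the kind of per-classifier impact the analysis examines.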

Keywords:

voice activity detection (VAD), deep neural networks, data fusion

References


  1. Aloradi A., Elminshawi M., Chetupalli S.R., Habets E.A.P. (2023), Target-speaker voice activity detection in multi-talker scenarios: An empirical study, [in:] Speech Communication – 15th ITG Conference, pp. 250–254, https://doi.org/10.30420/456164049

  2. Blanke S. (2020), Gradient-Free-Optimizers: Simple and reliable optimization with local, global, population-based and sequential techniques in numerical search spaces, https://github.com/SimonBlanke/Gradient-Free-Optimizers (access: 16.06.2024).

  3. Dosovitskiy A. et al. (2021), An image is worth 16x16 words: Transformers for image recognition at scale, [in:] International Conference on Learning Representations (ICLR), https://arxiv.org/abs/2010.11929

  4. He K., Zhang X., Ren S., Sun J. (2016), Deep residual learning for image recognition, [in:] 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, https://doi.org/10.1109/CVPR.2016.90

  5. Kim T., Chang J., Ko J.H. (2022), ADA-VAD: Unpaired adversarial domain adaptation for noise-robust voice activity detection, [in:] ICASSP 2022 – 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7327–7331, https://doi.org/10.1109/icassp43922.2022.9746755

  6. Kingma D.P., Ba J. (2015), Adam: A method for stochastic optimization, [in:] International Conference on Learning Representations (ICLR), https://arxiv.org/abs/1412.6980

  7. Kittler J., Hatef M., Duin R.P.W., Matas J. (1998), On combining classifiers, IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3): 226–239, https://doi.org/10.1109/34.667881

  8. Lavechin M. et al. (2023), Brouhaha: Multi-task training for voice activity detection, speech-to-noise ratio, and C50 room acoustics estimation, https://doi.org/10.48550/arXiv.2210.13248

  9. Ma C., Dai G., Zhou J. (2022), Short-term traffic flow prediction for urban road sections based on time series analysis and LSTM_BILSTM method, IEEE Transactions on Intelligent Transportation Systems, 23(6): 5615–5624, https://doi.org/10.1109/tits.2021.3055258

  10. McFee B. et al. (2015), librosa: Audio and music signal analysis in Python, [in:] Proceedings of the 14th Python in Science Conference, https://doi.org/10.25080/Majora-7b98e3ed-003

  11. Peinado A.M., Segura J.C. (2006), Speech Recognition Over Digital Channels: Robustness and Standards, Wiley, https://doi.org/10.1002/0470024720

  12. Rabiner L.R., Schafer R.W. (2010), Theory and Applications of Digital Speech Processing, Pearson.

  13. van Rijsbergen C.J. (1979), Information Retrieval, 2nd ed., Butterworth-Heinemann.

  14. Rokach L. (2005), Ensemble methods for classifiers, [in:] Data Mining and Knowledge Discovery Handbook, Maimon O., Rokach L. [Eds], pp. 957–980, Springer, https://doi.org/10.1007/0-387-25465-x_45

  15. Schörkhuber C., Klapuri A. (2010), Constant-Q transform toolbox for music processing, [in:] 7th Sound and Music Computing Conference (SMC2010), https://doi.org/10.5281/zenodo.849741

  16. Smietanka L., Maka T. (2023), Augmented transformer for speech detection in adverse acoustical conditions, [in:] 2023 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA), pp. 14–18, https://doi.org/10.23919/spa59660.2023.10274438

  17. Song S., Desplanques B., Demuynck K., Madhu N. (2022), SoftVAD in iVector-based acoustic scene classification for robustness to foreground speech, [in:] 2022 30th European Signal Processing Conference (EUSIPCO), pp. 404–408, https://doi.org/10.23919/eusipco55093.2022.9909938

  18. Svirsky J., Lindenbaum O. (2023), SG-VAD: Stochastic gates based speech activity detection, [in:] ICASSP 2023 – 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), https://doi.org/10.1109/icassp49357.2023.10096938

  19. Silero Team (2024), Silero VAD: Pre-trained enterprise-grade voice activity detector (VAD), number detector and language classifier, https://github.com/snakers4/silero-vad

  20. Wang R., Moazzen I., Zhu W.-P. (2022), A computation-efficient neural network for VAD using multi-channel feature, [in:] 2022 30th European Signal Processing Conference (EUSIPCO), pp. 170–174, https://doi.org/10.23919/eusipco55093.2022.9909914

  21. Yadav S., Legaspi P.A.D., Alink M.S.O., Kokkeler A.B.J., Nauta B. (2023), Hardware implementations for voice activity detection: Trends, challenges and outlook, IEEE Transactions on Circuits and Systems I: Regular Papers, 70(3): 1083–1096, https://doi.org/10.1109/tcsi.2022.3225717

  22. Yang Q., Liu Q., Li N., Ge M., Song Z., Li H. (2024), SVAD: A robust, low-power, and light-weight voice activity detection with spiking neural networks, [in:] ICASSP 2024 – 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 221–225, https://doi.org/10.1109/icassp48485.2024.10446945

  23. Zhang Y., Zou H., Zhu J. (2023), VSANet: Real-time speech enhancement based on voice activity detection and causal spatial attention, [in:] 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 1–8, https://doi.org/10.1109/asru57964.2023.10389633

  24. Zhao Y., Champagne B. (2022), An efficient transformer-based model for voice activity detection, [in:] 2022 IEEE 32nd International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6, https://doi.org/10.1109/mlsp55214.2022.9943501