Archives of Acoustics, 44, 1, pp. 3–12, 2019
DOI: 10.24425/aoa.2019.126347

Deep Neural Network for Supervised Single-Channel Speech Enhancement

Nasir SALEEM
Gomal University
Pakistan

Muhammad IRFAN KHATTAK
University of Engineering and Technology
Pakistan

Muhammad Yousaf ALI
Gomal University
Pakistan

Muhammad SHAFI
Air University
Pakistan

Speech enhancement is fundamental to many real-time speech applications, and it is particularly challenging in the single-channel case, where only one data channel is available. In this paper we propose a supervised single-channel speech enhancement algorithm that combines a deep neural network (DNN) with less aggressive Wiener filtering applied as an additional network layer. During training, the network learns to predict the magnitude spectra of the clean and noise signals from the acoustic features of the noisy input speech. Relative spectral transform-perceptual linear prediction (RASTA-PLP) is used to extract the acoustic features at the frame level, and an autoregressive moving average (ARMA) filter is applied to smooth their temporal trajectories. The trained network predicts the coefficients of a ratio mask by minimizing a mean square error (MSE) cost function. The less aggressive Wiener filter is placed as an additional layer on top of the DNN to produce the enhanced magnitude spectrum, and the noisy speech phase is then used to reconstruct the enhanced speech signal. Experimental results demonstrate that the proposed DNN framework with less aggressive Wiener filtering outperforms competing speech enhancement methods in terms of both speech quality and intelligibility.
Keywords: deep neural network; intelligibility; speech enhancement; speech quality; supervised learning; Wiener filtering
Copyright © The Author(s). This is an open-access article distributed under the terms of the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0).
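
What follows is a minimal NumPy sketch of the processing chain the abstract describes, not the authors' implementation. The RASTA-PLP extractor and the trained DNN are not reproduced: predict_mask stands in for the trained network, the ARMA smoother assumes an MVA-style filter of unspecified order, and the "less aggressive" Wiener behavior is modeled here by raising a Wiener-type gain to an exponent beta < 1, which is an assumption rather than the paper's exact rule.

import numpy as np


def stft(x, frame_len=512, hop=256):
    """Hann-windowed frames -> one-sided FFT per frame, shape (frames, bins)."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * win
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)


def istft(X, frame_len=512, hop=256):
    """Weighted overlap-add inverse of stft()."""
    win = np.hanning(frame_len)
    frames = np.fft.irfft(X, n=frame_len, axis=1) * win
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    norm = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + frame_len] += f
        norm[i * hop:i * hop + frame_len] += win ** 2
    return out / np.maximum(norm, 1e-8)


def arma_smooth(feats, m=2):
    """MVA-style ARMA smoothing of feature trajectories along time (axis 0);
    the filter order m is an assumed value, not taken from the paper."""
    out = feats.astype(float).copy()
    for t in range(m, len(feats) - m):
        out[t] = (out[t - m:t].sum(axis=0)
                  + feats[t:t + m + 1].sum(axis=0)) / (2 * m + 1)
    return out


def enhance(noisy, predict_mask, beta=0.5):
    """Mask the noisy magnitude spectrum and reuse the noisy phase."""
    X = stft(noisy)
    mag, phase = np.abs(X), np.angle(X)
    mask = predict_mask(mag)  # placeholder for the trained DNN
    # Wiener-type gain; beta < 1 attenuates less aggressively
    # (assumed form -- the paper's exact rule may differ).
    gain = np.clip(mask, 0.0, 1.0) ** beta
    return istft(gain * mag * np.exp(1j * phase))


if __name__ == "__main__":
    # Toy run with white noise and a placeholder mask predictor; a real
    # system would feed ARMA-smoothed RASTA-PLP features to the trained DNN.
    rng = np.random.default_rng(0)
    noisy = rng.standard_normal(16000)
    dummy_mask = lambda mag: mag / (mag + mag.mean())
    enhanced = enhance(noisy, dummy_mask)
    print(enhanced.shape)

The sketch keeps the structure of the abstract: features drive a mask predictor, a softened Wiener-type gain shapes the noisy magnitude spectrum, and synthesis reuses the unmodified noisy phase.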
