Deep Neural Network for Supervised Single-Channel Speech Enhancement

Nasir SALEEM; Muhammad IRFAN KHATTAK; Muhammad Yousaf ALI; Muhammad SHAFI

doi:10.24425/aoa.2019.126347

Authors

Nasir SALEEM, Gomal University, Pakistan
Muhammad IRFAN KHATTAK, University of Engineering and Technology, Pakistan
Muhammad Yousaf ALI, Gomal University, Pakistan
Muhammad SHAFI, Air University, Pakistan

Abstract

Speech enhancement is fundamental to many real-time speech applications, and it is especially challenging in the single-channel case, where only one data channel is available. In this paper we propose a supervised single-channel speech enhancement algorithm based on a deep neural network (DNN) with less aggressive Wiener filtering as an additional DNN layer. During the training stage, the network learns to predict the magnitude spectra of the clean and noise signals from acoustic features of the noisy input speech. Relative spectral transform-perceptual linear prediction (RASTA-PLP) is used to extract the acoustic features at the frame level, and an autoregressive moving average (ARMA) filter is applied to smooth the temporal trajectories of the extracted features. The trained network predicts the coefficients used to construct a ratio mask, with mean square error (MSE) as the objective cost function. The less aggressive Wiener filter is placed as an additional layer on top of the DNN to produce an enhanced magnitude spectrum. Finally, the phase of the noisy speech is used to reconstruct the enhanced speech. The experimental results demonstrate that the proposed DNN framework with less aggressive Wiener filtering outperforms the competing speech enhancement methods in terms of speech quality and intelligibility.
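
The processing steps named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and it rests on several assumptions. The ARMA smoother uses a common form from the feature-extraction literature, with a hypothetical order of 2. The mask is the standard Wiener gain computed from the predicted clean and noise magnitudes. The `exponent` parameter is only one hypothetical way to make the gain "less aggressive", since the abstract does not give the paper's formulation. The function names are illustrative, and the DNN itself is omitted: its predicted magnitudes are taken as inputs.

```python
import numpy as np


def arma_smooth(features, order=2):
    """Smooth each feature trajectory over time with an ARMA filter.

    features: array of shape (num_frames, num_features).
    Assumed form: the average of `order` already-smoothed past frames,
    the current raw frame, and `order` raw future frames.
    Frames too close to either edge for a full window are left unchanged.
    """
    features = np.asarray(features, dtype=float)
    smoothed = features.copy()
    num_frames = features.shape[0]
    for t in range(order, num_frames - order):
        past = smoothed[t - order:t].sum(axis=0)        # already-smoothed frames
        future = features[t:t + order + 1].sum(axis=0)  # current and upcoming raw frames
        smoothed[t] = (past + future) / (2 * order + 1)
    return smoothed


def wiener_like_gain(clean_mag, noise_mag, exponent=1.0, eps=1e-8):
    """Ratio mask built from predicted clean and noise magnitudes.

    exponent=1.0 gives the standard Wiener gain. Because the gain lies in
    [0, 1], an exponent below 1 attenuates less (an assumption made here
    to illustrate "less aggressive" filtering).
    """
    clean_pow = np.square(clean_mag)
    noise_pow = np.square(noise_mag)
    gain = clean_pow / (clean_pow + noise_pow + eps)
    return gain ** exponent


def enhance_spectrum(noisy_stft, clean_mag, noise_mag, exponent=0.5):
    """Apply the gain to the noisy magnitude and reuse the noisy phase."""
    gain = wiener_like_gain(clean_mag, noise_mag, exponent)
    enhanced_mag = gain * np.abs(noisy_stft)
    return enhanced_mag * np.exp(1j * np.angle(noisy_stft))
```

In use, `noisy_stft` would be the complex short-time Fourier transform of the noisy signal, and `clean_mag` and `noise_mag` would be the network outputs with the same shape. An inverse STFT of the returned array would then give the enhanced waveform.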

Keywords:

deep neural network, intelligibility, speech enhancement, speech quality, supervised learning, Wiener filtering

