10.24425/aoa.2019.126347
Deep Neural Network for Supervised Single-Channel Speech Enhancement
References
Boll S. (1979), Suppression of acoustic noise in speech using spectral subtraction, IEEE Transactions on Acoustics, Speech, and Signal Processing, 27, 2, 113–120.
Chatlani N., Soraghan J.J. (2012), EMD-based filtering (EMDF) of low-frequency noise for speech enhancement, IEEE Transactions on Audio, Speech, and Language Processing, 20, 4, 1158–1166.
Chen F., Loizou P.C. (2010), Speech enhancement using a frequency-specific composite Wiener function, 2010 IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), 4726–4729.
Doire C.S., Brookes M., Naylor P.A., Hicks C.M., Betts D., Dmour M.A., Jensen S.H. (2017), Single-channel online enhancement of speech corrupted by reverberation and noise, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25, 3, 572–587.
Duchi J., Hazan E., Singer Y. (2011), Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research, 12, 2121–2159.
Ephraim Y., Malah D. (1984), Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator, IEEE Transactions on Acoustics, Speech, and Signal Processing, 32, 6, 1109–1121.
Ephraim Y., Malah D. (1985), Speech enhancement using a minimum mean-square error log-spectral amplitude estimator, IEEE Transactions on Acoustics, Speech, and Signal Processing, 33, 2, 443–445.
Gerkmann T., Hendriks R.C. (2012), Unbiased MMSE-based noise power estimation with low complexity and low tracking delay, IEEE Transactions on Audio, Speech, and Language Processing, 20, 4, 1383–1393.
Hershey J.R., Rennie S.J., Olsen P.A., Kristjansson T.T. (2010), Super-human multi-talker speech recognition: A graphical modeling approach, Computer Speech & Language, 24, 1, 45–66.
Hirsch H.-G., Pearce D. (2000), The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions, ASR2000-Automatic Speech Recognition: Challenges for the new Millenium ISCA Tutorial and Research Workshop (ITRW).
Huang P.-S., Kim M., Hasegawa-Johnson M., Smaragdis P. (2015), Joint optimization of masks and deep recurrent neural networks for monaural source separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23, 12, 2136–2147.
Hu Y., Loizou P.C. (2008), Evaluation of objective quality measures for speech enhancement, IEEE Transactions on Audio, Speech, and Language Processing, 16, 1, 229–238.
Kim H.-G., Jang G.-J., Park J.-S., Kim J.-H., Oh Y.-H. (2012), Speech segregation based on pitch track correction and music–speech classification, Advances in Electrical and Computer Engineering, 12, 2, 15–20.
Kolbæk M., Tan Z.-H., Jensen J. (2017), Speech intelligibility potential of general and specialized deep neural network based speech enhancement systems, IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 25, 1, 153–167.
Krishnamoorthy P. (2011), An overview of subjective and objective quality measures for noisy speech enhancement algorithms, IETE Technical Review, 28, 4, 292–301.
Li N., Loizou P.C. (2008), Factors influencing intelligibility of ideal binary-masked speech: Implications for noise reduction, The Journal of the Acoustical Society of America, 123, 3, 1673–1682.
Mohammadiha N., Smaragdis P., Leijon A. (2013), Supervised and unsupervised speech enhancement using nonnegative matrix factorization, IEEE Transactions on Audio, Speech, and Language Processing, 21, 10, 2140–2151.
Narayanan A., Wang D. (2013), The role of binary mask patterns in automatic speech recognition in background noise, The Journal of the Acoustical Society of America, 133, 5, 3083–3093.
Rix A.W., Beerends J.G., Hollier M.P., Hekstra A.P. (2001), Perceptual evaluation of speech quality (PESQ) – a new method for speech quality assessment of telephone networks and codecs, 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '01), Proceedings, 2, 749–752.
Roman N., Woodruff J. (2013), Speech intelligibility in reverberation with ideal binary masking: Effects of early reflections and signal-to-noise ratio threshold, The Journal of the Acoustical Society of America, 133, 3, 1707–1717.
Rothauser E. (1969), IEEE recommended practice for speech quality measurements, IEEE Transactions on Audio and Electroacoustics, 17, 225–246.
Saleem N., Irfan M. (2017), Noise reduction based on soft masks by incorporating SNR uncertainty in frequency domain, Circuits, Systems, and Signal Processing, 1–22.
Saleem N., Shafi M., Mustafa E., Nawaz A. (2015), A novel binary mask estimation based on spectral subtraction gain-induced distortions for improved speech intelligibility and quality, University of Engineering and Technology Taxila, Technical Journal, 20, 4, 36.
Scalart P. (1996), Speech enhancement based on a priori signal to noise estimation, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '96), Conference Proceedings, 2, 629–632.
Schwerin B., Paliwal K. (2014), Using STFT real and imaginary parts of modulation signals for MMSE-based speech enhancement, Speech Communication, 58, 49–68.
Seltzer M.L., Yu D., Wang Y. (2013), An investigation of deep neural networks for noise robust speech recognition, 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 7398–7402.
Sun C., Xie J., Leng Y. (2016), A signal subspace speech enhancement approach based on joint low-rank and sparse matrix decomposition, Archives of Acoustics, 41, 2, 245–254.
Sun M., Zhang X., Zheng T.F. (2016), Unseen noise estimation using separable deep auto encoder for speech enhancement, IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 24, 1, 93–104.
Sun M., Li Y., Gemmeke J.F., Zhang X. (2015), Speech enhancement under low SNR conditions via noise estimation using sparse and low-rank NMF with Kullback-Leibler divergence, IEEE Transactions on Audio, Speech, and Language Processing, 23, 7, 1233–1242.
Taal C.H., Hendriks R.C., Heusdens R., Jensen J. (2011), An algorithm for intelligibility prediction of time-frequency weighted noisy speech, IEEE Transactions on Audio, Speech, and Language Processing, 19, 7, 2125–2136.
Wang D. (2005), On ideal binary mask as the computational goal of auditory scene analysis, [in:] Divenyi P. (Ed.), Speech separation by humans and machines, Springer, Boston, pp. 181–197.
Wang D., Brown G.J. (2006), Computational auditory scene analysis: Principles, algorithms, and applications, Wiley-IEEE Press, Hoboken, NJ.
Wang Y., Narayanan A., Wang D. (2014), On training targets for supervised speech separation, IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 22, 12, 1849–1858.
Wang Y., Wang D. (2013), Towards scaling up classification-based speech separation, IEEE Transactions on Audio, Speech, and Language Processing, 21, 7, 1381–1390.
Xu Y. et al. (2017), Unsupervised feature learning based on deep models for environmental audio tagging, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25, 6, 1230–1241.
Xu Y., Du J., Dai L.-R., Lee C.-H. (2014), An experimental study on speech enhancement based on deep neural networks, IEEE Signal Processing Letters, 21, 1, 65–68.
Xu Y., Du J., Dai L.-R., Lee C.-H. (2015), A regression approach to speech enhancement based on deep neural networks, IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 23, 1, 7–19.
Zao L., Coelho R., Flandrin P. (2014), Speech enhancement with EMD and Hurst-based mode selection, IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 22, 5, 899–911.