Speech Enhancement Based on Constrained Low-rank Sparse Matrix Decomposition Integrated with Temporal Continuity Regularisation
Abstract
Speech enhancement under strong noise conditions is a challenging problem. Low-rank and sparse matrix decomposition (LSMD) theory has recently been applied to speech enhancement with good performance. Existing LSMD algorithms treat each frame as an independent observation. However, real-world speech usually has a temporal structure, and its acoustic characteristics vary slowly over time. In this paper, we propose a speech enhancement method based on temporal continuity constrained low-rank sparse matrix decomposition (TCCLSMD). In this method, speech separation is formulated as a TCCLSMD problem, and temporal continuity constraints are imposed in the LSMD process. We develop an alternating optimisation algorithm for noisy spectrogram decomposition. With TCCLSMD, the recovered speech spectrogram is more consistent with the structure of the clean speech spectrogram, which leads to more stable and reasonable results than the existing LSMD algorithm. Experiments with various types of noise show that the proposed algorithm achieves better performance than traditional speech enhancement algorithms, yielding less residual noise and lower speech distortion.
Keywords:
speech enhancement, temporal continuity, low-rank and sparse decomposition
References
1. Abdali S., NaserSharif B. (2017), Non-negative matrix factorization for speech/music separation using source dependent decomposition rank, temporal continuity term and filtering, Biomedical Signal Processing and Control, 36, 168–175, https://doi.org/10.1016/j.bspc.2017.03.010
2. Bando Y. et al. (2018), Speech enhancement based on Bayesian low-rank and sparse decomposition of multichannel magnitude spectrograms, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26, 2, 215–230, https://doi.org/10.1109/TASLP.2017.2772340
3. Boll S.F. (1979), Suppression of acoustic noise in speech using spectral subtraction, IEEE Transactions on Audio, Speech, and Signal Processing, 27, 2, 113–120, https://doi.org/10.1109/TASSP.1979.1163209
4. Bouwmans T., Sobral A., Javed S., Jung S.K., Zahzah E.-H. (2017), Decomposition into low-rank plus additive matrices for background/foreground separation: A review for a comparative evaluation with a large-scale dataset, Computer Science Review, 23, 1–71, https://doi.org/10.1016/j.cosrev.2016.11.001
5. Cai J.F., Candès E.J., Shen Z. (2010), A singular value thresholding algorithm for matrix completion, SIAM Journal on Optimization, 20, 4, 1956–1982, https://doi.org/10.1137/080738970
6. Candès E.J., Li X., Ma Y., Wright J. (2011), Robust principal component analysis?, Journal of the ACM, 58, 3, 1–37, https://doi.org/10.1145/1970392.1970395
7. Candès E.J., Plan Y. (2010), Matrix completion with noise, Proceedings of the IEEE, 98, 6, 925–936, https://doi.org/10.1109/JPROC.2009.2035722
8. Cohen I. (2004), Speech enhancement using a noncausal a priori SNR estimator, IEEE Signal Processing Letters, 11, 9, 725–728, https://doi.org/10.1109/LSP.2004.833478
9. Ephraim Y., Van Trees H. (1995), A signal subspace approach for speech enhancement, IEEE Transactions on Speech and Audio Processing, 3, 4, 251–266, https://doi.org/10.1109/89.397090
10. Hermus K., Wambacq P., Hamme H.V. (2007), A review of signal subspace speech enhancement and its application to noise robust speech recognition, EURASIP Journal on Advances in Signal Processing, 1–15, https://doi.org/10.1155/2007/45821
11. Hu Y., Loizou P.C. (2003), A generalized subspace approach for enhancing speech corrupted by colored noise, IEEE Transactions on Audio, Speech and Language Processing, 11, 4, 334–342, https://doi.org/10.1109/TSA.2003.814458
12. Hu Y., Loizou P.C. (2008), Evaluation of objective quality measures for speech enhancement, IEEE Transactions on Audio, Speech and Language Processing, 16, 1, 229–230, https://doi.org/10.1109/TASL.2007.911054
13. Jin K.H., Ye J.C. (2018), Sparse and low-rank decomposition of a hankel structured matrix for impulse noise removal, IEEE Transactions on Image Processing, 27, 3, 1448–1461, https://doi.org/10.1109/TIP.2017.2771471
14. Kammi S., Mollaei M.R.K. (2017), Noisy speech enhancement with sparsity regularization, Speech Communication, 87, 58–69, https://doi.org/10.1016/j.specom.2017.01.003
15. Kheder W.B., Matrouf D., Bousquet P.-M., Bonastre J.-F., Ajili M. (2017), Fast i-vector denoising using MAP estimation and a noise distributions database for robust speaker recognition, Computer Speech & Language, 45, 104–122, https://doi.org/10.1016/j.csl.2016.12.007
16. Kolbæk M., Tan Z.-H., Jensen J. (2017), Speech intelligibility potential of general and specialized deep neural network based speech enhancement systems, IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 25, 1, 153–167, https://doi.org/10.1109/TASLP.2016.2628641
17. Li X., Fan M., Liu L., Li W. (2018), Distributed-microphones based in-vehicle speech enhancement via sparse and low-rank spectrogram decomposition, Speech Communication, 98, 51–62, https://doi.org/10.1016/j.specom.2017.12.008
18. Liu H., Peng J. (2018), Sparse signal recovery via alternating projection method, Signal Processing, 143, 161–170, https://doi.org/10.1016/j.sigpro.2017.09.003
19. Loizou P.C. (2007), Speech Enhancement: Theory and Practice, New York: Taylor & Francis.
20. Lu Y., Loizou P.C. (2008), A geometric approach to spectral subtraction, Speech Communication, 50, 6, 453–466, https://doi.org/10.1016/j.specom.2008.01.003
21. Mavaddaty S., Ahadi S.M., Seyedin S. (2016), A novel speech enhancement method by learnable sparse and low-rank decomposition and domain adaptation, Speech Communication, 76, 42–60, https://doi.org/10.1016/j.specom.2015.11.003
22. Mohammadiha N., Arne L. (2013), Nonnegative HMM for babble noise derived from speech HMM: Application to speech enhancement, IEEE Transactions on Audio, Speech, and Language Processing, 21, 5, 998–1011, https://doi.org/10.1109/TASL.2013.2243435
23. Moor, de B. (1993), The singular value decomposition and long and short spaces of noisy matrices, IEEE Transactions on Signal Processing, 41, 9, 2826–2839, https://doi.org/10.1109/78.236505
24. Paliwal K., Schwerin B., Wójcicki K. (2012), Speech enhancement using a minimum mean-square error short-time spectral modulation magnitude estimator, Speech Communication, 54, 2, 282–305, https://doi.org/10.1016/j.specom.2011.09.003
25. Paliwal K., Wójcicki K., Schwerin B. (2010), Single-channel speech enhancement using spectral subtraction in the short-time modulation domain, Speech Communication, 52, 5, 450–475, https://doi.org/10.1016/j.specom.2010.02.004
26. Plapous C., Marro C., Scalart P. (2006), Improved signal-to-noise ratio estimation for speech enhancement, IEEE Transactions on Acoustics, Speech, and Signal Processing, 14, 6, 2098–2108, https://doi.org/10.1109/TASL.2006.872621
27. Quatieri T. (2002), Discrete-time speech signal processing: principles and practice, Prentice Hall, Upper Saddle River, NJ.
28. Rugini L., Banelli P. (2016), On the equivalence of maximum SNR and MMSE estimation: applications to additive non-Gaussian channels and quantized observations, IEEE Transactions on Signal Processing, 64, 23, 6190–6199, https://doi.org/10.1109/TSP.2016.2607152
29. Scalart P., Vieira-Filho J. (1996), Speech enhancement based on a priori signal to noise estimation. Proceedings on 21st IEEE International Conference on Acoustics, Speech, and Signal Processing Conference, Atlanta, GA, https://doi.org/10.1109/ICASSP.1996.543199
30. Shannon B., Paliwal K. (2006), Role of phase estimation in speech enhancement, [in:] INTERSPEECH-2006, paper 1330-Tue3FoP.4,
https://www.isca-speech.org/archive/archive_papers/interspeech_2006/i06_1330.pdf
31. Shi J., Song W. (2016), Sparse principal component analysis with measurement errors, Journal of Statistical Planning and Inference, 175, 87–99, https://doi.org/10.1016/j.jspi.2016.03.001
32. Stark A., Paliwal K. (2011), Use of speech presence uncertainty with MMSE spectral energy estimation for robust automatic speech recognition, Speech Communication, 53, 1, 51–61, https://doi.org/10.1016/j.specom.2010.08.001
33. Sun C., Mu J. (2015), An eigenvalue filtering based subspace approach for speech enhancement, Noise Control Engineering Journal, 63, 1, 36–48, https://doi.org/10.3397/1/376305
34. Sun C., Xie J., Leng Y. (2016), A signal subspace speech enhancement approach based on joint low-rank and sparse matrix decomposition, Archives of Acoustics, 41, 2, 245–254, https://doi.org/10.1515/aoa-2016-0024
35. Sun C., Zhu Q., Wan M. (2014), A novel speech enhancement method based on constrained low-rank and sparse matrix decomposition, Speech Communication, 60, 44–55, https://doi.org/10.1016/j.specom.2014.03.002
36. Sun M., Li Y., Gemmeke J.F., Zhang X. (2015), Speech enhancement under low SNR conditions via noise estimation using sparse and low-rank NMF with Kullback-Leibler divergence, IEEE Transactions on Audio, Speech, and Language Processing, 23, 7, 1233–1242, https://doi.org/10.1109/TASLP.2015.2427520
37. Tan H., Cheng B., Feng J., Feng G., Wang W., Zhang Y.-J. (2013), Low-n-rank tensor recovery based on multi-linear augmented Lagrange multiplier method, Neurocomputing, 119, 144–152, https://doi.org/10.1016/j.neucom.2012.03.039
38. Virtanen T. (2007), Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria, IEEE Transactions on Audio, Speech, and Language Processing, 15, 3, 1066–1074, https://doi.org/10.1109/TASL.2006.885253
39. Wiener N. (1949), Extrapolation, interpolation, and smoothing of stationary time series, New York: Wiley.
40. Wright J., Ganesh A., Rao S., Peng Y., Ma Y. (2009), Robust principal component analysis: exact recovery of corrupted low-rank matrices via convex optimization, [in:] Advances in Neural Information Processing Systems 22, Y. Bengio, D. Schuurmans, J.D. Lafferty, C.K.I. Williams, A. Culotta (Eds), pp. 2080–2088,
http://papers.nips.cc/paper/3704-robust-principal-component-analysis-exact-recovery-of-corrupted-low-rank-matrices-via-convex-optimization.pdf
41. Xu H., Caramanis C., Sanghavi S. (2012), Robust PCA via outlier pursuit, IEEE Transactions on Information Theory, 58, 5, 3047–3064, https://doi.org/10.1109/TIT.2011.2173156
42. Zhang Y., Zhao Y. (2013), Real and imaginary modulation spectral subtraction for speech enhancement, Speech Communication, 55, 4, 509–522, https://doi.org/10.1016/j.specom.2012.09.005
43. Zhen L., Peng D., Yi Z., Xiang Y., Chen P. (2017), Underdetermined blind source separation using sparse coding, IEEE Transactions on Neural Networks and Learning Systems, 28, 12, 3102–3108, https://doi.org/10.1109/TNNLS.2016.2610960