Comparative Study of Visual Feature for Bimodal Hindi Speech Recognition

Prashant UPADHYAYA; Omar FAROOQ; Musiur Raza ABIDI; Priyanka VARSHNEY

doi:10.1515/aoa-2015-0061

Authors

Prashant UPADHYAYA Aligarh Muslim University, India
Omar FAROOQ Aligarh Muslim University, India
Musiur Raza ABIDI Aligarh Muslim University, India
Priyanka VARSHNEY Mindz Technology, India

Abstract

Robustness to varied background noise is an important challenge in building speech recognition applications. This paper proposes a bimodal approach to improve the robustness of a Hindi speech recognition system. It also studies the importance of different types of visual features for an audio-visual automatic speech recognition (AVASR) system under diverse noisy audio conditions. Four sets of visual features are reported, based on the Two-Dimensional Discrete Cosine Transform (2D-DCT), Principal Component Analysis (PCA), the Two-Dimensional Discrete Wavelet Transform followed by DCT (2D-DWT-DCT), and the Two-Dimensional Discrete Wavelet Transform followed by PCA (2D-DWT-PCA). The audio features are Mel-frequency cepstral coefficients (MFCC), comprising static and dynamic features. Overall, 48 features (39 audio features and 9 visual features) are used to measure the performance of the AVASR system. The performance of the AVASR system is also evaluated on noisy speech signals, generated using the NOISEX database, at signal-to-noise ratios (SNR) from 30 dB to -10 dB, using the Aligarh Muslim University Audio Visual (AMUAV) Hindi corpus. The AMUAV corpus is a high-quality audio-visual database of continuous Hindi speech, consisting of Hindi sentences spoken by different subjects.
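The noisy test conditions described above rest on scaling a noise recording so that the mixture reaches a chosen SNR. The following is a minimal sketch of that arithmetic, not the authors' procedure: the function names, the synthetic sine "speech", the Gaussian "noise", and the 10 dB step between 30 dB and -10 dB are all assumptions made for illustration.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR (dB)."""
    # Loop the noise so that it covers the whole utterance.
    noise = [noise[i % len(noise)] for i in range(len(speech))]
    # Solve power(speech) / (gain^2 * power(noise)) = 10^(snr_db / 10) for gain.
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB, recovered from the clean signal and the noisy mixture."""
    residual = [y - x for x, y in zip(clean, noisy)]
    return 10 * math.log10(power(clean) / power(residual))


# Hypothetical stand-ins for a speech utterance and a noise recording.
random.seed(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]
noise = [random.gauss(0.0, 1.0) for _ in range(400)]

# Assumed 10 dB steps across the 30 dB to -10 dB range given in the abstract.
for snr_db in range(30, -15, -10):
    noisy = mix_at_snr(speech, noise, snr_db)
    print(snr_db, round(measured_snr_db(speech, noisy), 3))
```

In this sketch the gain is computed from average power over the whole utterance; a real evaluation might instead measure speech power only over voiced segments, which would change the effective SNR.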

Keywords:

Aligarh Muslim University audio visual corpus, AVASR, bimodal, DCT, DWT.

