Archives of Acoustics, 40, 4, pp. 609–619, 2015

Comparative Study of Visual Feature for Bimodal Hindi Speech Recognition

Aligarh Muslim University

Aligarh Muslim University

Musiur Raza ABIDI
Aligarh Muslim University

Mindz Technology

Robustness to varying background noise is an important challenge in building speech recognition applications. This paper proposes a bimodal approach to improve the robustness of a Hindi speech recognition system. It also studies the importance of different types of visual features for an audio-visual automatic speech recognition (AVASR) system under diverse noisy audio conditions. Four sets of visual features are reported, based on the two-dimensional discrete cosine transform (2D-DCT), principal component analysis (PCA), the two-dimensional discrete wavelet transform followed by DCT (2D-DWT-DCT), and the two-dimensional discrete wavelet transform followed by PCA (2D-DWT-PCA). The audio features are Mel-frequency cepstral coefficients (MFCC), comprising static and dynamic features. Overall, 48 features (39 audio and 9 visual) are used to measure the performance of the AVASR system. Performance is also evaluated on noisy speech generated using the NOISEX database at signal-to-noise ratios (SNR) from 30 dB to -10 dB, using the Aligarh Muslim University Audio Visual (AMUAV) Hindi corpus. The AMUAV corpus is a high-quality audio-visual database of continuous Hindi speech, consisting of Hindi sentences spoken by different subjects.
Keywords: Aligarh Muslim University audio visual corpus; AVASR; bimodal; DCT; DWT.
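
The abstract names two operations that are easy to illustrate: corrupting clean speech with noise at a chosen SNR, and reducing a mouth-region image to a handful of 2D-DCT coefficients that are appended to the audio features. The Python sketch below is only an illustration under stated assumptions and is not the authors' implementation. The function names, the choice of the 3 x 3 low-frequency block as the 9 visual features, and the per-frame concatenation are all assumptions; the abstract does not say how the 9 coefficients are selected or how the audio and video frame rates are aligned.

```python
import math


def mean_power(signal):
    """Average power (mean of squared samples) of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaling the noise so the mix has the target SNR in dB.

    From SNR_dB = 10*log10(P_speech / (g^2 * P_noise)) the noise gain is
    g = sqrt(P_speech / (P_noise * 10^(SNR_dB / 10))).
    """
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_noise = mean_power(noise)
    if p_noise == 0:
        raise ValueError("noise segment has zero power")
    gain = math.sqrt(mean_power(speech) / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB of a noisy signal relative to its clean version."""
    residual = [n - c for c, n in zip(clean, noisy)]
    return 10 * math.log10(mean_power(clean) / mean_power(residual))


def dct2_low_freq(roi, k=3):
    """First k x k coefficients of the orthonormal 2D DCT-II of a grayscale ROI.

    `roi` is a list of equal-length rows. Keeping the k x k low-frequency
    block (9 values for k=3) is an assumed selection rule, not the paper's.
    """
    rows, cols = len(roi), len(roi[0])

    def alpha(u, n):
        # Normalisation factor that makes the transform orthonormal.
        return math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)

    coeffs = []
    for u in range(k):
        for v in range(k):
            total = 0.0
            for x in range(rows):
                cx = math.cos(math.pi * (2 * x + 1) * u / (2 * rows))
                for y in range(cols):
                    cy = math.cos(math.pi * (2 * y + 1) * v / (2 * cols))
                    total += roi[x][y] * cx * cy
            coeffs.append(alpha(u, rows) * alpha(v, cols) * total)
    return coeffs


def fuse_features(audio_features, visual_features):
    """Concatenate one frame of audio and visual features (feature-level fusion)."""
    return list(audio_features) + list(visual_features)
```

With a 39-dimensional audio vector and `dct2_low_freq(roi, k=3)`, `fuse_features` yields the 48-dimensional vector mentioned in the abstract. In practice a library routine such as a fast DCT would replace the direct double sum, which is written out here only to keep the sketch dependency-free.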


DOI: 10.1515/aoa-2015-0061

Copyright © Polish Academy of Sciences & Institute of Fundamental Technological Research (IPPT PAN)