Archives of Acoustics, 41, 1, pp. 107–118, 2016
10.1515/aoa-2016-0011

An Effective Speaker Clustering Method using UBM and Ultra-Short Training Utterances

Robert HOSSA
Wrocław University of Technology
Poland

Ryszard Andrzej MAKOWSKI
Wrocław University of Technology
Poland

The same speech sounds (phones) produced by different speakers can sometimes differ significantly. It is therefore essential for automatic speech recognition (ASR) systems to use algorithms that compensate for these differences. Speaker clustering is an attractive solution to the compensation problem because it requires neither long utterances nor high computational effort at the recognition stage. This paper proposes a clustering method based solely on adaptation of the weights of a universal background model (UBM). The method proved effective even with a very short utterance. The resulting improvement in frame recognition quality, measured by frame error rate, is over 5%.

Notably, this improvement applies to all vowels, even though the clustering discussed in this paper was based only on the phoneme a. This indicates a strong correlation between the articulation of different vowels, which is probably related to the size of the vocal tract.
Keywords: automatic speech recognition; interindividual difference compensation; speaker clustering; universal background model; GMM weighting factor adaptation.
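
The abstract does not state the adaptation formula, so the following is only a minimal sketch of one common way to adapt GMM weights to a short utterance: MAP-style weight adaptation with a relevance factor, in the spirit of adaptive Gaussian mixture models. The function names, the diagonal-covariance assumption, the relevance value of 16 and the Hellinger distance used to compare weight vectors are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def log_gauss_diag(x, means, variances):
    """Log-density of every frame under every diagonal-covariance component.
    x: (T, D); means, variances: (K, D); returns an array of shape (T, K)."""
    diff = x[:, None, :] - means[None, :, :]
    return -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                   + np.sum(np.log(2.0 * np.pi * variances), axis=1))

def adapt_weights(x, weights, means, variances, relevance=16.0):
    """Adapt only the mixture weights of a UBM to the frames in x.
    Means and variances stay fixed (assumed scheme, see lead-in)."""
    log_p = np.log(weights) + log_gauss_diag(x, means, variances)
    log_p -= log_p.max(axis=1, keepdims=True)      # numerical stability
    post = np.exp(log_p)
    post /= post.sum(axis=1, keepdims=True)        # component posteriors per frame
    n = post.sum(axis=0)                           # soft frame counts per component
    alpha = n / (n + relevance)                    # data-dependent mixing factor
    new_w = alpha * n / len(x) + (1.0 - alpha) * weights
    return new_w / new_w.sum()                     # renormalise to sum to one

def hellinger(p, q):
    """Hellinger distance between two weight vectors (hypothetical choice
    of dissimilarity for grouping speakers by their adapted weights)."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))
```

With weight vectors of this kind, each speaker is represented by a single point on the probability simplex, so any standard clustering procedure over those points could be used to form speaker groups; which procedure and distance the paper uses is not stated in the abstract.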






Copyright © Polish Academy of Sciences & Institute of Fundamental Technological Research (IPPT PAN)