An Effective Speaker Clustering Method using UBM and Ultra-Short Training Utterances

Robert HOSSA; Ryszard Andrzej MAKOWSKI

doi:10.1515/aoa-2016-0011

Authors

Robert HOSSA Wrocław University of Technology, Poland
Ryszard Andrzej MAKOWSKI Wrocław University of Technology, Poland

Abstract

The same speech sounds (phones) produced by different speakers can sometimes differ significantly. It is therefore essential for ASR systems to use algorithms that compensate for these differences. Speaker clustering is an attractive solution to the compensation problem, as it requires neither long utterances nor high computational effort at the recognition stage. This report proposes a clustering method based solely on the adaptation of UBM weights. The solution proved effective even with a very short utterance. The resulting improvement in frame recognition quality, measured by frame error rate, exceeds 5%. Notably, this improvement applies to all vowels, even though the clustering discussed in this report was based only on the phoneme a. This indicates a strong correlation between the articulation of different vowels, which is probably related to the size of the vocal tract.
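The abstract does not state the exact adaptation rule, so the following is only a minimal sketch of what adapting UBM weights alone can look like. It assumes the standard MAP weight update used in GMM-UBM speaker verification (Reynolds et al., 2000), one-dimensional toy features instead of real acoustic feature vectors, and an arbitrary relevance factor. The function names `gaussian_pdf` and `adapt_weights` and all numeric values are hypothetical and are not taken from the paper.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian (toy stand-in for multivariate acoustic features)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def adapt_weights(frames, weights, means, variances, relevance=16.0):
    """Weight-only MAP adaptation of a GMM; means and variances stay fixed.

    Assumed rule (not confirmed by the abstract):
        w_i' ~ alpha_i * n_i / T + (1 - alpha_i) * w_i,  alpha_i = n_i / (n_i + r)
    followed by renormalization so the weights sum to one.
    """
    soft_counts = [0.0] * len(weights)
    for x in frames:
        # Posterior probability of each mixture component for this frame.
        joint = [w * gaussian_pdf(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        total = sum(joint)
        for i, value in enumerate(joint):
            soft_counts[i] += value / total

    num_frames = len(frames)
    adapted = []
    for n_i, w_i in zip(soft_counts, weights):
        alpha = n_i / (n_i + relevance)  # data-dependent mixing coefficient
        adapted.append(alpha * n_i / num_frames + (1.0 - alpha) * w_i)

    scale = sum(adapted)
    return [w / scale for w in adapted]


# Hypothetical two-component "UBM" and a handful of frames from one speaker.
ubm_weights = [0.5, 0.5]
ubm_means = [-1.0, 1.0]
ubm_vars = [1.0, 1.0]
frames = [0.9, 1.1, 1.3, 0.8, 1.0]

speaker_weights = adapt_weights(frames, ubm_weights, ubm_means, ubm_vars,
                                relevance=2.0)
print(speaker_weights)
```

Because only the weight vector changes, a few frames of a single phoneme are enough to shift it toward the components that best match the speaker, and the adapted vector could then serve as a compact speaker descriptor for clustering. How the paper compares or groups such vectors is not specified in the abstract.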

Keywords:

automatic speech recognition, interindividual difference compensation, speaker clustering, universal background model, GMM weighting factor adaptation.

References

1. Anderson T.W. (2003), An Introduction to Multivariate Statistical Analysis, 3rd ed., John Wiley & Sons Inc, New York.

2. Basseville M. (1989), Distance measures for signal processing and pattern recognition, Signal Processing, 18, 349–369.

3. Bishop C.M. (2006), Pattern Recognition and Machine Learning, Springer, New York.

4. Chu S.M., Tang H., Huang T.S. (2009a), Locality preserving speaker clustering, Proceedings of IEEE International Conference on Multimedia and Expo, pp. 494–497, Mexico.

5. Chu S.M., Tang H., Huang T.S. (2009b), Fisher-voice and semi-supervised speaker clustering, International Conference on Acoustics, Speech and Signal Processing, pp. 4089–4092, Taipei.

6. De La Torre A., Peinado A.M., Segura J.C., Perez-Cordoba J.L., Benitez M.C., Rubio A.J. (2005), Histogram equalization of speech representation for robust speech recognition, IEEE Transactions on Speech and Audio Processing, 13, 355–366.

7. Duda R., Hart P., Stork D. (2000), Pattern Classification, 2nd ed., John Wiley & Sons Inc., New York.

8. Hazen T.J. (2000), A comparison of novel techniques for rapid speaker adaptation, Speech Communication, 31, 15–33.

9. He X., Niyogi P. (2003), Locality Preserving Projections, Advances in Neural Information Processing Systems, 16, Vancouver.

10. Iyer A.N., Ofoegbu U.O., Yantorno R.E., Smolinski B.Y. (2006), Blind Speaker Clustering, International Symposium on Intelligent Signal Processing and Communications Systems, pp. 343–346, Yonago.

11. Jassem W. (1973), Fundamentals of Acoustic Phonetics, [in Polish: Podstawy fonetyki akustycznej], PWN, Warszawa.

12. Kosaka T., Sagayama S. (1994), Tree-structured speaker clustering for fast speaker adaptation, Proceedings of International Conference on Acoustics, Speech and Signal Processing, pp. 245–248, Ostendorf.

13. Kuhn R., Junqua J.-C., Nguyen P., Niedzielski N. (2000), Rapid speaker adaptation in eigenvoice space, IEEE Transactions on Speech and Audio Processing, 8, 695–707.

14. Liu D., Kubala F. (2004), Online Speaker Clustering, Proceedings of International Conference on Acoustics, Speech and Signal Processing, pp. 333–336, Quebec.

15. Lu Z., Hui Y.V., Lee A.H. (2003), Minimum Hellinger distance estimation for finite Poisson regression models and its applications, Biometrics, 59, 1016–1026.

16. Mehrabani M., Hansen J.H.L. (2013), Singing speaker clustering based on subspace learning in the GMM mean supervector space, Speech Communication, 55, 653–666.

17. Makowski R. (2011), Automatic speech recognition – selected problems, [in Polish: Automatyczne rozpoznawanie mowy – wybrane zagadnienia], Oficyna Wydawnicza Politechniki Wrocławskiej, Wrocław.

18. Makowski R., Hossa R. (2014), Automatic speech signal segmentation based on innovations adaptive filter, International Journal of Applied Mathematics and Computer Science, 24, 259–270.

19. Mrówka P., Makowski R. (2008), Normalization of speaker individual characteristics and compensation of linear transmission distortions in command recognition systems, Archives of Acoustics, 33, 221–242.

20. Naito M., Deng L., Sagisaka Y. (2002), Speaker clustering for speech recognition using vocal tract parameters, Speech Communication, 36, 305–315.

21. Reynolds D.A., Rose R.C. (1995), Robust text-independent speaker identification using Gaussian mixture speaker models, IEEE Transactions on Speech and Audio Processing, 3, 72–83.

22. Reynolds D.A., Quatieri T.F., Dunn R.B. (2000), Speaker verification using adapted Gaussian mixture models, Digital Signal Processing, 10, 19–41.

23. Stafylakis T., Katsouros V., Carayannis G. (2006), The segmental Bayesian Information Criterion and its applications to speaker diarization, IEEE Journal of Selected Topics in Signal Processing, 4, 857–866.

24. Tang H., Chu S.M., Hasegawa-Johnson M., Huang T.S. (2012), Partially Supervised Speaker Clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence, 34, 959–971.

25. Tranter S., Reynolds D. (2006), An overview of Automatic Speaker Diarization Systems, IEEE Transactions on Audio, Speech and Language Processing, 14, 1557–1565.

26. Tsai W-H., Cheng S-S., Wang H-M. (2007), Automatic Speaker Clustering Using a Voice Characteristic Reference Space and Maximum Purity Estimation, IEEE Transactions on Audio, Speech and Language Processing, 15, 1461–1474.
