Estimating Ensemble Location and Width in Binaural Recordings of Music with Convolutional Neural Networks
Abstract
Binaural audio technology has been in existence for many years. However, its popularity has increased significantly over the past decade as a consequence of advancements in virtual reality and streaming techniques. Along with its growing popularity, the quantity of publicly accessible binaural audio recordings has also expanded. Consequently, there is now a need for automated and objective retrieval of spatial content information, with ensemble location and width being the most prominent parameters. This study presents a novel method for estimating these ensemble parameters in binaural recordings of music. For this purpose, a dataset of 23 040 binaural recordings was synthesized from 192 publicly available music recordings using 30 head-related transfer functions. The synthesized excerpts were then used to train a multi-task spectrogram-based convolutional neural network model, aiming to estimate the ensemble location and width for unseen recordings. The results indicate that a model for estimating ensemble parameters can be successfully constructed with low prediction errors: 4.76° (±0.10°) for ensemble location and 8.57° (±0.19°) for ensemble width. The method developed in this study outperforms previous spatiogram-based techniques recently published in the literature and shows promise for future development as part of a novel tool for the analysis of binaural audio recordings.
Keywords: ensemble width, ensemble location, binaural, spatial audio, localization, convolutional neural network, head-related transfer function, angle of arrival
References
1. Abdel-Hamid O., Mohamed A.-r., Jiang H., Penn G. (2012), Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition, [in:] 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4277–4280, https://doi.org/10.1109/ICASSP.2012.6288864
2. Algazi V.R., Duda R.O., Thompson D.M., Avendano C. (2001), The CIPIC HRTF database, [in:] Proceedings of the 2001 IEEE Workshop on the Applications of Signal Processing to Audio and Acoustics (Cat. No.01TH8575), pp. 99–102, https://doi.org/10.1109/ASPAA.2001.969552
3. Andreopoulou A., Begault D.R., Katz B.F.G. (2015), Inter-laboratory round robin HRTF measurement comparison, [in:] IEEE Journal of Selected Topics in Signal Processing, 9(5): 895–906, https://doi.org/10.1109/JSTSP.2015.2400417
4. Antoniuk P. (2024), Software repository: Estimating ensemble location and width in binaural recordings of music with convolutional neural networks, GitHub, https://github.com/pawel-antoniuk/ensemble-width-cnn (access: 07.01.2024).
5. Antoniuk P., Zieliński S.K. (2023), Blind estimation of ensemble width in binaural music recordings using ‘spatiograms’ under simulated anechoic conditions, [in:] Audio Engineering Society Conference: AES 2023 International Conference on Spatial and Immersive Audio.
6. Armstrong C., Thresh L., Murphy D., Kearney G. (2018), A perceptual evaluation of individual and nonindividual HRTFs: A case study of the SADIE II database, Applied Sciences, 8(11): 2029, https://doi.org/10.3390/app8112029
7. Arthi S., Sreenivas T.V. (2021), Spatiogram: A phase based directional angular measure and perceptual weighting for ensemble source width, ArXiv, https://doi.org/10.48550/arXiv.2112.07216
8. Austrian Academy of Sciences (2014), HRTF-Database, https://www.oeaw.ac.at/en/ari/das-institut/software/hrtf-database
9. Benaroya E.L., Obin N., Liuni M., Roebel A., Raumel W., Argentieri S. (2018), Binaural localization of multiple sound sources by non-negative tensor factorization, [in:] IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(6): 1072–1082, https://doi.org/10.1109/TASLP.2018.2806745
10. Blauert J. (1996), Spatial Hearing: The Psychophysics of Human Sound Localization, The MIT Press, https://doi.org/10.7551/mitpress/6391.001.0001
11. Branke J. (1995), Evolutionary algorithms for neural network design and training, [in:] Proceedings of the First Nordic Workshop on Genetic Algorithms and its Application, pp. 145–163.
12. Braren H.S., Fels J. (2020), A high-resolution individual 3D adult head and torso model for HRTF simulation and validation: HRTF measurement, RWTH Publications, https://doi.org/10.18154/RWTH-2020-06761
13. Bregman A. (1994), Auditory scene analysis: The perceptual organization of sound, The Journal of the Acoustical Society of America, 95(2): 1177–1178, https://doi.org/10.1121/1.408434
14. Brinkmann F., Dinakaran M., Pelzer R., Grosche P., Voss D., Weinzierl S. (2019), A cross-evaluated database of measured and simulated HRTFs including 3D head meshes, anthropometric features, and headphone impulse responses, Journal of the Audio Engineering Society, 67(9): 705–718, https://doi.org/10.17743/jaes.2019.0024
15. Brinkmann F. et al. (2017), A high resolution and full-spherical head-related transfer function database for different head-above-torso orientations, Journal of the Audio Engineering Society, 65(10): 841–848, https://doi.org/10.17743/jaes.2017.0033
16. Cherry E.C. (1953), Some experiments on the recognition of speech, with one and with two ears, The Journal of the Acoustical Society of America, 25(5): 975–979, https://doi.org/10.1121/1.1907229
17. Chollet F. et al. (2015), Keras, GitHub, https://github.com/fchollet/keras (access: 07.01.2024).
18. Chung M.-A., Chou H.-C., Lin C.-W. (2022), Sound localization based on acoustic source using multiple microphone array in an indoor environment, Electronics, 11(6): 890, https://doi.org/10.3390/electronics11060890
19. Clifton R.K., Gwiazda J., Bauer J.A., Clarkson M.G., Held R.M. (1988), Growth in head size during infancy: Implications for sound localization, Developmental Psychology, 24(4): 477–483, https://doi.org/10.1037/0012-1649.24.4.477
20. Dietz M., Ewert S.D., Hohmann V. (2011), Auditory model based direction estimation of concurrent speakers from binaural signals, Speech Communication, 53(5): 592–605, https://doi.org/10.1016/j.specom.2010.05.006
21. Eisenman A. et al. (2020), Check-N-Run: A checkpointing system for training recommendation models, ArXiv.
22. Espi M., Fujimoto M., Kinoshita K., Nakatani T. (2015), Exploiting spectro-temporal locality in deep learning based acoustic event detection, EURASIP Journal on Audio, Speech, and Music Processing, 2015: 26, https://doi.org/10.1186/s13636-015-0069-2
23. Gardner B., Martin K. (1994), HRTF Measurements of a KEMAR dummy-head microphone, https://sound.media.mit.edu/resources/KEMAR.html (access: 06.19.2024).
24. Garofolo J.S., Lamel L., Fisher W.M., Fiscus J.G., Pallett D.S., Dahlgren N.L. (1993), DARPA TIMIT: Acoustic-Phonetic Continuous Speech Corpus CD-ROM, NIST Speech Disc 1-1.1, NIST Publications, https://doi.org/10.6028/NIST.IR.4930
25. Hahmann M., Fernandez-Grande E., Gunawan H., Gerstoft P. (2022), Sound source localization using multiple ad hoc distributed microphone arrays, JASA Express Letters, 2(7): 074801, https://doi.org/10.1121/10.0011811
26. Han Y., Park J., Lee K. (2017), Convolutional neural networks with binaural representations and background subtraction for acoustic scene classification, [in:] Workshop on Detection and Classification of Acoustic Scenes and Events.
27. Hirsh I.J. (1950), Binaural hearing aids: A review of some experiments, Journal of Speech and Hearing Disorders, 15(2): 114–123, https://doi.org/10.1044/jshd.1502.114
28. Ioffe S., Szegedy C. (2015), Batch normalization: Accelerating deep network training by reducing internal covariate shift, [in:] Proceedings of the 32nd International Conference on Machine Learning, pp. 448–456.
29. ITU (2023), BS.1770: Algorithms to measure audio programme loudness and true-peak audio level, International Telecommunication Union, Geneva, Switzerland.
30. Kaveh M., Barabell A. (1986), The statistical performance of the MUSIC and the minimum-norm algorithms in resolving plane waves in noise, [in:] IEEE Transactions on Acoustics, Speech, and Signal Processing, 34(2): 331–341, https://doi.org/10.1109/TASSP.1986.1164815
31. King A.J., Kacelnik O., Mrsic-Flogel T.D., Schnupp J.W., Parsons C.H., Moore D.R. (2001), How plastic is spatial hearing?, Audiology and Neurotology, 6(4): 182–186, https://doi.org/10.1159/000046829
32. Kingma D.P., Ba J. (2014), Adam: A method for stochastic optimization, [in:] International Conference on Learning Representations.
33. Krizhevsky A., Sutskever I., Hinton G.E. (2012), ImageNet classification with deep convolutional neural networks, [in:] Advances in Neural Information Processing Systems 25 (NIPS 2012).
34. Kuhn M., Johnson K. (2013), Applied Predictive Modeling, Springer, New York, https://doi.org/10.1007/978-1-4614-6849-3
35. LeCun Y. et al. (1989), Handwritten digit recognition with a back-propagation network, [in:] Advances in Neural Information Processing Systems 2 (NIPS 1989).
36. Lin M., Chen Q., Yan S. (2013), Network in network, [in:] International Conference on Learning Representations.
37. Listen HRTF Database (n.d.), http://recherche.ircam.fr/equipes/salles/listen/ (access: 06.19.2024).
38. Liu M., Hu J., Zeng Q., Jian Z., Nie L. (2022), Sound source localization based on multi-channel cross-correlation weighted beamforming, Micromachines, 13(7): 1010, https://doi.org/10.3390/mi13071010
39. Liu Q., Wang W., de Campos T., Jackson P.J.B., Hilton A. (2018), Multiple speaker tracking in spatial audio via PHD filtering and depth-audio fusion, [in:] IEEE Transactions on Multimedia, 20(7): 1767–1780, https://doi.org/10.1109/TMM.2017.2777671
40. Ma N., Brown G.J. (2016), Speech localisation in a multitalker mixture by humans and machines, [in:] Interspeech 2016, pp. 3359–3363, https://doi.org/10.21437/Interspeech.2016-1149
41. Ma N., Gonzalez J.A., Brown G.J. (2018), Robust binaural localization of a target sound source by combining spectral source models and deep neural networks, [in:] IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(11): 2122–2131, https://doi.org/10.1109/TASLP.2018.2855960
42. Ma N., May T., Brown G.J. (2017), Exploiting deep neural networks and head movements for robust binaural localisation of multiple sources in reverberant environments, [in:] IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25(12): 2444–2453, https://doi.org/10.1109/TASLP.2017.2750760
43. May T., Ma N., Brown G.J. (2015), Robust localisation of multiple speakers exploiting head movements and multi-conditional training of binaural cues, [in:] 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2679–2683, https://doi.org/10.1109/ICASSP.2015.7178457
44. May T., van de Par S., Kohlrausch A. (2011), A probabilistic model for robust localization based on a binaural auditory front-end, [in:] IEEE Transactions on Audio, Speech, and Language Processing, 19(1): 1–13, https://doi.org/10.1109/TASL.2010.2042128
45. May T., van de Par S., Kohlrausch A. (2012), A binaural scene analyzer for joint localization and recognition of speakers in the presence of interfering noise sources and reverberation, [in:] IEEE Transactions on Audio, Speech, and Language Processing, 20(7): 2016–2030, https://doi.org/10.1109/TASL.2012.2193391
46. Miikkulainen R. et al. (2017), Evolving deep neural networks, ArXiv.
47. Morgan N., Bourlard H. (1989), Generalization and parameter estimation in feedforward nets: Some experiments, [in:] Advances in Neural Information Processing Systems 2 (NIPS 1989).
48. Pan Z., Zhang M., Wu J., Wang J., Li H. (2021), Multi-tone phase coding of interaural time difference for sound source localization with spiking neural networks, [in:] IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29: 2656–2670, https://doi.org/10.1109/TASLP.2021.3100684
49. Pang C., Liu H., Li X. (2019), Multitask learning of time-frequency CNN for sound source localization, [in:] IEEE Access, 7: 40725–40737, https://doi.org/10.1109/ACCESS.2019.2905617
50. Pavlidi D., Puigt M., Griffin A., Mouchtaris A. (2012), Real-time multiple sound source localization using a circular microphone array based on single-source confidence measures, [in:] 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2625–2628, https://doi.org/10.1109/ICASSP.2012.6288455
51. Pocock S.J., Hughes M.D. (1989), Practical problems in interim analyses, with particular regard to estimation, Controlled Clinical Trials, 10(4): 209–221, https://doi.org/10.1016/0197-2456(89)90059-7
52. Pörschmann C., Arend J., Neidhardt A. (2017), A spherical near-field HRTF set for auralization and psychoacoustic research, [in:] Proceedings of the 142nd AES Convention.
53. Raake A. (2016), A computational framework for modelling active exploratory listening that assigns meaning to auditory scenes – Reading the world with two ears, Two!Ears, http://twoears.eu (access: 06.11.2024).
54. Rumsey F. (2002), Spatial quality evaluation for reproduced sound: Terminology, meaning, and a scene-based paradigm, Journal of the Audio Engineering Society, 50(9): 651–666.
55. Sainath T.N., Mohamed A.-r., Kingsbury B., Ramabhadran B. (2013), Deep convolutional neural networks for LVCSR, [in:] 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8614–8618, https://doi.org/10.1109/ICASSP.2013.6639347
56. Senior M. (2023), The 'Mixing Secrets' Free Multitrack Download Library, Cambridge Music Technology, https://cambridge-mt.com/ms/mtk/ (access: 06.10.2024).
57. Shafiee M.J., Mishra A., Wong A. (2016), Deep learning with Darwin: Evolutionary synthesis of deep neural networks, Neural Processing Letters, 48: 603–613, https://doi.org/10.1007/s11063-017-9733-0
58. Spagnol S., Miccini R., Unnthórsson R. (2020), The Viking HRTF Dataset v2.
59. Spagnol S., Purkhus K.B., Unnthórsson R., Bjornsson S.K. (2019), The Viking HRTF Dataset.
60. Srivastava N., Hinton G., Krizhevsky A., Sutskever I., Salakhutdinov R. (2014), Dropout: A simple way to prevent neural networks from overfitting, Journal of Machine Learning Research, 15(56): 1929–1958, http://jmlr.org/papers/v15/srivastava14a.html
61. Stanley K.O., Miikkulainen R. (2002), Evolving neural networks through augmenting topologies, Evolutionary Computation, 10(2): 99–127, https://doi.org/10.1162/106365602320169811
62. The MathWorks Inc. (2022a), Audio Toolbox, Version: 9.13.0 (R2022b), Natick, Massachusetts, United States, https://www.mathworks.com
63. The MathWorks Inc. (2022b), MATLAB, Version: 9.13.0 (R2022b), Natick, Massachusetts, United States, https://www.mathworks.com
64. Thiemann J., Müller M., Marquardt D., Doclo S., van de Par S. (2016), Speech enhancement for multimicrophone binaural hearing aids aiming to preserve the spatial auditory scene, [in:] EURASIP Journal on Advances in Signal Processing, 2016(1), https://doi.org/10.1186/s13634-016-0314-6
65. Thomas S., Ganapathy S., Saon G., Soltau H. (2014), Analyzing convolutional neural networks for speech activity detection in mismatched acoustic conditions, [in:] 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2519–2523, https://doi.org/10.1109/ICASSP.2014.6854054
66. Van Rossum G., Drake F.L. (2009), Python 3 Reference Manual, Scotts Valley, CA: CreateSpace.
67. Vecchiotti P., Ma N., Squartini S., Brown G.J. (2019), End-to-end binaural sound localisation from the raw waveform, [in:] ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 451–455, https://doi.org/10.1109/ICASSP.2019.8683732
68. Vera-Diaz J.M., Pizarro D., Macias-Guarasa J. (2018), Towards end-to-end acoustic localization using deep learning: From audio signals to source position coordinates, Sensors, 18(10): 3418, https://doi.org/10.3390/s18103418
69. Virtanen P. et al. (2020), SciPy 1.0: Fundamental algorithms for scientific computing in Python, Nature Methods, 17: 261–272, https://doi.org/10.1038/s41592-019-0686-2
70. Wang J., Wang J., Qian K., Xie X., Kuang J. (2020), Binaural sound localization based on deep neural network and affinity propagation clustering in mismatched HRTF condition, EURASIP Journal on Audio, Speech, and Music Processing, 2020, https://doi.org/10.1186/s13636-020-0171-y
71. Watanabe K., Iwaya Y., Suzuki Y., Takane S., Sato S. (2014), Dataset of head-related transfer functions measured with a circular loudspeaker array, Acoustical Science and Technology, 35(3): 159–165, https://doi.org/10.1250/ast.35.159
72. Wierstorf H., Geier M., Raake A., Spors S. (2011), A free database of head-related impulse response measurements in the horizontal plane with multiple distances, [in:] 130th Convention. Engineering Brief. Audio Engineering Society.
73. Woodruff J., Wang D. (2012), Binaural localization of multiple sources in reverberant and noisy environment, [in:] IEEE Transactions on Audio, Speech, and Language Processing, 20(5): 1503–1512, https://doi.org/10.1109/TASL.2012.2183869
74. Yang Q., Zheng Y. (2022), DeepEar: Sound localization with binaural microphones, [in:] IEEE INFOCOM 2022 – IEEE Conference on Computer Communications, pp. 960–969, https://doi.org/10.1109/INFOCOM48880.2022.9796850
75. Yu G., Wu R., Liu Y., Xie B. (2018), Near-field head-related transfer-function measurement and database of human subjects, The Journal of the Acoustical Society of America, 143(3): EL194–EL198, https://doi.org/10.1121/1.5027019
76. Zhang H., Kiranyaz S., Gabbouj M. (2018), Finding better topologies for deep convolutional neural networks by evolution, ArXiv, https://doi.org/10.48550/arXiv.1809.03242
77. Zhang W., Samarasinghe P.N., Chen H., Abhayapala T.D. (2017), Surround by sound: A review of spatial audio recording and reproduction, Applied Sciences, 7(5): 532, https://doi.org/10.3390/app7050532
78. Zieliński S.K., Antoniuk P., Lee H. (2022a), Spatial audio scene characterization (SASC): Automatic localization of front-, back-, up-, and down-positioned music ensembles in binaural recordings, Applied Sciences, 12(3): 1569, https://doi.org/10.3390/app12031569
79. Zieliński S.K., Antoniuk P., Lee H., Johnson D. (2022b), Automatic discrimination between front and back ensemble locations in HRTF-convolved binaural recordings of music, EURASIP Journal on Audio, Speech, and Music Processing, 2022(1): 3, https://doi.org/10.1186/s13636-021-00235-2
80. Zieliński S.K., Lee H., Antoniuk P., Dadan O. (2020), A comparison of human against machine-classification of spatial audio scenes in binaural recordings of music, Applied Sciences, 10(17): 5956, https://doi.org/10.3390/app10175956
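To make the pipeline summarized in the abstract concrete, the sketch below synthesizes a binaural ensemble excerpt by convolving mono sources with head-related impulse response (HRIR) pairs and converts the result into the kind of two-channel log-magnitude spectrogram tensor a CNN would consume. This is an illustrative assumption of the processing, not the authors' implementation (which used MATLAB and Keras; see refs. 17, 62, 63): the function names, the sampling rate, and the STFT parameters are all hypothetical.

```python
import numpy as np
from scipy import signal


def binauralize(source, hrir_left, hrir_right):
    # Render a mono source at one azimuth by convolving it with the
    # left- and right-ear head-related impulse responses (HRIRs).
    return np.stack([np.convolve(source, hrir_left),
                     np.convolve(source, hrir_right)])


def binauralize_ensemble(sources, hrir_pairs):
    # Sum several rendered sources (all of equal length, with HRIRs of
    # equal length) into one two-channel ensemble recording; the
    # ensemble location and width are implied by the azimuths at which
    # the chosen HRIR pairs were measured.
    return sum(binauralize(s, hl, hr)
               for s, (hl, hr) in zip(sources, hrir_pairs))


def spectrogram_features(binaural, fs=48000, nperseg=1024):
    # Stack log-magnitude spectrograms of the two ear signals into a
    # (channel, frequency, time) tensor, a typical CNN input layout.
    feats = [np.log(signal.spectrogram(ch, fs=fs, nperseg=nperseg)[2]
                    + 1e-12)
             for ch in binaural]
    return np.stack(feats)
```

In a multi-task setting such as the one described in the abstract, one such tensor per excerpt would feed a shared convolutional trunk with two regression heads, one predicting ensemble location and the other ensemble width.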

