Speaker identity information in different parts of the speech signal: Human and machine perceptions

Responsible: Yu Zhang

Duration of the project: Aug. 2019 - Jul. 2021

Project Description:

Humans are by nature voice experts: we produce speech in idiosyncratic ways and are fully capable of perceiving others’ unique vocal productions. However, the relationship between acoustics and perception is not yet fully understood. Past research on voice recognition focused on time-invariant glottal and vocal-tract attributes of speech (e.g., fundamental and formant frequencies) as perceptually useful acoustic cues to individual voices. Recent studies have shown that the dynamic nature of speech articulation is also critical in encoding speaker identity. I therefore propose to investigate how speaker-specific information is perceived in the temporal organization of the speech signal.

Our previous studies showed that temporal variations in the parts of the speech signal corresponding to mouth-closing gestures (hereafter, signal negative dynamics) contain more speaker-specific information, and that this may be a universal phenomenon independent of language. However, the presence of idiosyncratic temporal features in acoustic analyses does not necessarily mean that listeners use them as identity cues in voice recognition. The proposed project aims to bridge this gap: using behavioral voice recognition methods, I will test whether and to what extent voice recognition is facilitated by signal negative dynamics.

I also plan to test to what degree the parts of the speech signal correlated with negative dynamics can assist automatic speaker recognition (ASR) systems. Since signal negative dynamics contain more speaker-specific information, they can be expected to provide important complementary information that helps ASR systems achieve higher recognition accuracy and robustness. The findings will enrich our understanding of the perceptual mechanisms underlying voice identity processing and shed light on the evolution of human communication. They also go beyond the traditional ASR approach and open up new avenues for developing ASR systems with higher performance and lower computational cost.
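To illustrate the core notion of signal negative dynamics, the sketch below locates frames of a speech signal where short-time intensity is falling, which serves as a rough acoustic proxy for mouth-closing gestures. This is a minimal, hypothetical illustration only: the frame and hop sizes, the dB floor, and the amplitude-modulated test tone are assumptions for demonstration, not the project's actual analysis procedure.

```python
import numpy as np

def intensity_contour(signal, sr, frame_ms=20, hop_ms=10):
    """Short-time RMS intensity in dB (frame/hop sizes are illustrative)."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n = 1 + max(0, (len(signal) - frame) // hop)
    rms = np.array([
        np.sqrt(np.mean(signal[i * hop:i * hop + frame] ** 2))
        for i in range(n)
    ])
    return 20 * np.log10(rms + 1e-12)  # small floor avoids log(0)

def negative_dynamics_frames(contour):
    """Boolean mask of frames where intensity is falling
    (a rough proxy for mouth-closing gestures)."""
    d = np.diff(contour, prepend=contour[0])
    return d < 0

# Illustrative input: a 220 Hz tone at 16 kHz with 2 Hz amplitude
# modulation, mimicking syllable-like intensity rises and falls.
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 220 * t) * (0.6 + 0.4 * np.sin(2 * np.pi * 2 * t))
contour = intensity_contour(signal, sr)
falling = negative_dynamics_frames(contour)
print(f"{falling.mean():.0%} of frames have falling intensity")
```

In a real analysis, the masked frames would delimit the intervals whose temporal variation is compared across speakers; here the modulated tone simply yields alternating rising and falling stretches, so roughly half of the frames are flagged.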

Publications (or conference presentations):

  1. Zhang, Yu; He, Lei; Kerdpol, Karnthida; Dellwo, Volker (2018). Between-speaker variability in intensity dynamics: the case of Thai. Talk given at the 27th Annual Conference of the International Association for Forensic Phonetics and Acoustics (IAFPA), Huddersfield, UK, July 29–August 1, 2018.
  2. Zhang, Yu; He, Lei; Dellwo, Volker (2019). Speaker individuality in the durational characteristics of voiced intervals: the case of Chinese bi-dialectal speakers. In: International Congress of Phonetic Sciences, Melbourne, Australia, 4–10 August 2019, pp. 3075–3079.
  3. He, Lei; Zhang, Yu; Dellwo, Volker (2019). Between-speaker variability and temporal organization of the first formant. Journal of the Acoustical Society of America 145(3), pp. EL209–EL214.

Keywords: speaker individuality, signal dynamics, human performance, machine performance

Funding source(s): Forschungskredit UZH (Candoc): FK-19-069

Partners: Volker Dellwo