Predictive Processing Approach to Modelling Prosodic Hierarchy for Speech Synthesis

Project: Research Council of Finland: Academy Project

Project Details

Description (abstract)

Speech communication relies profoundly on hierarchically organized melodic, rhythmic, and temporal indices to the information relevant in a given situation and context, which fall under the common descriptor of prosody. Speech prosody maintains the context-dependent cohesion and coherence of speech communication by encoding a wide array of linguistic, paralinguistic, and extralinguistic features in a parallel and interconnected fashion.
Speech synthesis technology has recently achieved almost human-like quality of synthesized output, primarily for prosodically neutral utterances. Despite this progress, state-of-the-art systems fall short of reproducing the prosodic characteristics ubiquitous in human speech. This shortcoming arises primarily from architectural and conceptual decisions that fail to recognize the hierarchical nature of prosody as a crucial feature of spoken interaction.
Here we propose a novel speech prosody modelling architecture, and its implementation within a speech synthesis system. The architecture explicitly uses hierarchically encoded prosodic information and is an instantiation of the highly influential Predictive Processing cognitive modelling paradigm within the domain of speech communication. Unifying the treatment of speech perception and production within a single system allows for quantification of high-level parameters capturing aspects of rudimentary situation awareness, including a conversational setting. The use of an ecologically and cognitively grounded model in connection with a linguistically valid descriptive framework is believed to provide more explanatory power than either entirely data-oriented or purely linguistically motivated treatments of prosody in use today.
A parallel objective of the project is to contribute to our theoretical understanding of the features and wide-ranging interdependencies that shape speech prosody and give rise to its context-sensitive realization, which humans achieve readily. For this task, the developed deep-learning platform will serve as a complex statistical model capturing the representations that guide our communicative behaviour in terms of interactions between various hierarchically organized prosodic units. The development and validation of both the technological and the theoretical platforms will be assisted by the wide expertise in speech technology and prosodic analysis provided by the host research team and our academic partners.
Status: Active
Effective start/end date: 01/09/2023 to 31/08/2027