Prosodic Representations of Prominence Classification Neural Networks and Autoencoders Using Bottleneck Features

Research output: Contribution to journal › Conference article › Scientific › peer-review

Abstract

Prominence perception has been known to correlate with a complex interplay of the acoustic features of energy, fundamental frequency, spectral tilt, and duration. The contribution and importance of each of these features in distinguishing between prominent and non-prominent units in speech is not always easy to determine, and more so, the prosodic representations that humans and automatic classifiers learn have been difficult to interpret. This work focuses on examining the acoustic prosodic representations that binary prominence classification neural networks and autoencoders learn for prominence. We investigate the complex features learned at different layers of the network as well as the 10-dimensional bottleneck features (BNFs), for the standard acoustic prosodic correlates of prominence separately and in combination. We analyze and visualize the BNFs obtained from the prominence classification neural networks as well as their network activations. The experiments are conducted on a corpus of Dutch continuous speech with manually annotated prominence labels. Our results show that the prosodic representations obtained from the BNFs and higher-dimensional non-BNFs provide good separation of the two prominence categories, with, however, different partitioning of the BNF space for the distinct features, and the best overall separation obtained for F0.
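The abstract describes reading out 10-dimensional bottleneck features (BNFs) from a binary prominence classifier. As a rough illustration only (this is not the authors' code; the layer sizes, activations, and input dimensionality are assumptions), a feed-forward network with a narrow bottleneck layer from which BNFs are extracted can be sketched as:

```python
# Hypothetical sketch: a feed-forward prominence classifier with a
# 10-dimensional bottleneck layer. The bottleneck activations (BNFs)
# are returned alongside the prominence probability so they can be
# visualized or analyzed separately.
import math
import random

random.seed(0)

def linear(x, w, b):
    # y = W x + b for a single input vector x
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def relu(v):
    return [max(0.0, u) for u in v]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(n_out, n_in):
    # Small random weights, zero biases (untrained illustration)
    w = [[random.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return w, b

# Assumed sizes: acoustic prosodic input -> hidden -> 10-dim bottleneck -> binary output
N_IN, N_HID, N_BNF = 24, 64, 10
w1, b1 = init_layer(N_HID, N_IN)
w2, b2 = init_layer(N_BNF, N_HID)
w3, b3 = init_layer(1, N_BNF)

def forward(x):
    h = relu(linear(x, w1, b1))
    bnf = relu(linear(h, w2, b2))        # 10-dim bottleneck features
    p = sigmoid(linear(bnf, w3, b3)[0])  # P(prominent)
    return bnf, p

# One frame of (synthetic) acoustic features
x = [random.gauss(0.0, 1.0) for _ in range(N_IN)]
bnf, p = forward(x)
print(len(bnf), 0.0 <= p <= 1.0)
```

In this layout, training would optimize the binary cross-entropy of `p` against the manual prominence labels, and the learned `bnf` vectors would then be collected over the corpus for visualization, per feature (e.g. F0 only) or for feature combinations.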
Original language: English
Journal: Annual Conference of the International Speech Communication Association (Interspeech)
Publication status: Accepted/In press - Sep 2019
MoE publication type: A4 Article in conference proceedings

Fields of Science

  • 6121 Languages
  • 6161 Phonetics

Cite this

@article{f750fbed458d4cf7ae3539c16a4ede9f,
title = "Prosodic Representations of Prominence Classification Neural Networks and Autoencoders Using Bottleneck Features",
abstract = "Prominence perception has been known to correlate with a complex interplay of the acoustic features of energy, fundamental frequency, spectral tilt, and duration. The contribution and importance of each of these features in distinguishing between prominent and non-prominent units in speech is not always easy to determine, and more so, the prosodic representations that humans and automatic classifiers learn have been difficult to interpret. This work focuses on examining the acoustic prosodic representations that binary prominence classification neural networks and autoencoders learn for prominence. We investigate the complex features learned at different layers of the network as well as the 10-dimensional bottleneck features (BNFs), for the standard acoustic prosodic correlates of prominence separately and in combination. We analyze and visualize the BNFs obtained from the prominence classification neural networks as well as their network activations. The experiments are conducted on a corpus of Dutch continuous speech with manually annotated prominence labels. Our results show that the prosodic representations obtained from the BNFs and higher-dimensional non-BNFs provide good separation of the two prominence categories, with, however, different partitioning of the BNF space for the distinct features, and the best overall separation obtained for F0.",
keywords = "6121 Languages, 6161 Phonetics",
author = "Sofoklis Kakouros and Antti Suni and Juraj Šimko and Martti Vainio",
year = "2019",
month = sep,
language = "English",
journal = "Annual Conference of the International Speech Communication Association (Interspeech)",
publisher = "ISCA - International Speech Communication Association",

}

TY - JOUR

T1 - Prosodic Representations of Prominence Classification Neural Networks and Autoencoders Using Bottleneck Features

AU - Kakouros, Sofoklis

AU - Suni, Antti

AU - Šimko, Juraj

AU - Vainio, Martti

PY - 2019/9

Y1 - 2019/9

N2 - Prominence perception has been known to correlate with a complex interplay of the acoustic features of energy, fundamental frequency, spectral tilt, and duration. The contribution and importance of each of these features in distinguishing between prominent and non-prominent units in speech is not always easy to determine, and more so, the prosodic representations that humans and automatic classifiers learn have been difficult to interpret. This work focuses on examining the acoustic prosodic representations that binary prominence classification neural networks and autoencoders learn for prominence. We investigate the complex features learned at different layers of the network as well as the 10-dimensional bottleneck features (BNFs), for the standard acoustic prosodic correlates of prominence separately and in combination. We analyze and visualize the BNFs obtained from the prominence classification neural networks as well as their network activations. The experiments are conducted on a corpus of Dutch continuous speech with manually annotated prominence labels. Our results show that the prosodic representations obtained from the BNFs and higher-dimensional non-BNFs provide good separation of the two prominence categories, with, however, different partitioning of the BNF space for the distinct features, and the best overall separation obtained for F0.

AB - Prominence perception has been known to correlate with a complex interplay of the acoustic features of energy, fundamental frequency, spectral tilt, and duration. The contribution and importance of each of these features in distinguishing between prominent and non-prominent units in speech is not always easy to determine, and more so, the prosodic representations that humans and automatic classifiers learn have been difficult to interpret. This work focuses on examining the acoustic prosodic representations that binary prominence classification neural networks and autoencoders learn for prominence. We investigate the complex features learned at different layers of the network as well as the 10-dimensional bottleneck features (BNFs), for the standard acoustic prosodic correlates of prominence separately and in combination. We analyze and visualize the BNFs obtained from the prominence classification neural networks as well as their network activations. The experiments are conducted on a corpus of Dutch continuous speech with manually annotated prominence labels. Our results show that the prosodic representations obtained from the BNFs and higher-dimensional non-BNFs provide good separation of the two prominence categories, with, however, different partitioning of the BNF space for the distinct features, and the best overall separation obtained for F0.

KW - 6121 Languages

KW - 6161 Phonetics

M3 - Conference article

JO - Annual Conference of the International Speech Communication Association (Interspeech)

JF - Annual Conference of the International Speech Communication Association (Interspeech)

ER -