Predicting Prosodic Prominence from Text with Pre-trained Contextualized Word Representations

Research output: Contribution to journal › Conference article › Scientific › Peer reviewed

Abstract

In this paper, we introduce a new natural language processing dataset and benchmark for predicting prosodic prominence from written text. To our knowledge, this is the largest publicly available dataset with prosodic labels. We describe the dataset construction and the resulting benchmark dataset in detail, and we train a number of models, ranging from feature-based classifiers to neural network systems, to predict discretized prosodic prominence. We show that pre-trained contextualized word representations from BERT outperform the other models, even with less than 10% of the training data. Finally, we discuss the dataset in light of the results and point to future research and plans for further improving both the dataset and the methods for predicting prosodic prominence from text. The dataset and the code for the models are publicly available.
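The core modelling idea, word-level prominence classification on top of pre-trained contextualized BERT representations, can be illustrated with a minimal sketch. This is not the authors' released code: the Hugging Face `transformers` usage, the three-class label set, and the first-subword pooling are assumptions made for illustration only.

```python
# Minimal sketch: per-word prosodic prominence classification with BERT.
# Assumes 3 discrete prominence classes (a hypothetical discretization).
import torch
from transformers import BertTokenizerFast, BertModel

NUM_CLASSES = 3  # assumption, e.g. non-prominent / prominent / highly prominent

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
classifier = torch.nn.Linear(bert.config.hidden_size, NUM_CLASSES)

words = "And there was a lot of snow".split()
enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

with torch.no_grad():
    hidden = bert(**enc).last_hidden_state   # (1, seq_len, hidden_size)
logits = classifier(hidden)                  # per-subword class scores
preds = logits.argmax(-1)[0]

# Map subword predictions back to words: keep the first subword of each word,
# skipping special tokens ([CLS]/[SEP]), whose word id is None.
word_ids = enc.word_ids()
word_preds = [preds[i].item() for i, w in enumerate(word_ids)
              if w is not None and (i == 0 or word_ids[i - 1] != w)]
print(list(zip(words, word_preds)))
```

With an untrained classifier head the predictions are of course arbitrary; in practice the linear layer (and optionally BERT itself) would be trained on the prominence-labelled dataset with a cross-entropy loss over the discrete classes.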
Original language: English
Journal: Nordic Conference of Computational Linguistics
Number of pages: 9
Status: Accepted/In press - 9 Aug 2019
MoE publication type: A4 Article in a conference publication

Fields of Science

  • 113 Computer and information sciences
  • 6121 Languages

Cite this

@article{e7f78b80371d4925a8b666b34de47073,
title = "Predicting Prosodic Prominence from Text with Pre-trained Contextualized Word Representations",
abstract = "In this paper we introduce a new natural language processing dataset and benchmark for predicting prosodic prominence from written text. To our knowledge this will be the largest publicly available dataset with prosodic labels. We describe the dataset construction and the resulting benchmark dataset in detail and train a number of different models ranging from feature-based classifiers to neural network systems for the prediction of discretized prosodic prominence. We show that pre-trained contextualized word representations from BERT outperform the other models even with less than 10{\%} of the training data. Finally we discuss the dataset in light of the results and point to future research and plans for further improving both the dataset and methods of predicting prosodic prominence from text. The dataset and the code for the models are publicly available.",
keywords = "113 Computer and information sciences, Natural language processing, 6121 Languages",
author = "Aarne Talman and Antti Suni and Hande Celikkanat and Sofoklis Kakouros and J{\"o}rg Tiedemann and Martti Vainio",
year = "2019",
month = "8",
day = "9",
language = "English",
journal = "Nordic Conference of Computational Linguistics",
publisher = "[s.n.]",

}
