Abstract
Text-to-speech (TTS) evaluations are habitually carried out without contextual and situational framing. Since humans adapt their speaking style to situation-specific communicative needs, such evaluations may not generalize across situations. Without clearly defined framing, it is even unclear in which situations evaluation results hold at all. We test the hypothesized impact of framing on TTS evaluation in a crowdsourced MOS evaluation of four TTS voices, systematically varying (a) the intended TTS task (domestic humanoid robot, child’s voice replacement, fiction audiobooks, and long, information-rich texts) and (b) the framing of that task. The results show that framing differentiated MOS responses, with individual TTS performance varying significantly across tasks and framings. This corroborates the assumption that decontextualized MOS evaluations do not generalize, and suggests that TTS evaluations should not be reported without stating the type of framing that was employed, if any.
Original language | English |
---|---|
Title of host publication | Proceedings of Interspeech 2024 |
Number of pages | 5 |
Place of publication | Baixas |
Publisher | ISCA - International Speech Communication Association |
Publication date | 1 Sep 2024 |
Pages | 1205-1209 |
DOIs | |
Status | Published - 1 Sep 2024 |
MoE publication type | A4 Article in conference proceedings |
Event | Interspeech - Kos, Greece. Duration: 1 Sep 2024 → 5 Sep 2024. Conference number: 25. https://interspeech2024.org/ |
Publication series
Name | Interspeech |
---|---|
Publisher | ISCA - International Speech Communication Association |
ISSN (electronic) | 2958-1796 |
Fields of science
- 6161 Phonetics
- 6164 Speech communication