Abstract
This paper presents Murreviikko, a dataset of dialectal Finnish tweets which have been dialectologically annotated and manually normalized to a standard form. The dataset can be used as a test set for dialect identification and dialect-to-standard normalization, for instance. We evaluate the dataset on the normalization task, comparing an existing normalization model built on a spoken dialect corpus and three newly trained models with different architectures. We find that there are significant differences in normalization difficulty between the dialects, and that a character-level statistical machine translation model performs best on the Murreviikko tweet dataset.
Original language | English |
---|---|
Title of host publication | Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023) : Proceedings of the Workshop |
Editors | Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, Marcos Zampieri |
Number of pages | 9 |
Place of Publication | Stroudsburg |
Publisher | The Association for Computational Linguistics |
Publication date | 5 May 2023 |
Pages | 31-39 |
ISBN (Electronic) | 978-1-959429-50-0 |
Publication status | Published - 5 May 2023 |
MoE publication type | A4 Article in conference proceedings |
Event | Workshop on NLP for Similar Languages, Varieties and Dialects - Dubrovnik, Croatia; Duration: 5 May 2023 → 6 May 2023; Conference number: 10; https://sites.google.com/view/vardial-2023 |
Fields of Science
- 6121 Languages
- 113 Computer and information sciences
Projects
- CorCoDial: Corpus-based computational dialectology: exploiting machine translation techniques to extract, visualize and interpret dialectal patterns
Scherrer, Y. (Project manager), Tiedemann, J. (Project manager), Mickus, T. (Participant), Miletic Haddad, A. (Participant), Psaltaki, E. (Participant), Roemling, D. (Participant), Siewert, J. (Participant) & Siewert, J. (Participant)
Academy of Finland (Suomen Akatemia), project invoicing
01/09/2021 → 31/08/2025
Project: Research Council of Finland: Academy Project
Datasets
- Murreviikko: an Annotated and Normalized Corpus of Dialectal Finnish Tweets
Kuparinen, O. V. (Creator), Zenodo, 2023
Dataset