Murreviikko - A Dialectologically Annotated and Normalized Dataset of Finnish Tweets

Olli Vilhelm Kuparinen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review

Abstract

This paper presents Murreviikko, a dataset of dialectal Finnish tweets which have been dialectologically annotated and manually normalized to a standard form. The dataset can be used as a test set for dialect identification and dialect-to-standard normalization, for instance. We evaluate the dataset on the normalization task, comparing an existing normalization model built on a spoken dialect corpus and three newly trained models with different architectures. We find that there are significant differences in normalization difficulty between the dialects, and that a character-level statistical machine translation model performs best on the Murreviikko tweet dataset.
Original language: English
Title of host publication: Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023) : Proceedings of the Workshop
Editors: Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, Marcos Zampieri
Number of pages: 9
Place of Publication: Stroudsburg
Publisher: The Association for Computational Linguistics
Publication date: 5 May 2023
Pages: 31-39
ISBN (Electronic): 978-1-959429-50-0
Publication status: Published - 5 May 2023
MoE publication type: A4 Article in conference proceedings
Event: Workshop on NLP for Similar Languages, Varieties and Dialects - Dubrovnik, Croatia
Duration: 5 May 2023 → 6 May 2023
Conference number: 10
https://sites.google.com/view/vardial-2023

Fields of Science

  • 6121 Languages
  • 113 Computer and information sciences
