Corpus-based computational dialectology: exploiting machine translation techniques to extract, visualize and interpret dialectal patterns

Project details

Description (abstract)

Dialectology is concerned with the study of language variation across space. While dialect atlases and dictionaries have been produced over the last 150 years for almost all linguistic areas of Europe, recent dialectological research increasingly focuses on corpus-based approaches. However, carrying out quantitative studies with dialect corpora has proven challenging because corpus data are not directly comparable. If informant A does not use word x, this does not necessarily mean that the word does not exist in A’s dialect. It may just be that A chose to talk about topics that did not require the use of word x. This project proposes a new take on corpus-based dialectology that relies on automatic normalization to provide comparability across dialects.
Normalization is defined as the annotation of every dialectal word with a canonical word form, for example the standardized spelling of the word. It disambiguates dialectal word forms and provides a basis for comparing different dialects. Automatic normalization can be viewed as a special case of machine translation. The first goal of the project is to improve current normalization methods with techniques from state-of-the-art neural machine translation.
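To make the idea of comparability concrete, the following minimal sketch shows how two informants whose surface forms never coincide become comparable once each word is annotated with a normalized form. The lookup table, the example word forms and the function names are illustrative assumptions, not project data or project code; the project itself proposes learned neural models rather than a fixed lexicon.

```python
# Hypothetical toy lexicon (an assumption for illustration only):
# dialectal spelling -> normalized, standardized spelling.
TOY_LEXICON = {
    "huus": "haus",
    "hus": "haus",
    "gsi": "gewesen",
    "gsii": "gewesen",
}


def normalize(tokens, lexicon=TOY_LEXICON):
    """Pair every token with its normalized form.

    Tokens missing from the lexicon are mapped to their lowercased selves.
    """
    return [(tok, lexicon.get(tok.lower(), tok.lower())) for tok in tokens]


def shared_normalized_forms(tokens_a, tokens_b, lexicon=TOY_LEXICON):
    """Return the normalized forms attested for both informants."""
    forms_a = {norm for _, norm in normalize(tokens_a, lexicon)}
    forms_b = {norm for _, norm in normalize(tokens_b, lexicon)}
    return forms_a & forms_b


# Two hypothetical informants with no surface form in common.
informant_a = ["Huus", "gsi"]
informant_b = ["Hus", "gsii"]

surface_overlap = {t.lower() for t in informant_a} & {t.lower() for t in informant_b}
normalized_overlap = shared_normalized_forms(informant_a, informant_b)
```

Here the surface overlap is empty, while the normalized overlap contains both words, so the two transcripts can be compared word by word.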
Normalization introduces comparability in dialect corpora. In particular, the parameters of the normalization models provide a condensed and abstract representation of the normalization process, which allows us, for example, to investigate the status of particular characters in different dialects and to test the validity of traditional dialectal classifications. The second goal of the project will thus be to extract, visualize and interpret dialectal patterns emerging from the normalization models.
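As a rough illustration of what a "condensed representation of the normalization process" might look like, the sketch below counts character-level rewrite operations between dialectal and normalized forms. It is a deliberately simplified stand-in: it uses plain string alignment from Python's standard `difflib` module rather than the parameters of a neural model, and the word pairs and function names are assumptions made for the example.

```python
from collections import Counter
from difflib import SequenceMatcher


def edit_operations(dialect_form, normalized_form):
    """List the (dialect substring, normalized substring) pairs that differ."""
    matcher = SequenceMatcher(a=dialect_form, b=normalized_form, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep replacements, insertions and deletions only
            ops.append((dialect_form[i1:i2], normalized_form[j1:j2]))
    return ops


def operation_profile(pairs):
    """Count rewrite operations over (dialect form, normalized form) pairs."""
    profile = Counter()
    for dialect_form, normalized_form in pairs:
        profile.update(edit_operations(dialect_form, normalized_form))
    return profile


# Hypothetical word pairs for one dialect (illustrative, not corpus data).
profile = operation_profile([("huus", "haus"), ("muus", "maus")])
```

Computing one such profile per dialect and comparing the counts gives a crude way to ask how a particular character behaves across dialects, which is the kind of question the model parameters are meant to answer in a more principled way.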
The third goal of the project is to investigate to what extent user-generated content (UGC), i.e. texts published by diverse users on social media platforms, contains dialectal signals. We will collect UGC data and contrast them with existing dialect corpora, again using normalization methods to provide comparability.
The experiments will initially focus on Swiss German and Finnish dialects, for which relevant resources are available. We will then extend our investigations to other dialect areas, which are yet to be defined. The results of this research, obtained through the unique combination of machine learning methods and spontaneously occurring data, will yield new visualizations of dialect landscapes, showcasing the richness of linguistic variation.

Layman's description

Dialectology is concerned with the study of language variation across space. Current dialectological datasets typically consist of interviews with informants. These interviews cannot easily be compared with each other as they differ considerably in length and content. If informant A does not use word x, this does not necessarily mean that the word does not exist in A’s dialect. It may just be that A chose to talk about topics that did not require the use of word x. This project aims to introduce comparability in dialect corpora. In particular, we will use machine translation techniques to normalize the dialect texts, i.e., to transform them to standardized spelling. However, we are interested not only in the result of this normalization process, but also in the transformation operations that the normalization model learns. These model parameters will allow us to provide new visualizations of dialect landscapes and to confirm or challenge traditional dialect classifications.
Short title: CorCoDial
Acronym: CorCoDial
Status: Active
Effective start/end date: 01/09/2021 to 31/08/2025