Language Technology Tools for Low-Resource Languages—Five Cases for Sakha, Norwegian, and Finnish

Sardana Ivanova

Research output: Thesis, Doctoral thesis, Collection of articles

Abstract

This dissertation develops language technology tools for low-resource languages. It is important to ensure that low-resource languages are not left behind in the rapidly evolving digital landscape, as language technology tools can greatly improve communication and information access for speakers of these languages. The support of low-resource languages through technology development and revitalisation efforts is essential for preserving linguistic diversity and maintaining the richness of cultural heritage.

The dissertation presents five case studies across three languages, ranging from the truly low-resource Sakha to the better-resourced Finnish and Norwegian, which still lack many of the resources available for English. Sakha is a Turkic language spoken by 0.5 million people in the Republic of Sakha in Siberia. Finnish is a Uralic language of the Finnic branch, spoken by 5.8 million people in Finland and by ethnic Finns outside Finland. Norwegian is a North Germanic language, spoken mainly in Norway by 5.32 million people. The five cases range from essential tools for Sakha, such as a morphological analyser, to higher-level tools for Norwegian and Finnish.

The contributions of the dissertation are as follows.

We developed a morphological analyser and generator for Sakha within the framework of two-level morphology. It achieves coverage above 90% and a precision of 99%. While developing the analyser, we expanded linguistic knowledge about Sakha and devised strategies for handling complex grammatical patterns.
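As an illustration of what the two figures above measure, the following is a minimal sketch, not the evaluation code used in the dissertation. It assumes coverage means the share of tokens that receive at least one analysis, and precision means the share of returned analyses that also appear in a gold standard; the abstract does not spell out either definition. The names `coverage`, `precision`, `toy_analyse`, and the toy lexicon with its tags are all hypothetical.

```python
from typing import Callable, Dict, List, Set

# An analyser maps a surface token to a set of analysis strings.
Analyser = Callable[[str], Set[str]]


def coverage(tokens: List[str], analyse: Analyser) -> float:
    """Share of tokens for which the analyser returns at least one analysis."""
    if not tokens:
        return 0.0
    analysed = sum(1 for token in tokens if analyse(token))
    return analysed / len(tokens)


def precision(gold: Dict[str, Set[str]], analyse: Analyser) -> float:
    """Share of returned analyses that also appear in the gold standard."""
    returned = 0
    correct = 0
    for token, gold_analyses in gold.items():
        predicted = analyse(token)
        returned += len(predicted)
        correct += len(predicted & gold_analyses)
    return correct / returned if returned else 0.0


# Hypothetical toy lexicon standing in for a real transducer.
# Tokens and tags are placeholders, not actual Sakha data.
TOY_LEXICON: Dict[str, Set[str]] = {
    "a": {"a<n><nom>"},
    "b": {"b<n><nom>", "b<v><imp>"},
}


def toy_analyse(token: str) -> Set[str]:
    """Look the token up in the toy lexicon; unknown tokens get no analysis."""
    return TOY_LEXICON.get(token, set())
```

With a real analyser, `toy_analyse` would be replaced by a function that queries the compiled transducer, and `tokens` would come from a running-text corpus.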

We implemented a language-learning environment for Sakha in the Revita computer-assisted language-learning platform, using the morphological analyser we developed.

We created a Turkic Interlingua corpus and trained Russian-Sakha, Sakha-Russian, English-Sakha, and Sakha-English machine translation models, as well as a multi-way neural machine translation model. We performed an extensive analysis using automatic metrics as well as human evaluations.

We created NorQuAD, the first Norwegian question-answering dataset for machine reading comprehension. The dataset consists of 4,752 manually created question-answer pairs. We benchmarked several multilingual and Norwegian monolingual language models on the dataset and compared them against human performance.

We developed a method for poetry writing applicable to many languages. We illustrated the method using Finnish as an example. The method involves generating poetry one line at a time using a sequence-to-sequence neural model that has been fine-tuned for this purpose.
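The line-at-a-time idea can be sketched as a simple loop. This is only an illustration under stated assumptions: it assumes each new line is conditioned on the previous line alone, which the abstract does not specify, and it replaces the fine-tuned sequence-to-sequence model with a placeholder function. The names `write_poem` and `toy_next_line` are hypothetical.

```python
from typing import Callable, List

# A next-line generator maps the previous line to a new line.
# In practice this would wrap a fine-tuned sequence-to-sequence model.
NextLine = Callable[[str], str]


def write_poem(first_line: str, next_line: NextLine, n_lines: int) -> List[str]:
    """Build a poem by repeatedly generating the next line from the previous one."""
    if n_lines < 1:
        return []
    poem = [first_line]
    while len(poem) < n_lines:
        poem.append(next_line(poem[-1]))
    return poem


def toy_next_line(previous: str) -> str:
    """Placeholder for the neural model: reverses the word order of the input."""
    return " ".join(reversed(previous.split()))
```

Keeping the generator behind a plain function signature means the loop stays the same regardless of which language or model is plugged in.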
Original language: English
Supervisor/Advisor
  • Toivonen, Hannu TT, Supervisor
  • Granroth-Wilding, Mark, Supervisor
Place of publication: Helsinki
Publisher
Print ISBN: 978-952-84-0104-9
Electronic ISBN: 978-952-84-0105-6
Status: Published - Mar 2024
MoE publication type: G5 Doctoral dissertation (article)

Fields of Science

  • 113 Computer and information sciences
