Abstract
GiellaLT provides an infrastructure for rule-based language technology aimed at minority and indigenous languages, and streamlines building anything from keyboards to speech technology. The GiellaLT website offers language models (transducers) for a wide range of languages. Writing documentation for each language repository is an ongoing effort and part of the development process. The author has actively participated in the development of open-source, rule-based descriptions for a majority of these Uralic target languages.
Analyzer enhancement
The GiellaLT infrastructure, with its implementation of finite-state tools, allows people working with different languages to use technological solutions that might otherwise require several years of individual development. It is here that descriptions for many of the Uralic languages have been initiated and developed, both as financed projects and as the work of language technology enthusiasts. The GiellaLT infrastructure makes it possible to reuse finite-state descriptions and even encourages it. Thus, contributing to the enhancement of the finite-state tools at GiellaLT while extending the annotation of corpora on the Language Bank of Finland's Korp server benefits the users of the search engine as well.
On this page, we will evaluate the state of development of the analyzers for individual languages in relation to the text data being annotated for the Korp search engine. This evaluation will therefore be aligned with the annotation of upcoming corpora, such as a new, extended version 2 of Parallel Bible Verses for Uralic Studies (PaBiVUS). The objective is to increase the coverage of lemmatization and of morphological and syntactic annotation, which was not previously offered for the non-majority languages in the parallel corpus. To that end, we will provide an illustrative depiction of each individual finite-state description. For the more developed descriptions, we will also show the steps that have been taken toward improvement. The result should be read as enhanced, but not complete, coverage of various genres as the work proceeds.
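As a rough illustration of what evaluating an analyzer against text data can mean in practice, the sketch below computes token coverage: the share of tokens for which an analyzer returns at least one analysis, together with a frequency list of the tokens it misses. It is a minimal sketch under stated assumptions, not the procedure used in the publication. The function `coverage`, the toy lexicon, the tag strings and the sample tokens are all hypothetical; in real use the `analyze` argument would be a wrapper around an analyzer built from a GiellaLT description.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple


def coverage(
    tokens: Iterable[str],
    analyze: Callable[[str], List[str]],
) -> Tuple[float, Counter]:
    """Return (share of tokens with at least one analysis, counts of missed tokens)."""
    total = 0
    missed: Counter = Counter()
    for token in tokens:
        total += 1
        # An empty result means the analyzer does not recognize the token.
        if not analyze(token):
            missed[token] += 1
    covered = total - sum(missed.values())
    rate = covered / total if total else 0.0
    return rate, missed


# Hypothetical toy lexicon standing in for a real finite-state analyzer.
# The words and tag strings are invented for illustration only.
TOY_LEXICON = {
    "alpha": ["alpha+N+Sg+Nom"],
    "beta": ["beta+V+Ind+Prs+Sg3", "beta+N+Sg+Nom"],
}


def toy_analyze(token: str) -> List[str]:
    """Look the lowercased token up in the toy lexicon."""
    return TOY_LEXICON.get(token.lower(), [])


sample = ["Alpha", "beta", "gamma", "alpha", "delta", "gamma"]
rate, missed = coverage(sample, toy_analyze)
print(f"coverage: {rate:.1%}")
print(missed.most_common())
```

The list of missed tokens, sorted by frequency, is the kind of output that could guide which lexical entries to add to a description first.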
The evaluations will tend to illustrate the capacities of the analyzers. The analyzers do have equivalent generators, but the possible over-productivity of these generators is not presently the focus of these evaluations. In time, attention will also be drawn to the description of the disambiguation of morphological analyses, which is made possible in the open-source GiellaLT infrastructure. The enhanced descriptions, housed in GiellaLT, will serve as a contribution by the Language Bank of Finland to the shared responsibility for improving the coverage of lesser-described languages and the NLP that addresses them. Thus, the resulting analyzers will be available for building within the GiellaLT infrastructure or through the UralicNLP Python, Java and .NET libraries, which are available through GitHub or the Language Bank of Finland.
Original language | English |
---|---|
Article number | hal-04828974 |
Journal | HAL open science |
DOIs | |
Publication status | Published - 10 Dec 2024 |
MoE publication type | Not Eligible |
Fields of Science
- 6121 Languages
- parallel corpora
- Kielipankki
- HFST
- GiellaLT
- open-source
- finite-state morphological analysis
- Erzya language
- Moksha language
- Komi-Zyrian language
- Komi-Permyak language
- Veps language
- Hill Mari language
- Meadow Mari language
- Udmurt language
- Olonets-Karelian language
- Karelian language
- Mansi language
Projects
- 1 Active
- FIN CLARIA: FIN-CLARIAH - language resources & infra
Linden, K. (Project manager), Tolonen, M. (Project manager), Axelson, E. (Participant), Dieckmann, U. (Participant), Jauhiainen, T. (Participant), Kettunen, H. (Participant), Lennes, M. (Participant), Niemi, J. (Participant), Piitulainen, J. (Participant), Rosson, D. E. (Participant), Rueter, J. (Participant), Turunen, R. J. (Participant), Vaara, V. (Participant), Vesalainen, A. J. K. (Participant) & Wang, R. (Participant)
Academy of Finland, Suomen Akatemia Projektilaskutus
01/01/2024 → 31/12/2025
Project: Research Council of Finland: Research infrastructure
Activities
- International Conference on Natural Language Processing for Digital Humanities
Hämäläinen, M. (Scientific Committee Member), Öhman, E. (Scientific Committee Member), Miyagawa, S. (Attendee), Alnajjar, K. (Attendee), Bizzoni, Y. (Scientific Committee Member), Wilbur, J. (Scientific Committee Member), Degaetano-Ortlieb, S. (Attendee), Gessler, L. (Scientific Committee Member), Leppänen, L. (Attendee), Duong, Q. Q. (Attendee), Atanassova, I. (Scientific Committee Member), Tuominen, J. (Attendee), Martinc, M. (Attendee), Janicki, M. M. (Attendee), Zhang, S. (Attendee), Pivovarova, L. (Attendee), Dmitrieva, A. (Attendee), Kanner, A. (Attendee), Hjortnæs, N. (Attendee), Cho, W. I. (Scientific Committee Member), Shoemaker, T. (Scientific Committee Member), Manjavacas, E. (Scientific Committee Member), Iwatsuki, K. (Scientific Committee Member), Rubinstein, A. (Scientific Committee Member), Arnold, F. (Scientific Committee Member), Clerice, T. (Scientific Committee Member), Gutehrlé, N. (Attendee), Alqazlan, L. (Scientific Committee Member), Balázs, I. (Scientific Committee Member), Magistry, P. (Scientific Committee Member), Kawasaki, Y. (Scientific Committee Member), Antoniak, M. (Scientific Committee Member), Korre, K. (Scientific Committee Member), Teodorescu, D. (Scientific Committee Member), Dongqi, P. (Scientific Committee Member), Ligeti-Nagy, N. (Scientific Committee Member), Lahnala, A. (Scientific Committee Member), Simmons, G. (Scientific Committee Member), Hulden, V. (Scientific Committee Member), Park, J. (Scientific Committee Member), Sälevä, J. (Scientific Committee Member), Ruskov, M. (Scientific Committee Member), Song, Y. (Scientific Committee Member), Moreira, P. (Scientific Committee Member), Kurzynski, M. (Scientific Committee Member), Liimatta, A. (Attendee), Das, S. (Scientific Committee Member), Eck, S. O. (Scientific Committee Member), Nakajima, E. (Scientific Committee Member), Takagi, N. M. (Scientific Committee Member), Kawamura, K. (Scientific Committee Member), Dang, B. (Scientific Committee Member) & Rueter, J. (Scientific Committee Member)
16 Nov 2024
Activity: Participating in or organising an event types › Organisation and participation in conferences, workshops, courses, seminars
- ERME-PSLA 1950s, New corpora
Rueter, J. (Speaker), Erina, O. (Speaker) & Kabaeva, N. (Speaker)
29 Nov 2024
Activity: Talk or presentation types › Oral presentation
- International Workshop on Computational Linguistics for Uralic Languages
Rueter, J. (Member of organizing committee)
28 Nov 2024 → 29 Nov 2024
Activity: Participating in or organising an event types › Organisation and participation in conferences, workshops, courses, seminars