Working Towards Digital Documentation of Uralic Languages With Open-Source Tools and Modern NLP Methods

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Scientific › peer-review


We present our work towards building an infrastructure for documenting endangered languages, with a focus on Uralic languages in particular. Our infrastructure consists of tools for writing dictionaries whose entries are structured in XML. These dictionaries are the foundation for rule-based NLP tools such as finite-state transducers (FSTs). We also work actively to enhance these dictionaries and tools with state-of-the-art neural models, generating the training data for those models from rules and lexica.
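As an illustration of the pipeline the abstract describes (XML-structured dictionary entries feeding rule-based generation of training data), the following is a minimal sketch and not the authors' implementation. The element names (`dictionary`, `entry`, `lemma`, `translation`), the tag format, and the single suffix rule are all hypothetical, since the abstract does not specify the schema or the rules. The rule simply appends `t` to form a nominative plural, which happens to work for the two Finnish sample words but ignores consonant gradation and other real morphology.

```python
import xml.etree.ElementTree as ET

# Hypothetical entry layout; the real dictionary schema is not given in the abstract.
SAMPLE_XML = """
<dictionary>
  <entry>
    <lemma pos="N">talo</lemma>
    <translation lang="eng">house</translation>
  </entry>
  <entry>
    <lemma pos="N">sana</lemma>
    <translation lang="eng">word</translation>
  </entry>
</dictionary>
"""

# Toy rules per part of speech: (analysis tag, surface suffix).
# Illustrative only; a real system would use an FST, not string concatenation.
TOY_RULES = {
    "N": [("+Pl", "t")],
}


def read_entries(xml_text):
    """Parse the XML text into a list of plain dictionaries, one per entry."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.iter("entry"):
        lemma = entry.find("lemma")
        entries.append(
            {
                "lemma": lemma.text,
                "pos": lemma.get("pos"),
                "translations": [t.text for t in entry.findall("translation")],
            }
        )
    return entries


def generate_pairs(entries, rules):
    """Produce (surface form, analysis) pairs usable as model training data."""
    pairs = []
    for e in entries:
        for tag, suffix in rules.get(e["pos"], []):
            analysis = f"{e['lemma']}+{e['pos']}{tag}"
            surface = e["lemma"] + suffix
            pairs.append((surface, analysis))
    return pairs


if __name__ == "__main__":
    for surface, analysis in generate_pairs(read_entries(SAMPLE_XML), TOY_RULES):
        print(f"{surface}\t{analysis}")
```

Pairs of this kind (surface form and analysis) could serve as synthetic input for a neural model, under the assumption that the rules and lexicon cover the forms of interest.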
Original language: English
Title of host publication: Proceedings of the Big Picture Workshop
Editors: Yanai Elazar, Allyson Ettinger, Norea Kassner, Sebastian Ruder, Noah A. Smith
Number of pages: 10
Place of Publication: Stroudsburg
Publisher: The Association for Computational Linguistics
Publication date: 2023
ISBN (Electronic): 979-8-89176-051-6
Publication status: Published - 2023
MoE publication type: A4 Article in conference proceedings
Event: The Big Picture Workshop, Singapore
Duration: 7 Dec 2023 → 7 Dec 2023

Fields of Science

  • 6121 Languages
  • 113 Computer and information sciences
  • Language facilitator

    Trond Trosterud (Consultant), Sjur Moshagen (Consultant), Jack Rueter (Consultant), Lene Antonsen (Consultant), Heli Uibo (Consultant), Ciprian Gerstenberger (Consultant), Marina Fedina (Consultant), Heiki-Jaan Kaalep (Consultant) & Valts Ernstreits (Consultant)

    Aug 2004 → …

    Activity: Consultancy types › Consultancy
