Lemmatization Experiments on Two Low-Resourced Languages: Low Saxon and Occitan

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > Scientific > peer-review


We present lemmatization experiments on the unstandardized low-resourced languages Low Saxon and Occitan using two machine-learning-based approaches represented by MaChAmp and Stanza. We show different ways to increase training data by leveraging historical corpora, small amounts of gold data and dictionary information, and discuss the usefulness of this additional data. In the results, we find some differences in the performance of the models depending on the language. This variation is likely to be partly due to differences in the corpora we used, such as the amount of internal variation. However, we also observe common tendencies, for instance that sequential models trained only on gold-annotated data often yield the best overall performance and generalize better to unknown tokens.
Original language: English
Title of host publication: Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023): Proceedings of the Workshop
Editors: Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, Marcos Zampieri
Number of pages: 11
Place of Publication: Stroudsburg
Publisher: The Association for Computational Linguistics
Publication date: 5 May 2023
ISBN (Electronic): 978-1-959429-50-0
Publication status: Published - 5 May 2023
MoE publication type: A4 Article in conference proceedings
Event: Workshop on NLP for Similar Languages, Varieties and Dialects - Dubrovnik, Croatia
Duration: 5 May 2023 to 6 May 2023
Conference number: 10

Fields of Science

  • 6121 Languages
  • 113 Computer and information sciences
