Outiller l'occitan: nouvelles ressources et lemmatisation

Translated title of the contribution: New resources and lemmatization experiments for Occitan

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Professional

Abstract

This paper presents recent contributions to the creation of NLP tools and resources for Occitan. Several existing resources were modified or adapted, in particular a rule-based tokenizer, a morphosyntactic lexicon and a treebank. These resources were used to train and evaluate neural lemmatization models. As part of these experiments, a large corpus based on Wikipedia (2 million tokens) was POS-tagged and lemmatized. This new resource is shared through Zenodo.
Original language: French
Title of host publication: Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN) : Volume 1: travaux de recherche originaux - articles longs
Editors: Christophe Servan, Anne Vilnat
Number of pages: 15
Place of Publication: Paris
Publisher: Association pour le Traitement Automatique des Langues
Publication date: Jun 2023
Pages: 217-231
Publication status: Published - Jun 2023
MoE publication type: D3 Professional conference proceedings
Event: Conférence sur le Traitement Automatique des Langues Naturelles - Paris, France
Duration: 5 Jun 2023 to 9 Jun 2023
Conference number: 30

Fields of Science

  • 6121 Languages
  • Occitan
  • lemmatisation
  • low-resourced languages
